CN114511086A - Strategy generation method, device and equipment

Strategy generation method, device and equipment

Info

Publication number
CN114511086A
Authority
CN
China
Prior art keywords
strategy
style
fighting
virtual object
competitor
Prior art date
Legal status
Pending
Application number
CN202210138348.1A
Other languages
Chinese (zh)
Inventor
徐博
张文韬
王燕娜
张文圣
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210138348.1A priority Critical patent/CN114511086A/en
Publication of CN114511086A publication Critical patent/CN114511086A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/042 Backward inferencing
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/807 Role playing or strategy games

Abstract

The invention discloses a strategy generation method, device and equipment. The method includes: selecting a virtual object corresponding to a preset main strategy style to fight against a competitor; predicting the competitor's fighting strategy style, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles; selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; if the game ending rule is not triggered, repeating the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains that style to fight against the competitor; and if the preset game ending rule is triggered, ending the game. In this way, the game winning rate is increased.

Description

Strategy generation method, device and equipment
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a strategy generation method, a strategy generation device, and strategy generation equipment.
Background
In real-time (instant) game scenarios, ready-made competitors are usually unavailable, so self-play must be used to learn the fighting strategy for the scenario. However, because the strategy is learned purely through self-play, few strategy styles are involved, and the resulting strategy model easily converges to a single fighting strategy, leading to poor performance.
In the prior art, the self-play method is improved by constructing three types of virtual objects, initializing the fighting strategy of each type of virtual object from historical data, and then training through reinforcement learning and interaction with the environment.
However, with the improved self-play method, on the one hand, historical data are difficult to acquire in many real-time game scenarios, so the improved self-play method cannot initialize the fighting strategies. On the other hand, the strategy model obtained by the improved self-play method cannot generate a large number of fighting strategies of different styles and different levels, and therefore cannot increase the diversity of fighting strategies.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a strategy generation method, apparatus, and device.
According to one aspect of the embodiments of the present invention, a strategy generation method is provided, including:
selecting a virtual object corresponding to a preset main strategy style to fight against a competitor;
predicting a fighting strategy style of the competitor, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
if the game ending rule is not triggered, repeatedly executing the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
and if the preset game ending rule is triggered, ending the game.
Optionally, predicting the fighting strategy style of the competitor includes:
determining a prediction parameter of the competitor for each of the at least three strategy styles;
if the highest strategy-style parameter value among the prediction parameters is greater than or equal to a preset threshold, determining that the fighting strategy style of the competitor is the strategy style with the highest parameter value among the prediction parameters;
and if the highest strategy-style parameter value among the prediction parameters is smaller than the preset threshold, determining that the fighting strategy style of the competitor is an undetected strategy style.
Optionally, selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor includes:
if the fighting strategy style is one of the at least three strategy styles, selecting the virtual object corresponding to the strategy style that restrains the fighting strategy style to fight against the competitor;
and if the fighting strategy style is the undetected strategy style, selecting the virtual object corresponding to the main strategy style to fight against the competitor.
Optionally, after selecting the virtual object corresponding to the preset main strategy style to fight against the competitor, the method further includes:
inputting the corresponding fighting information into a neural network obtained by training;
and after the game is ended, the method further includes:
storing the operation data generated during the game.
Optionally, the neural network is obtained by training through the following method:
taking pre-stored operation data as training samples;
constructing the at least three strategy styles according to the training samples;
designing the same number of population pools as the at least three strategy styles, where each population pool corresponds to one strategy style and includes at least one virtual object;
and repeatedly executing the following steps until the neural network meets a preset fight-stopping rule:
selecting each virtual object in each population pool to fight against a competitor, and obtaining the result of the fight between each virtual object and the competitor;
obtaining fight-result information of each virtual object according to the result of the fight between that virtual object and the competitor, where the fight-result information represents the winning rate of the virtual object's fights within the corresponding population pool;
and updating the parameters of each virtual object in the corresponding population pool according to the fight-result information of that virtual object.
Optionally, the population pools are divided into a main-strategy-style population pool and at least two other-strategy-style population pools, where the main-strategy-style population pool includes at least one main virtual object, and each of the at least two other-strategy-style population pools includes at least one virtual object.
Optionally, selecting each virtual object in each population pool to fight against a competitor includes:
if the fighting strategy style of the competitor is the main strategy style, selecting one of the at least two other-strategy-style population pools according to a first preset rule, and then selecting one virtual object in that population pool according to a second preset rule to fight against the competitor;
and if the fighting strategy style of the competitor is another strategy style, selecting one strategy-style population pool from all the population pools according to a third preset rule, and then selecting one virtual object in that population pool according to a fourth preset rule to fight against the competitor.
According to another aspect of the embodiments of the present invention, a strategy generation apparatus is provided, including:
a first fighting module, configured to select a virtual object corresponding to a preset main strategy style to fight against a competitor;
a prediction module, configured to predict a fighting strategy style of the competitor, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
and a second fighting module, configured to select a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; if the game ending rule is not triggered, repeatedly execute the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; and if the preset game ending rule is triggered, end the game.
According to still another aspect of the embodiments of the present invention, a computing device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the strategy generation method described above.
According to a further aspect of the embodiments of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the operations corresponding to the strategy generation method described above.
According to the scheme provided by the embodiments of the present invention, a virtual object corresponding to a preset main strategy style is selected to fight against a competitor; the competitor's fighting strategy style is predicted, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles; a virtual object whose strategy style restrains the fighting strategy style is selected to fight against the competitor; if the game ending rule is not triggered, the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains that style are repeated; and if the preset game ending rule is triggered, the game is ended. A training model capable of coping with various playing styles is thus constructed, and this model can increase the game winning rate.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments of the present invention may be more clearly understood and implemented according to the content of this description, and that the above and other objects, features, and advantages of the embodiments of the present invention may be more readily understood, detailed descriptions of the embodiments of the present invention are provided below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a strategy generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the data flow of a strategy generation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a neural network training method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the data flow of neural network training according to an embodiment of the present invention;
FIG. 5 is a diagram of a neural network training algorithm framework according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a strategy generation apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of a strategy generation method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 11: selecting a virtual object corresponding to a preset main strategy style to fight against a competitor;
Step 12: predicting a fighting strategy style of the competitor, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
Step 13: selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
Step 14: if the game ending rule is not triggered, repeatedly executing the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
Step 15: if the preset game ending rule is triggered, ending the game.
In this embodiment, a virtual object corresponding to a preset main strategy style is selected to fight against a competitor; the competitor's fighting strategy style is predicted, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles; a virtual object whose strategy style restrains the fighting strategy style is selected to fight against the competitor; if the game ending rule is not triggered, these prediction and selection steps are repeated; and if the preset game ending rule is triggered, the game is ended. A training model capable of coping with various playing styles is thus constructed, and this model can increase the game winning rate.
In step 11, there are at least three preset strategy styles, including a main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles. Because the game is played in real time, the competitor may be encountered for the first time and its strategy style is unknown; therefore, in the initial stage of the game, a preset main virtual object is used to fight against the competitor. This virtual object has strong comprehensive ability and a high average winning rate against different strategy styles.
In steps 14 and 15, the game ending rule can be preset, for example, that the fight between the two parties' virtual objects has been completed, or that a set game time has elapsed, but the rule is not limited to these.
In an alternative embodiment of the present invention, step 12 may include:
Step 121: determining a prediction parameter of the competitor for each of the at least three strategy styles;
Step 122: if the highest strategy-style parameter value among the prediction parameters is greater than or equal to a preset threshold, determining that the fighting strategy style of the competitor is the strategy style with the highest parameter value among the prediction parameters;
Step 123: if the highest strategy-style parameter value among the prediction parameters is smaller than the preset threshold, determining that the fighting strategy style of the competitor is an undetected strategy style.
As shown in Fig. 2, in this embodiment, after fighting against the competitor, the strategy style of the competitor can be predicted from the competitor's strategy-style trajectory; during prediction, the preset thresholds corresponding to different strategy styles may differ.
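As a minimal illustration of steps 121 to 123, the following Python sketch (all names are hypothetical and not taken from the patent) returns the strategy style whose predicted parameter is highest, provided that value reaches the per-style threshold, and otherwise reports an undetected style; as noted above, the thresholds may differ per style.

```python
def predict_style(prediction_params, thresholds):
    """prediction_params: dict mapping style -> predicted parameter value.
    thresholds: dict mapping style -> preset threshold (may differ per style)."""
    best_style = max(prediction_params, key=prediction_params.get)
    if prediction_params[best_style] >= thresholds[best_style]:
        return best_style          # step 122: detected strategy style
    return "undetected"            # step 123: undetected strategy style

# Values from the first round of the example below: the C style is detected.
params = {"main": 40, "A": 50, "B": 60, "C": 70}
print(predict_style(params, {style: 50 for style in params}))  # -> "C"
```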
In yet another alternative embodiment of the present invention, step 13 may include:
Step 131: if the fighting strategy style is one of the at least three strategy styles, selecting the virtual object corresponding to the strategy style that restrains the fighting strategy style to fight against the competitor;
Step 132: if the fighting strategy style is the undetected strategy style, selecting the virtual object corresponding to the main strategy style to fight against the competitor.
In this embodiment, if the strategy style of the competitor is not detected, the virtual object corresponding to the main strategy style is selected to fight against the competitor, which reduces the risk of the fight and improves the winning rate.
In the following, a specific game scenario is taken as an example to describe how to fight against a competitor after predicting the competitor's fighting strategy style; the method is not limited to this game scenario. For example:
Assume a game scenario in which the game ending rule is that three rounds of fighting are carried out, and the scenario has four strategy styles: a main strategy style, an A strategy style, a B strategy style, and a C strategy style. The preset threshold of each strategy style is 50. Each strategy style corresponds to at least one virtual object, and each virtual object corresponds to exactly one strategy style: the virtual object with the best parameters for the main strategy style is the main virtual object, the virtual object with the best parameters for the A strategy style is the A virtual object, the virtual object with the best parameters for the B strategy style is the B virtual object, and the virtual object with the best parameters for the C strategy style is the C virtual object. The A strategy style restrains the B strategy style, the B strategy style restrains the C strategy style, the C strategy style restrains the A strategy style, and the main strategy style is evenly matched with the A, B, and C strategy styles.
In the first round, our party uses the main virtual object to fight against the competitor. During the fight, the prediction parameters of the competitor's fighting strategy style are: main strategy style 40, A strategy style 50, B strategy style 60, C strategy style 70. Because the highest strategy-style parameter value among the prediction parameters belongs to the C strategy style and its value 70 is greater than the preset threshold 50, the fighting strategy style of the competitor is determined to be the C strategy style.
In the second round, because the competitor's strategy style was predicted to be the C strategy style in the first round, our party uses the B virtual object to fight against the competitor. During the second round, the prediction parameters of the competitor's fighting strategy style are: main strategy style 10, A strategy style 20, B strategy style 30, C strategy style 40. Because the highest strategy-style parameter value among the prediction parameters belongs to the C strategy style but its value 40 is less than the preset threshold 50, the fighting strategy style of the competitor is determined to be the undetected strategy style.
In the third round, because the competitor's fighting strategy style was predicted to be the undetected strategy style in the second round, our party uses the main virtual object to fight against the competitor. At this point, the game ending rule is triggered and the game ends.
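The three-round example above can be summarized in the following sketch; it is an illustrative reconstruction rather than code from the patent, and the predictor stub simply replays the predicted styles given in the example (reusing the hypothetical predict_style logic sketched earlier).

```python
# Which of our styles restrains each opponent style: A restrains B, B restrains C, C restrains A.
RESTRAINS = {"B": "A", "C": "B", "A": "C"}

def choose_own_style(predicted_style):
    """Steps 131/132: counter a detected style, otherwise fall back to the main style."""
    return "main" if predicted_style == "undetected" else RESTRAINS[predicted_style]

def play_game(num_rounds, predict_round):
    """predict_round(round_index) -> predicted competitor style observed during that round."""
    own_style = "main"                           # round 1 always uses the main virtual object
    for rnd in range(1, num_rounds + 1):
        print(f"round {rnd}: fighting with the {own_style} virtual object")
        own_style = choose_own_style(predict_round(rnd))
    # the preset game ending rule (three rounds) is triggered here, and the game ends

# Replaying the example: round 1 predicts the C style, round 2 predicts an undetected style.
play_game(3, lambda rnd: {1: "C", 2: "undetected"}.get(rnd, "undetected"))
```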
In a further alternative embodiment of the present invention, after step 11, the method may further include:
Step 111: inputting the corresponding fighting information into a neural network obtained by training;
and after step 15, the method further includes:
Step 151: storing the operation data generated during the game.
In this embodiment, the operation data are stored in a data buffer. The data buffer supports parallel data storage, as well as data storage, computation, and sampling in a parallel environment, and stores data as matrices to accelerate computation.
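A minimal sketch of such a matrix-style buffer, assuming a vectorized (parallel) environment and pre-allocated NumPy arrays; the field names and shapes are illustrative and not specified by the patent.

```python
import numpy as np

class MatrixBuffer:
    """Stores transitions from num_envs parallel environments in pre-allocated matrices,
    so that insertion and sampling reduce to plain array operations."""

    def __init__(self, capacity, num_envs, obs_dim, act_dim):
        self.obs = np.zeros((capacity, num_envs, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, num_envs, act_dim), dtype=np.float32)
        self.rew = np.zeros((capacity, num_envs), dtype=np.float32)
        self.done = np.zeros((capacity, num_envs), dtype=bool)
        self.capacity, self.ptr, self.full = capacity, 0, False

    def add(self, obs, act, rew, done):
        """Store one step from every parallel environment at once."""
        self.obs[self.ptr], self.act[self.ptr] = obs, act
        self.rew[self.ptr], self.done[self.ptr] = rew, done
        self.ptr = (self.ptr + 1) % self.capacity
        self.full = self.full or self.ptr == 0

    def sample(self, batch_size):
        """Uniformly sample stored time steps (each step contains all parallel environments)."""
        high = self.capacity if self.full else self.ptr
        idx = np.random.randint(0, high, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.done[idx]
```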
Fig. 3 is a flowchart of a neural network training method provided by an embodiment of the present invention. As shown in Fig. 3, the neural network is trained by the following method:
Step 31: taking pre-stored operation data as training samples;
Step 32: constructing at least three strategy styles according to the training samples;
Step 33: designing the same number of population pools as the at least three strategy styles, where each population pool corresponds to one strategy style and includes at least one virtual object;
Step 34: repeatedly executing the following steps until the neural network meets a preset fight-stopping rule:
Step 35: selecting each virtual object in each population pool to fight against a competitor, and obtaining the result of the fight between each virtual object and the competitor;
Step 36: obtaining fight-result information of each virtual object according to the result of the fight between that virtual object and the competitor, where the fight-result information represents the winning rate of the virtual object's fights within the corresponding population pool;
Step 37: updating the parameters of each virtual object in the corresponding population pool according to the fight-result information of that virtual object.
As shown in Fig. 4, a specific game scenario is taken as an example to explain how the neural network is trained. In this scenario, the participants are divided into two sides, our party and the competitor; each side has the same number of virtual objects, and the two sides are in a non-cooperative, competitive relationship whose goals are to occupy resources, capture target points, or defeat the opposing side. All the virtual objects of our party or of the competitor need to cooperate to achieve the game goal. In addition, the scenario has the following characteristics:
First, incomplete information: in board and card games, the state of the entire environment can be observed in real time, but in a real-time game scenario, complete environmental information cannot be acquired, and the virtual object must be operated according to incomplete information;
Second, sparse rewards: a game lasts a long time, and the final game reward is only fed back when the game ends, yet the virtual object must take an action in real time at every step;
Third, a large action space: at any moment, multiple virtual objects need to be controlled, each virtual object has different action characteristics, and the number of action combinations grows with the complexity of the environment;
Fourth, competitor information is difficult to obtain: information about real competitors is hard to acquire, so competitors of different styles are simulated in the training stage so that our party's strategy can overcome different competitors, improving the robustness of our strategy;
Fifth, no data for supervised learning: no supervised-learning data are available, so a supervised learning algorithm cannot be used to initialize the policy network parameters.
In this scenario, the specific training steps may include:
Step one: constructing a strategy style system. The style system can be constructed according to rules such as the game objective or the game form, for example a defensive strategy style, an offensive strategy style, and a neutral strategy style, but is not limited to these.
Step two: establishing a competitive league, in which every virtual object exists in a form that allows it to fight against a competitor. The league further has the following seven functions or purposes:
first, establishing the game-participant leagues, including our party's league and the competitor's league;
second, constructing the same number of population pools for our party according to the style system;
third, each style population pool holds a plurality of virtual objects;
fourth, supporting pairwise games between virtual objects;
fifth, establishing a matching mechanism that decides when and which two virtual objects are selected from our party's league and the competitor's league to fight;
sixth, recording the result of each round of fighting;
and seventh, establishing a scoring mechanism that scores each virtual object according to the fight results.
Step three: establishing a league information base, which stores or maintains information about each virtual object, for example establishing a unique identifier for each virtual object, storing the scoring result of each virtual object, and maintaining a global hash table for fast access to the identifier and score of each virtual object model, but is not limited to these.
Step four: starting the league service program.
Step five: starting each virtual object's service program.
Step six: initializing the network models; the network model of each virtual object of each strategy style needs to be initialized before entering the league.
Step seven: matching competitors. The current latest virtual object requests a matched competitor from the league, the league returns a competitor according to the fight matching rule of step 35, and the two sides then fight. After the fight ends, the fight result is sent to the league, and the virtual object waits for the league to return the next matched opponent. Meanwhile, the fighting data are stored in an experience base for training.
Step eight: updating the strategy indexes. After the league receives the fight-result information of a virtual object, it updates the scores of all network models in all population pools in the league information base.
Step nine: looping the game. First, competitor match requests and fights continue until the condition for saving a model is reached, at which point model information is sent to the league and the model is stored into its population pool; second, the league receives the model-saving request, establishes a scoring record for the model, and updates the scoring table; finally, the loop iterates continuously until the game training ending rule is reached.
In this scenario, the training steps design the multi-strategy-style system, evolutionary learning framework, competitor matching, evaluation indexes, environment randomization, and reinforcement learning module involved in the strategy evolution method based on multi-style league learning, so as to realize strategy generation in real-time confrontation scenarios.
Steps one to nine above merely describe the training process for the corresponding game scenario and do not limit all training processes in the embodiments of the present invention.
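The league bookkeeping of steps two, three, eight, and nine might look like the following sketch: a registry that assigns each saved model a unique identifier, keeps per-style population pools, and maintains a global table of scores. The class name, the uniform initial score, and the simple win/loss score update are illustrative assumptions, not details from the patent.

```python
import itertools

class LeagueRegistry:
    """League information base: unique model IDs, per-style population pools, and a global score table."""

    _ids = itertools.count()

    def __init__(self, styles):
        self.pools = {style: [] for style in styles}   # style -> list of saved model IDs
        self.scores = {}                               # global hash table: model ID -> score

    def save_model(self, style, initial_score=0.0):
        """Step nine: a virtual object asks the league to save its current model into its pool."""
        model_id = next(self._ids)
        self.pools[style].append(model_id)
        self.scores[model_id] = initial_score
        return model_id

    def report_result(self, winner_id, loser_id, delta=1.0):
        """Step eight: update the scoring table after a fight result is received."""
        self.scores[winner_id] += delta
        self.scores[loser_id] -= delta

league = LeagueRegistry(["main", "aggressive", "defensive", "neutral"])
a = league.save_model("main")
b = league.save_model("aggressive")
league.report_result(a, b)
print(league.scores)  # {0: 1.0, 1: -1.0}
```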
Fig. 5 shows a schematic diagram of a neural network training algorithm framework provided by an embodiment of the present invention, which includes a multi-style design framework, an evolutionary learning framework, a competitor matching framework, an evaluation index framework, an environment random initialization framework, a reinforcement learning framework, and an adaptive gaming framework.
Multi-style design framework: because real game scenarios have the characteristic that the competitor is uncertain, competitors of different strategy styles need to be simulated during training, and these competitors are used to improve the stability of the strategy model. In this framework, style objective functions are designed in the league training of self-play according to the established strategy style system. For example, in the game scenario above, a defensive strategy style, a neutral strategy style, an aggressive strategy style, and a main strategy style are designed, taking capturing target points, preserving our own strength, and attacking the enemy as the final goals of the game scenario. The main-strategy-style model is compatible with the other strategy styles and is a relatively comprehensive strategy style, while each of the other strategy styles focuses on its own style of play.
In this framework, a style standard system is constructed, style optimization objectives can be flexibly designed according to requirements, competitors with diverse styles can be handled, and the diversity of our own strategy styles can be increased.
Evolutionary learning framework: first, starting each virtual object's service: a main virtual object, an aggressive virtual object, a defensive virtual object, and a balanced virtual object;
second, initializing each virtual object's model and uploading it to the strategy model pool in the league, which is stored using a distributed object storage cluster;
third, initializing the league strategy model pool, establishing a scoring record and a unique identifier for each virtual object, and maintaining global information for fast access to each model's identifier and score;
fourth, after a virtual object has initialized its model, the model is called as the virtual object strategy of both our party and the opposing party to carry out self-play;
fifth, the two sides continuously interact with the environment until a virtual object reaches the condition for saving its model, and then send information to the league indicating that the game between the two sides has ended and the model needs to be saved; the saving request includes the model ID and the corresponding storage path obtained in the previous steps;
sixth, the league receives the model-saving request, establishes a new unique ID for the model, locates the directory where the model is stored, establishes a scoring record for the model, and updates the model's evaluation index;
seventh, iterating continuously: the virtual object again requests a competitor match from the league and waits for the league to return the next matched opponent;
and eighth, evolving the latest virtual object model of each style population pool according to steps one to seven, and continuously updating the population pools.
In steps one to eight, because of the rapid data interaction, deep reinforcement learning requires continuous exploration and trial-and-error with the environment, and the training period is long. An efficient evolutionary learning framework is therefore very effective and can greatly reduce the training time.
Competitor matching framework: for example, when a virtual object is randomly selected to fight against the competitor, if the competitor's fighting strategy style is the main strategy style, the virtual objects of all of our party's strategy styles can be given the same probability of fighting the competitor; if the competitor's fighting strategy style is another strategy style, the virtual object corresponding to the main strategy style can be selected with a probability of 60%, and virtual objects corresponding to the other strategy styles with a probability of 40%.
After the strategy style to be used by our party is determined, one of the latest N models is randomly selected with a probability of 70%, and the virtual object with the highest score is selected, based on the models' scoring parameters, with a probability of 30%, where N is a positive integer greater than or equal to 1.
After each fight, the confrontation data are also stored in our party's experience base in the data buffer for training.
In this framework, a competitor matching mechanism is designed for the training stage, so that representative competitors can be efficiently selected for games and game performance can be improved rapidly.
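A sketch of the matching probabilities described above (equal probability over our styles against a main-style competitor; otherwise 60% the main style versus 40% the other styles; then 70% a random one of the latest N models versus 30% the highest-scored model). The pool representation and the even split across the other styles are assumptions made for illustration.

```python
import random

def match_own_model(opponent_style, pools, scores, latest_n=5, rng=random):
    """pools: dict mapping style -> list of model IDs (oldest first); scores: model ID -> score."""
    styles = list(pools)
    if opponent_style == "main":
        style = rng.choice(styles)                     # equal probability over all of our styles
    elif rng.random() < 0.6:
        style = "main"                                 # 60%: the main strategy style
    else:
        style = rng.choice([s for s in styles if s != "main"])   # 40%: one of the other styles
    pool = pools[style]
    if rng.random() < 0.7:
        return rng.choice(pool[-latest_n:])            # 70%: a random one of the latest N models
    return max(pool, key=scores.get)                   # 30%: the model with the highest score
```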
Evaluation index framework: TrueSkill, a scoring system based on Bayesian inference, can be used as the evaluation index system. TrueSkill can replace the traditional Elo score and is mainly used in multiplayer games. The TrueSkill scoring system takes the uncertainty of a player's level into account, jointly considering the player's winning rate and possible fluctuations in level. As a player plays more games, the system adjusts the score to better reflect the player's level, even if the winning rate is unchanged. Compared with an Elo evaluation system, adopting the TrueSkill scoring standard has the following advantages:
First, it is suitable for complex team compositions and is more general;
Second, the Elo system is only suitable for 1v1, whereas the TrueSkill system is not limited to 1v1;
Third, the TrueSkill system has a more complete modeling framework and is easy to extend;
Fourth, the TrueSkill system converges quickly and can judge a player's level without many games.
Environment random initialization framework: in the training stage, during environment initialization, different force deployments, different formations, different initial positions, and the like can be set at random to increase the diversity of the environment initialization conditions, so that our party's model can play against diverse competitors and its robustness is increased.
Reinforcement learning framework: after our party's model and the competitor's model are selected, the game process begins, and this process is modeled with reinforcement learning. A common model of reinforcement learning is the standard Markov Decision Process (MDP), defined as a quadruple (S, A, P, R), where S is the set of environment states; A is the action set; P is the transition probability, which defines how the environment's state changes in response to actions; and R is the reward function, which defines the reward that may be obtained from the environment after the virtual object performs an action.
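In this standard formulation, the virtual object seeks a policy (denoted π below) that maximizes the expected cumulative reward; the discount factor γ is the usual convention and is not part of the patent's (S, A, P, R) quadruple:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right],
\qquad s_{t+1} \sim P(\,\cdot \mid s_t, a_t), \quad a_t \sim \pi(\,\cdot \mid s_t).
```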
A policy is a mapping from states S to actions A, usually denoted by the symbol π; given a state s, the policy is a distribution over the action set. Reinforcement learning is one of the core parts of evolutionary learning, which is realized through continuous iterative updates of reinforcement learning. The reinforcement learning module can be designed based on existing reinforcement learning algorithms. The common contents of the reinforcement learning module are as follows:
Environment: the engine in which reinforcement learning runs; when the virtual object applies an action to the environment, the environment returns a state and a reward;
State space design: after the virtual object obtains the raw state features from the environment, the state space can be designed and the raw features processed;
Action space design: according to the characteristics of the game participants, the virtual object's action space can be designed so that the actions applied to the environment are more effective;
Reward shaping: after the virtual object obtains the raw reward features from the environment, reward transformations can be designed according to the reinforcement objective to process the raw reward features. For example, in the game scenario above, the reward of the main virtual object can be designed as r_occupy + r_remain + r_attack, where r_occupy is the score for occupying target points, r_remain is the score for preserving our own strength, and r_attack is the score for attacking the opposing force. The reward of the aggressive virtual object can be designed as r_occupy + r_remain + r_attack + α_aggressive·r_own_blood − β_aggressive·r_enemy_blood; because the aggressive strategy style is biased toward attack, the change in blood volume is added as an assessment dimension, where r_own_blood is the change in blood volume of our party's virtual object from the previous time t−1 to the current time t, r_enemy_blood is the change in blood volume of the competitor's virtual object from time t−1 to time t, α_aggressive is the weight of our party's blood volume in the aggressive strategy style, and β_aggressive is the weight of the competitor's blood volume in the aggressive strategy style. The reward of the defensive virtual object can be designed as r_occupy + r_remain + r_attack + α_defensive·r_own_blood − β_defensive·r_enemy_blood, where α_defensive is the weight of our party's blood volume and β_defensive is the weight of the competitor's blood volume in the defensive strategy style. The reward of the neutral virtual object can be designed as r_occupy + r_remain + r_attack + α_balance·r_own_blood − β_balance·r_enemy_blood, where α_balance is the weight of our party's blood volume and β_balance is the weight of the competitor's blood volume in the neutral strategy style. Because the neutral style balances attack and defense, the meaning of its parameters is the same as in the other two styles, but the weights on r_own_blood and r_enemy_blood differ, for example: α_aggressive = 0.2, β_aggressive = 5; α_defensive = 5, β_defensive = 0.2; α_balance = 1, β_balance = 1. The aggressive virtual object's blood volume changes greatly and its casualties are heavy, so β_aggressive > α_aggressive; the defensive virtual object tends to preserve our own strength rather than attack the enemy, so α_defensive > β_defensive; and the neutral virtual object balances attack and defense, so α_balance = β_balance.
The reward shaping above is only an illustration for the corresponding game scenario and does not limit any other game scenario; the specific reward shaping and weights can be modified according to the actual game scenario.
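A sketch of the shaped rewards above, using the example weights; the sign convention (our own blood change added, the competitor's subtracted) follows the reconstruction of the formulas given above, and both the weights and the convention are meant to be adapted to the actual game scenario.

```python
# Style-specific weights from the example: the aggressive style weights damage dealt more heavily,
# the defensive style weights preserving our own strength, and the neutral style balances the two.
STYLE_WEIGHTS = {              # (alpha: our own blood change, beta: competitor's blood change)
    "main":       (0.0, 0.0),
    "aggressive": (0.2, 5.0),
    "defensive":  (5.0, 0.2),
    "neutral":    (1.0, 1.0),
}

def shaped_reward(style, r_occupy, r_remain, r_attack, own_blood_delta, enemy_blood_delta):
    """own_blood_delta / enemy_blood_delta: blood volume at time t minus blood volume at time t-1
    (negative when blood is lost)."""
    alpha, beta = STYLE_WEIGHTS[style]
    return r_occupy + r_remain + r_attack + alpha * own_blood_delta - beta * enemy_blood_delta

# Dealing 10 blood of damage without taking any is worth far more to the aggressive style.
print(shaped_reward("aggressive", 1.0, 1.0, 1.0, 0.0, -10.0))  # 53.0
print(shaped_reward("defensive",  1.0, 1.0, 1.0, 0.0, -10.0))  # 5.0
```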
Network structure design: reinforcement learning currently relies mainly on deep reinforcement learning algorithms, which combine deep learning with traditional reinforcement learning, so different network structures can be designed according to the game objectives;
Reinforcement learning algorithm: as the key content of reinforcement learning, either a model-based algorithm, which learns and understands the environment, uses a model to simulate the environment, and obtains feedback from the simulated environment, or a model-free algorithm, which does not learn or model the environment and uses only the information given by the environment, can be adopted;
Data buffer: if the adopted reinforcement learning algorithm involves storing data from the interaction between the virtual object and the environment, a data buffer needs to be designed for data storage;
Training acceleration: reinforcement learning usually takes a long time to train from scratch to convergence. To shorten the training time, existing knowledge can be integrated into the decision-making process through an Action Mask mechanism, so that the advantages of reinforcement learning can be effectively combined with the advantages of existing knowledge to better solve large-scale decision-making problems.
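A minimal sketch of an Action Mask: rules based on existing knowledge mark actions that are invalid or clearly undesirable in the current state, and the policy's logits for those actions are suppressed before sampling. The NumPy formulation is an illustrative assumption, not the patent's implementation.

```python
import numpy as np

def masked_action_probs(logits, valid_mask):
    """logits: raw policy outputs, one per action; valid_mask: 1 for allowed actions, 0 for masked ones.
    Masked actions receive effectively zero probability, so prior knowledge narrows the decision space."""
    masked_logits = np.where(valid_mask.astype(bool), logits, -1e9)
    exp = np.exp(masked_logits - masked_logits.max())
    return exp / exp.sum()

logits = np.array([2.0, 0.5, 1.0, -1.0])
mask = np.array([1, 0, 1, 1])          # e.g. a rule forbids action 1 in the current state
print(masked_action_probs(logits, mask))
```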
Adaptive gaming framework: after reinforcement learning, the final models of the different strategy styles are obtained, forming our party's strategy model library. In a real fight, the competitor may never have been encountered before and its style is uncertain; therefore, in the initial stage of the fight, the general main virtual object can be used. This virtual object has strong comprehensive ability and a high average winning rate against different styles.
After the game has proceeded for a certain time, the strategy style of the competitor is predicted from the competitor's strategy-style trajectory. If the predicted strategy-style parameter is greater than or equal to the set threshold, a model that can restrain that style is selected from our party's model pool. If the predicted strategy-style parameter is smaller than the set threshold, the general main virtual object continues to be used in order to reduce risk. The strategy style is predicted continuously and the selected strategy style is switched continuously until the game ends.
In this framework, our party's strategy style is switched in stages during a real game, and a virtual object that restrains the competitor's strategy style can be selected adaptively based on the prediction of the competitor's fighting style, thereby increasing the game winning rate.
In this framework, a large-scale, high-level competitor strategy pool and our own strategy pool with diversified styles can be generated, and a high-level virtual object is finally produced through game playing with reinforcement learning.
In a further alternative embodiment of the present invention, the population pools are divided into a main-strategy-style population pool and at least two other-strategy-style population pools, where the main-strategy-style population pool includes at least one main virtual object, and each of the at least two other-strategy-style population pools includes at least one virtual object.
In this embodiment, the virtual objects in each strategy-style population pool are continuously updated and optimized into new virtual objects as they fight against competitors.
In yet another alternative embodiment of the present invention, selecting each virtual object in each population pool to fight against a competitor in step 35 may include:
Step 351: if the fighting strategy style of the competitor is the main strategy style, selecting one of the at least two other-strategy-style population pools according to a first preset rule, and then selecting one virtual object in that population pool according to a second preset rule to fight against the competitor;
Step 352: if the fighting strategy style of the competitor is another strategy style, selecting one strategy-style population pool from all the population pools according to a third preset rule, and then selecting one virtual object in that population pool according to a fourth preset rule to fight against the competitor.
In this embodiment, the first and third preset rules are probability distribution rules for selecting a strategy-style population pool, and the second and fourth preset rules are probability distribution rules for selecting a virtual object from the selected strategy-style population pool. There is no necessary relationship among the distribution probabilities of the first, second, third, and fourth preset rules: they may all be the same, all be different, or two or three of them may be the same, but they are not limited to these cases. When a virtual object is selected according to the preset rules to fight against a competitor, for example, if the competitor's fighting strategy style is the main strategy style, the virtual objects of all of our party's strategy styles can be assigned the same probability of fighting the competitor; if the competitor's fighting strategy style is another strategy style, a probability of 60% can be assigned to the virtual object corresponding to the main strategy style and 40% to the virtual objects corresponding to the other strategy styles, but the assignment probabilities of the preset rules are not limited to these.
After the strategy style to be used by our party is determined, a virtual object corresponding to that strategy style is selected according to the distribution probability of the preset rule, for example: one of the latest N models is randomly selected with a probability of 70%, and the virtual object with the highest score is selected, based on the models' scoring parameters, with a probability of 30%, where N is a positive integer greater than or equal to 1; however, the assignment probabilities of the preset rules are not limited to these.
After each fight, the confrontation data are also stored in our party's experience base in the data buffer for training.
In the above embodiments of the present invention, on the one hand, a high-level main virtual object is constructed and trained that can cope with virtual objects of various strategy styles. On the other hand, virtual objects of different strategy styles are trained; when fighting the competitor, the competitor's style is predicted in real time, and a virtual object capable of restraining the competitor's style is selected. These two approaches greatly improve the ability of the evolved model and increase the game winning rate. In addition, different force deployments can be set when the environment is initialized, increasing deployment diversity and further improving the generalization and winning rate of the model.
Fig. 6 is a schematic structural diagram of a strategy generation apparatus 60 according to an embodiment of the present invention. As shown in Fig. 6, the apparatus includes:
a first fighting module 61, configured to select a virtual object corresponding to a preset main strategy style to fight against a competitor;
a prediction module 62, configured to predict a fighting strategy style of the competitor, where the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles include the main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
and a second fighting module 63, configured to select a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; if the game ending rule is not triggered, repeatedly execute the steps of predicting the competitor's fighting strategy style and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; and if the preset game ending rule is triggered, end the game.
Optionally, the prediction module 62 is further configured to determine a prediction parameter of the competitor for each of the at least three strategy styles;
if the highest strategy-style parameter value among the prediction parameters is greater than or equal to a preset threshold, determine that the fighting strategy style of the competitor is the strategy style with the highest parameter value among the prediction parameters;
and if the highest strategy-style parameter value among the prediction parameters is smaller than the preset threshold, determine that the fighting strategy style of the competitor is an undetected strategy style.
Optionally, the second fighting module 63 is further configured to, if the fighting strategy style is one of the at least three strategy styles, select the virtual object corresponding to the strategy style that restrains the fighting strategy style to fight against the competitor;
and if the fighting strategy style is the undetected strategy style, select the virtual object corresponding to the main strategy style to fight against the competitor.
Optionally, the second fighting module 63 is further configured to input the corresponding fighting information into the trained neural network;
and to store the operation data generated during the game.
Optionally, the neural network is obtained by training through the following method:
taking pre-stored operation data as training samples;
constructing the at least three strategy styles according to the training samples;
designing the same number of population pools as the at least three strategy styles, where each population pool corresponds to one strategy style and includes at least one virtual object;
and repeatedly executing the following steps until the neural network meets a preset fight-stopping rule:
selecting each virtual object in each population pool to fight against a competitor, and obtaining the result of the fight between each virtual object and the competitor;
obtaining fight-result information of each virtual object according to the result of the fight between that virtual object and the competitor, where the fight-result information represents the winning rate of the virtual object's fights within the corresponding population pool;
and updating the parameters of each virtual object in the corresponding population pool according to the fight-result information of that virtual object.
Optionally, the population pools are divided into a main-strategy-style population pool and at least two other-strategy-style population pools, where the main-strategy-style population pool includes at least one main virtual object, and each of the at least two other-strategy-style population pools includes at least one virtual object.
Optionally, the second fighting module 63 is further configured to, if the fighting strategy style of the competitor is the main strategy style, select one of the at least two other-strategy-style population pools according to a first preset rule, and then select one virtual object in that population pool according to a second preset rule to fight against the competitor;
and if the fighting strategy style of the competitor is another strategy style, select one strategy-style population pool from all the population pools according to a third preset rule, and then select one virtual object in that population pool according to a fourth preset rule to fight against the competitor.
It should be understood that the above description of the method embodiments illustrated in fig. 1 to 5 is merely an illustration of the technical solution of the present invention by way of alternative examples, and does not limit the policy generation method related to the present invention. In other embodiments, the execution steps and the sequence of the policy generation method according to the present invention may be different from those of the above embodiments, and the embodiments of the present invention do not limit this.
It should be noted that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and all the implementations in the above method embodiment are applicable to this apparatus embodiment, and the same technical effects can be achieved.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute the policy generation method in any of the above method embodiments.
Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and a specific embodiment of the present invention does not limit a specific implementation of the computing device.
As shown in fig. 7, the computing device may include: a processor (processor), a Communications Interface (Communications Interface), a memory (memory), and a Communications bus.
Wherein: the processor, the communication interface, and the memory communicate with each other via a communication bus. A communication interface for communicating with network elements of other devices, such as clients or other servers. The processor is used for executing the program, and particularly can execute the relevant steps in the embodiment of the policy generation method for the computing device.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory is used for storing a program. The memory may comprise a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
The program may specifically be used to cause the processor to perform the strategy generation method in any of the above method embodiments. For the specific implementation of each step in the program, reference may be made to the corresponding steps and the descriptions of the corresponding units in the foregoing embodiments of the strategy generation method, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (10)

1. A strategy generation method, the method comprising:
selecting a virtual object corresponding to a preset main strategy style to fight against a competitor;
predicting a fighting strategy style of the fighting party, wherein the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles comprise a main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
if a preset game ending rule is not triggered, repeatedly executing the steps of predicting the fighting strategy style of the competitor and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor;
and if the preset game ending rule is triggered, ending the game.
2. The strategy generation method of claim 1, wherein predicting the fighting strategy style of the competitor comprises:
determining a prediction parameter of the competitor corresponding to each of the at least three strategy styles;
if the highest value among the prediction parameters is greater than or equal to a preset threshold, determining that the fighting strategy style of the competitor is the strategy style whose prediction parameter has that highest value;
and if the highest value among the prediction parameters is smaller than the preset threshold, determining that the fighting strategy style of the competitor is an undetected strategy style.
3. The strategy generation method of claim 2, wherein selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor comprises:
if the fighting strategy style is one of the at least three strategy styles, selecting the virtual object corresponding to the strategy style that restrains the fighting strategy style to fight against the competitor;
and if the fighting strategy style is the undetected strategy style, selecting the virtual object corresponding to the main strategy style to fight against the competitor.
4. The strategy generation method of claim 1, further comprising, after selecting the virtual object corresponding to the preset main strategy style to fight against the competitor:
inputting corresponding fighting information into the neural network obtained by training;
after the game is ended, the method further comprises the following steps:
and storing the operation data generated during the course of the game.
5. The strategy generation method of claim 4, wherein the neural network is trained by:
taking pre-stored operation data as training samples;
constructing the at least three strategy styles according to the training samples;
designing population pools equal in number to the at least three strategy styles, wherein each population pool corresponds to one strategy style and comprises at least one virtual object;
repeatedly executing the following steps until the neural network meets a preset fighting stopping rule:
respectively selecting each virtual object in each population pool to fight against a competitor, and obtaining the result of each fight between that virtual object and the competitor;
obtaining fight result information of each virtual object according to the result of the fight between that virtual object and the competitor, wherein the fight result information represents the win rate of the virtual object within its corresponding population pool;
and updating the parameters of each virtual object in its corresponding population pool according to the fight result information of that virtual object.
6. The strategy generation method according to claim 5, wherein the population pools are divided into a main strategy style population pool and at least two other strategy style population pools, wherein the main strategy style population pool comprises at least one main virtual object, and each of the at least two other strategy style population pools comprises at least one virtual object.
7. The strategy generation method of claim 6, wherein respectively selecting each virtual object in each population pool to fight against a competitor comprises:
if the fighting strategy style of the competitor is the main strategy style, selecting one of the at least two other strategy style population pools according to a first preset rule, and selecting one virtual object from that population pool according to a second preset rule to fight against the competitor;
and if the fighting strategy style of the competitor is one of the other strategy styles, selecting one strategy style population pool from all the population pools according to a third preset rule, and selecting one virtual object from that population pool according to a fourth preset rule to fight against the competitor.
8. A strategy generation apparatus, characterized in that the apparatus comprises:
a first fighting module, configured to select a virtual object corresponding to a preset main strategy style to fight against a competitor;
a prediction module, configured to predict a fighting strategy style of the competitor, wherein the fighting strategy style is one of at least three preset strategy styles, the at least three strategy styles comprise a main strategy style and at least two non-main strategy styles, and a restraining relationship exists between every two of the at least three strategy styles;
and a second fighting module, configured to: select a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; if a preset game ending rule is not triggered, repeatedly execute the steps of predicting the fighting strategy style of the competitor and selecting a virtual object whose strategy style restrains the fighting strategy style to fight against the competitor; and if the preset game ending rule is triggered, end the game.
9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction which, when executed, causes the processor to perform the strategy generation method of any one of claims 1 to 7.
10. A computer storage medium having stored therein at least one executable instruction which, when executed, causes a computing device to perform the strategy generation method of any one of claims 1 to 7.
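For illustration only, and not as part of the claims: the sketch below shows one way the counter-style loop of claim 1 could be organised, with a fallback to the main strategy style when the competitor's style is undetected, as in claim 3. The three style names, the restraining map, the prediction stub, and the toy observation are all assumptions.

import random

# Illustrative sketch of the claim-1 loop under assumed names; none of the
# concrete styles, the restraining map, or the stubs below are prescribed here.
MAIN_STYLE = "aggressive"
RESTRAINS = {"aggressive": "balanced",     # style -> style it restrains
             "balanced": "defensive",
             "defensive": "aggressive"}
COUNTER = {restrained: restrainer for restrainer, restrained in RESTRAINS.items()}

def predict_style(observation):
    """Stand-in for the prediction of claim 2: a style, or None if undetected."""
    return observation if observation in RESTRAINS else None

def run_game(max_rounds=100):
    own_style = MAIN_STYLE                              # open with the main style
    for _ in range(max_rounds):                         # toy game ending rule
        observation = random.choice(list(RESTRAINS) + ["unknown"])  # toy opponent data
        predicted = predict_style(observation)
        # select the virtual object whose style restrains the predicted style;
        # fall back to the main style when the style is undetected
        own_style = COUNTER[predicted] if predicted else MAIN_STYLE
    return own_style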
CN202210138348.1A 2022-02-15 2022-02-15 Strategy generation method, device and equipment Pending CN114511086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210138348.1A CN114511086A (en) 2022-02-15 2022-02-15 Strategy generation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210138348.1A CN114511086A (en) 2022-02-15 2022-02-15 Strategy generation method, device and equipment

Publications (1)

Publication Number Publication Date
CN114511086A true CN114511086A (en) 2022-05-17

Family

ID=81551681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210138348.1A Pending CN114511086A (en) 2022-02-15 2022-02-15 Strategy generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN114511086A (en)

Similar Documents

Publication Publication Date Title
Juliani et al. Obstacle tower: A generalization challenge in vision, control, and planning
CN111111204B (en) Interactive model training method and device, computer equipment and storage medium
KR20200123404A (en) Multiplayer video game matchmaking optimization
KR20180044191A (en) Multiplayer video game matchmaking system and methods
CN112016704B (en) AI model training method, model using method, computer device and storage medium
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
Zhang et al. Improving hearthstone AI by learning high-level rollout policies and bucketing chance node events
Świechowski et al. Recent advances in general game playing
WO2023138156A1 (en) Decision model training method and apparatus, device, storage medium and program product
Tang et al. A review of computational intelligence for StarCraft AI
CN114307160A (en) Method for training intelligent agent
Nam et al. Generation of diverse stages in turn-based role-playing game using reinforcement learning
Świechowski et al. Combining utility ai and mcts towards creating intelligent agents in video games, with the use case of tactical troops: Anthracite shift
CN111882072B (en) Intelligent model automatic course training method for playing chess with rules
CN112044076B (en) Object control method and device and computer readable storage medium
CN113509726A (en) Interactive model training method and device, computer equipment and storage medium
CN114511086A (en) Strategy generation method, device and equipment
CN114344889B (en) Game strategy model generation method and control method of intelligent agent in game
Goulart et al. Learning how to play bomberman with deep reinforcement and imitation learning
Nakashima et al. Designing high-level decision making systems based on fuzzy if–then rules for a point-to-point car racing game
Ozkohen et al. Learning to play donkey kong using neural networks and reinforcement learning
Armanto et al. Evolutionary Algorithm in Game–A Systematic Review
Barros et al. Exploring a large space of small games
Luo et al. RARSMSDou: Master the Game of DouDiZhu With Deep Reinforcement Learning Algorithms
França et al. CreativeStone: a Creativity Booster for Hearthstone Card Decks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination