CN114053712B - Action generation method, device and equipment of virtual object - Google Patents

Action generation method, device and equipment of virtual object

Info

Publication number
CN114053712B
CN114053712B (application CN202210048175.4A)
Authority
CN
China
Prior art keywords
virtual object
action
training
characteristic information
virtual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210048175.4A
Other languages
Chinese (zh)
Other versions
CN114053712A (en)
Inventor
徐博
王燕娜
张鸿铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210048175.4A priority Critical patent/CN114053712B/en
Publication of CN114053712A publication Critical patent/CN114053712A/en
Application granted granted Critical
Publication of CN114053712B publication Critical patent/CN114053712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55: Controlling game characters or game objects based on the game progress
    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80: Special adaptations for executing a specific game genre or game mode
    • A63F13/837: Shooting of targets
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, an apparatus, and a device for generating actions of virtual objects. The method comprises the following steps: acquiring feature information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group; mapping the feature information of the plurality of virtual objects into the feature information of one total virtual object; obtaining a first policy action of each of the plurality of virtual objects according to the feature information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object; and controlling each virtual object to execute the corresponding second policy action. In this way, the method improves training efficiency and simplifies the operation flow while achieving coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.

Description

Action generation method, device and equipment of virtual object
Technical Field
The invention relates to the technical field of reinforcement learning, and in particular to a method, an apparatus, and a device for generating actions of a virtual object.
Background
In the field of reinforcement learning, instant confrontation scenarios require individuals to make continuous decisions within a limited time, and require coordination among the decision-making virtual objects of the same group. Because instant confrontation scenarios involve long delays, action coordination, action constraints, partial observation, and similar problems, reinforcement learning is needed to coordinate the actions of the virtual objects.
Existing reinforcement learning algorithms are divided into single-agent and multi-agent algorithms. A single-agent reinforcement learning algorithm learns the actions and policy of a single virtual object, and therefore cannot fully capture the action characteristics of multiple virtual objects in a group. A multi-agent reinforcement learning algorithm can learn the coordination among multiple virtual objects, but its model is complex and difficult to train. Since existing reinforcement learning algorithms cannot coordinate multiple virtual objects in a group while maintaining training efficiency, an algorithm is needed that trains efficiently, is simple to operate, and coordinately controls the actions of multiple virtual objects.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method, an apparatus, and a device for generating actions of a virtual object that overcome, or at least partially solve, the above problems.
According to an aspect of the embodiments of the present invention, there is provided a method for generating an action of a virtual object, including:
acquiring characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object;
generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
and controlling each virtual object to execute the corresponding second policy action.
Optionally, mapping the feature information of the plurality of virtual objects to feature information of one total virtual object includes:
extracting initial characteristic information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the characteristic information of the plurality of virtual objects to obtain the characteristic information of the total virtual object, wherein the characteristic information of the total virtual object represents the characteristics of the total virtual object from three dimensions.
Optionally, obtaining a first policy action of the plurality of virtual objects according to the feature information of the total virtual object includes:
obtaining an overall strategy according to the characteristic information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
Optionally, generating a second policy action for each virtual object according to the first policy action for the virtual object includes:
extracting at least one basic action contained in each first strategy action, wherein the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
Optionally, after obtaining the feature information of the plurality of virtual objects, the method further includes:
inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after generating a second policy action for the corresponding virtual object according to the first policy action for each virtual object, the method further includes:
and storing the operation data generated by the process of obtaining the second strategy action.
Optionally, the neural network is obtained by training through the following method:
taking pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
Optionally, the outputting, by the to-be-optimized neural network, a third policy action of each training virtual object includes:
the neural network to be optimized generates at least two actions of each training virtual object and the probability of each action in the at least two actions based on the characteristic information of the training sample;
and corresponding to each training virtual object, taking the action with the highest probability in at least two actions corresponding to the corresponding training virtual object as a third strategy action of the training virtual object.
Optionally, converting the third policy action of each training virtual object to obtain a fourth policy action of each training virtual object, including:
filtering illegal actions in the third strategy actions of each training virtual object to obtain filtered actions of each training virtual object;
and generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
According to another aspect of the embodiments of the present invention, there is provided an action generation apparatus for a virtual object, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring characteristic information of a plurality of virtual objects, and the virtual objects belong to the same group;
the processing module is used for mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object; obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
and the control module is used for controlling each virtual object to execute the corresponding second strategy action.
According to still another aspect of an embodiment of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the action generation method of the virtual object.
According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium, in which at least one executable instruction is stored, and the executable instruction causes a processor to execute operations corresponding to the method for generating an action of a virtual object.
According to the solution provided by the above embodiments of the present invention, feature information of a plurality of virtual objects belonging to the same group is acquired; the feature information of the plurality of virtual objects is mapped into the feature information of one total virtual object; a first policy action of each of the plurality of virtual objects is obtained according to the feature information of the total virtual object; a second policy action of the corresponding virtual object is generated according to the first policy action of each virtual object; and each virtual object is controlled to execute the corresponding second policy action. The method improves training efficiency, simplifies the operation flow, and at the same time achieves coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments may be understood more clearly and implemented according to this description, and in order to make the above and other objects, features, and advantages of the embodiments more apparent, specific embodiments of the invention are described in detail below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for generating actions of virtual objects according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a specific instant confrontation scenario provided by an embodiment of the invention;
FIG. 3 is a diagram illustrating a neural network model provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating a data buffer according to an embodiment of the present invention;
FIG. 5 is a flow chart of a neural network training method provided by an embodiment of the present invention;
FIG. 6 illustrates an overall flow diagram of neural network model optimization provided by an embodiment of the present invention;
FIG. 7 is a flow chart illustrating a specific action transformation provided by an embodiment of the present invention;
FIG. 8 is a flowchart of another method for generating actions of virtual objects according to an embodiment of the present invention;
FIG. 9 shows a central control explanatory diagram provided by an embodiment of the invention;
fig. 10 is a schematic structural diagram illustrating an apparatus for generating a motion of a virtual object according to an embodiment of the present invention;
fig. 11 shows a schematic structural diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of an action generation method for a virtual object according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
step 11, obtaining characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
step 12, mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
step 13, obtaining a first policy action of each virtual object in the plurality of virtual objects according to the feature information of the total virtual object;
step 14, generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
step 15, controlling each virtual object to execute the corresponding second policy action.
In this embodiment, feature information of a plurality of virtual objects belonging to the same group is acquired; the feature information of the plurality of virtual objects is mapped into the feature information of one total virtual object; a first policy action of each of the plurality of virtual objects is obtained according to the feature information of the total virtual object; a second policy action of the corresponding virtual object is generated according to the first policy action of each virtual object; and each virtual object is controlled to execute the corresponding second policy action. This improves training efficiency, simplifies the operation flow, and at the same time achieves coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.
In an alternative embodiment of the present invention, step 12 may include:
step 121, extracting initial characteristic information of the plurality of virtual objects;
step 122, respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and 123, synthesizing the feature information of the plurality of virtual objects to obtain the feature information of the total virtual object, wherein the feature information of the total virtual object represents the features of the total virtual object from three dimensions.
In this embodiment, the initial feature information of a virtual object includes, but is not limited to, position information, unit information, and global information. Fig. 2 shows a schematic diagram of a specific instant confrontation scene in which an agent is one of the virtual objects. The scene has the following characteristics: 1. the game is divided into an own side and an opponent side; 2. the agent types of the two confronting sides are different; 3. different agent types lead to different sets of valid actions; 4. the operators of the two sides differ in initial position, distance to the target point, terrain, and so on; 5. because of the terrain, a single operator cannot fully observe the situation of the other operators; 6. one side wins either by eliminating all of the opponent's operators or by occupying all of the target points. The scene requires an algorithm that controls the agents to make effective decisions, including path planning, eliminating the opponent's agents, cooperating with the own side's agents, and occupying target points. Taking the instant confrontation scene shown in fig. 2 as an example, the position information may be fixed-size map information centered on a contested control point, including the map information of the current step and of the previous n steps; the unit information includes the current agent's health, type, speed, guidance capability, coordinate position, weapon cooldown time, fatigue, and so on; and the global information includes the current deduction time, whether the contested point is occupied, and so on.
After the initial feature information of all virtual objects is extracted, all of the initial feature information is normalized using its minimum and maximum values. The processed feature information is then concatenated and reshaped into a three-dimensional feature, so that information can be observed to the greatest extent, that is, the information seen by all agents can be observed from a single viewpoint.
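The following is a minimal sketch, under stated assumptions, of how this feature pre-processing could look. The helper name min_max_normalize, the feature names, and the choice of tiling scalar attributes into constant planes are illustrative assumptions and are not specified in the patent.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale an array to [0, 1] using its own minimum and maximum values."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x, dtype=np.float32)
    return ((x - lo) / (hi - lo)).astype(np.float32)

def build_total_feature(agent_maps, agent_units, global_info):
    """Map per-agent features into the feature of one 'total virtual object'.

    agent_maps:  list of (H, W) local map observations, one per agent
    agent_units: list of 1-D unit-attribute vectors (health, type, speed, ...)
    global_info: 1-D vector (deduction time, control-point occupancy, ...)
    Returns a (C, H, W) three-dimensional feature seen from a single shared viewpoint.
    """
    h, w = agent_maps[0].shape
    channels = [min_max_normalize(m) for m in agent_maps]          # spatial channels
    for vec in agent_units + [global_info]:
        for value in min_max_normalize(np.asarray(vec, dtype=np.float32)):
            channels.append(np.full((h, w), float(value), dtype=np.float32))  # one constant plane per scalar
    return np.stack(channels, axis=0)
```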
In yet another alternative embodiment of the present invention, step 13 may comprise:
step 131, obtaining an overall strategy according to the characteristic information of the total virtual object;
step 132, splitting the overall policy based on the relationship between the plurality of virtual objects and the total virtual object, to obtain a first policy action for each virtual object.
In yet another alternative embodiment of the present invention, step 14 may comprise:
step 141, extracting at least one basic action included in each first policy action, where the basic action is a preset action;
and 142, generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
In this embodiment, the basic actions extracted from the first policy action are first retained. A basic action is a preset action; for example, in the instant confrontation scene shown in fig. 2, the basic actions may be set as movement, shooting, point control, and so on. Higher-order actions are then generated from the basic actions according to environmental factors, and these higher-order actions are the second policy actions. For example, "move" becomes "move forward", "shoot" becomes "shoot backward", and "capture" becomes "capture the first target point".
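A minimal sketch of how a preset basic action might be expanded into a higher-order (second policy) action is shown below. The action names, the direction and target_point parameters, and the dictionary layout are illustrative assumptions, not details taken from the patent.

```python
BASIC_ACTIONS = ("move", "shoot", "capture")  # preset basic actions

def to_second_policy_action(basic_action: str, env_context: dict) -> dict:
    """Turn a preset basic action into a higher-order action using environment factors."""
    if basic_action == "move":
        return {"type": "move", "direction": env_context.get("advance_direction", "forward")}
    if basic_action == "shoot":
        return {"type": "shoot", "direction": env_context.get("threat_direction", "backward")}
    if basic_action == "capture":
        return {"type": "capture", "target_point": env_context.get("nearest_point", 0)}
    raise ValueError(f"unknown basic action: {basic_action}")

# e.g. to_second_policy_action("capture", {"nearest_point": 1}) yields capturing the first target point
```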
As shown in fig. 3, in the above embodiment the feature information of the plurality of virtual objects is first input into the neural network model shown in fig. 3; the neural network synthesizes the feature information of the plurality of virtual objects into the feature information of one total virtual object, and then obtains the first policy action of each virtual object through continuous reinforcement learning. The network structure uses three convolutional layers followed by a fully connected layer, and controls all virtual objects in a centralized manner, thereby achieving action coordination among all virtual objects. Finally, the second policy action of each virtual object is output according to its first policy action; each virtual object outputs an independent action through its own branch, so that every agent produces an action output at every step of interaction with the environment.
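The sketch below illustrates one possible shape of the centralized network just described: three convolutional layers, a shared fully connected layer, and one output branch per virtual object. The layer sizes, the critic head, and the choice of PyTorch are assumptions of this illustration rather than specifics from the patent.

```python
import torch
import torch.nn as nn

class CentralizedPolicyNet(nn.Module):
    """Centrally controls all virtual objects: one shared trunk, one action branch per agent."""

    def __init__(self, in_channels: int, n_agents: int, n_actions: int):
        super().__init__()
        self.trunk = nn.Sequential(                               # three convolutional layers
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64, 256), nn.ReLU())    # shared fully connected layer
        self.heads = nn.ModuleList(                               # one branch per virtual object
            [nn.Linear(256, n_actions) for _ in range(n_agents)]
        )
        self.value = nn.Linear(256, 1)                            # critic head used during training

    def forward(self, total_feature: torch.Tensor):
        """total_feature: (batch, C, H, W) feature of the total virtual object."""
        hidden = self.fc(self.trunk(total_feature))
        logits = torch.stack([head(hidden) for head in self.heads], dim=1)  # (batch, agents, actions)
        return logits, self.value(hidden)
```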
In another optional embodiment of the present invention, after step 11, further comprising:
step 111, inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after step 14, further comprising:
step 143, storing the operation data generated by the process of obtaining the second policy action.
As shown in fig. 4, in this embodiment the operation data is stored in the data buffer shown in fig. 4. The data buffer supports parallel data storage, and supports data storage, computation, and sampling in a parallel environment; data is stored in matrix form to increase computation speed.
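A minimal sketch of a matrix-backed data buffer consistent with this description (parallel environments, matrix storage, random sampling) is given below; the field names and array shapes are assumptions of this illustration.

```python
import numpy as np

class DataBuffer:
    """Stores (state, action, reward, next_state) transitions as preallocated matrices."""

    def __init__(self, capacity: int, n_envs: int, obs_shape, n_agents: int):
        self.states      = np.zeros((capacity, n_envs, *obs_shape), dtype=np.float32)
        self.actions     = np.zeros((capacity, n_envs, n_agents),   dtype=np.int64)
        self.rewards     = np.zeros((capacity, n_envs),             dtype=np.float32)
        self.next_states = np.zeros_like(self.states)
        self.size, self.ptr, self.capacity = 0, 0, capacity

    def add(self, state, action, reward, next_state):
        """Write one step from all parallel environments at once (a single matrix write)."""
        i = self.ptr
        self.states[i], self.actions[i] = state, action
        self.rewards[i], self.next_states[i] = reward, next_state
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        """Randomly sample a batch of stored transitions for training."""
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```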
Fig. 5 is a flowchart illustrating a neural network training method according to an embodiment of the present invention. As shown in fig. 5, the neural network is trained by the following method:
step 51, using pre-stored operation data as a training sample;
step 52, extracting characteristic information of the training sample;
step 53, inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
step 54, converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
step 55, rewarding each fourth policy action to obtain a reward value of each fourth policy action;
step 56, sharing the reward value of each fourth policy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth policy action;
and 57, adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
In this embodiment, a reward mechanism is added to the training of the neural network. The reward is the sum of all scores of all virtual objects of one side, so this design ensures that a reward value is returned after every decision and that all virtual objects share that reward value. Taking the instant confrontation scene shown in fig. 2 as an example, the reward may be set as the sum of the remaining score, the attack score, and the point-control score of all virtual objects of one side. The remaining score is the sum of the scores of all of one side's virtual objects that still have health, where different types of virtual objects have different scores. The attack score is the score obtained when one side's virtual object eliminates an opponent's virtual object, equal to the score of the eliminated object's type. The point-control score is determined by the value of a target point: if a virtual object occupies the corresponding target point, the point-control score is the value of that target point.
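A minimal sketch of this shared reward (remaining score plus attack score plus point-control score, shared by the whole side) follows; the data fields and the per-type score table are illustrative assumptions.

```python
TYPE_SCORE = {"infantry": 1.0, "vehicle": 3.0, "commander": 5.0}  # assumed per-type scores

def shared_reward(own_units, eliminated_enemy_types, occupied_point_values):
    """Compute one reward value shared by every virtual object of the own side.

    own_units: list of dicts with 'type' and 'health' for each own virtual object
    eliminated_enemy_types: types of enemy objects eliminated this step
    occupied_point_values: values of target points currently occupied by the own side
    """
    remaining = sum(TYPE_SCORE[u["type"]] for u in own_units if u["health"] > 0)
    attack    = sum(TYPE_SCORE[t] for t in eliminated_enemy_types)
    control   = sum(occupied_point_values)
    return remaining + attack + control   # every own virtual object receives this same value
```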
Fig. 6 shows the overall flow of neural network model optimization provided by an embodiment of the present invention. As shown in fig. 6, pre-stored operation data is first obtained from the data buffer and used as training samples. An Actor-Critic framework is then established to train the model, with both the Actor and Critic networks modeled as neural networks. The PPO-Clip update is taken as an example below, but the method is not limited to the PPO-Clip algorithm.
In the first step, the Policy network π_θ and the Value network V_φ are initialized, where π is the policy, V is the value, and θ and φ are the network parameters; the initialized parameters of the policy network and of the value network may be different.
In the second step, the agent interacts continuously with the environment: at the current time t, in state s_t, an action is obtained through the policy network, a_t ∼ π_θ(a_t | s_t), where π_θ(a_t | s_t) denotes the output of the policy network, the action is selected according to this probability distribution, and s_t is the state feature returned by the environment.
In the third step, action a_t is executed and the corresponding reward value r_t returned by the environment is obtained; the reward value is calculated by the environment, and the generated trajectory data (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, where s_t is the environment state at time t, a_t is the action at time t, r_t is the reward function at time t, i.e. it defines the reward that a virtual object can obtain from the environment after performing an action, and s_{t+1} is the environment state at the time after t.
In the fourth step, when the number of samples in the experience pool reaches a certain amount, the model is trained: a batch of samples is randomly selected from the experience pool to train the model.
In the fifth step, the Policy network parameters are updated by optimizing the objective function of the accumulated expected return:
L_CLIP(θ) = E_t[ min( ρ_t(θ) · A_t, clip(ρ_t(θ), 1 - ε, 1 + ε) · A_t ) ]
where π_θ is the new policy, π_{θ_old} is the old policy, and ρ_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the importance weight between the new policy and the old policy; min denotes taking the minimum; ε is a fixed hyperparameter whose value usually lies in the range [0.1, 0.3]; clip is the clipping operation, which constrains ρ_t(θ) between 1 - ε and 1 + ε; and A_t is the advantage function estimate.
The advantage function estimate A_t is the difference between the score obtained by the interaction of the current policy with the environment and a baseline. If A_t > 0, the current state-action pair obtains a higher return than the baseline; if A_t < 0, the current state-action pair does not obtain a higher return than the baseline. Specifically, the advantage function estimate A_t is obtained as:
A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1},  with  δ_t = r_t + γ V(s_{t+1}) - V(s_t)
where δ_t is the temporal difference at time t, δ_{t+1} is the temporal difference at the time after t, γ and λ are constants, t is the time index in [0, T], V(s_t) is the value function at time t, and V(s_{t+1}) is the value function at the time after t.
In the sixth step, the Value network parameters are updated according to the formula
L_V(φ) = ( V_φ(s_t) - R_t )²
where R_t is the return target, which is a constant with respect to φ.
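The sketch below shows, under stated assumptions, how the advantage estimation and the PPO-Clip update described above could be implemented. All hyperparameter values (γ, λ, ε, the value-loss weight) and the PyTorch framework choice are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates A_t = delta_t + (gamma*lam)*delta_{t+1} + ...; 'values' has length T+1."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # temporal-difference error delta_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_update(new_log_probs, old_log_probs, advantages, values, returns,
                    optimizer, eps=0.2, value_coef=0.5):
    """One PPO-Clip step: clipped policy objective plus squared-error value loss."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # importance weight rho_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = F.mse_loss(values, returns)                      # (V_phi(s_t) - R_t)^2
    loss = policy_loss + value_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```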
in another optional embodiment of the present invention, in step 53, the outputting, by the neural network to be optimized, a third policy action for each training virtual object includes:
step 531, the neural network to be optimized generates at least two actions of each training virtual object and a probability of each action of the at least two actions based on the feature information of the training samples;
step 532, corresponding to each training virtual object, taking the action with the highest probability in the at least two actions corresponding to the corresponding training virtual object as the third strategy action of the training virtual object.
In this embodiment, if several of the at least two actions of a training virtual object have the same highest probability, one of them is selected at random as the third policy action of that training virtual object, and the third policy action of each training virtual object is output.
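A small sketch of this selection rule, choosing the highest-probability action and breaking ties uniformly at random, is given below; it is purely illustrative.

```python
import numpy as np

def pick_third_policy_action(action_probs: np.ndarray) -> int:
    """Return the index of the most probable action; break ties uniformly at random."""
    best = np.flatnonzero(action_probs == action_probs.max())
    return int(np.random.choice(best))
```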
In yet another alternative embodiment of the present invention, step 54 may comprise:
step 541, filtering an illegal action in the third policy action of each training virtual object to obtain a filtered action of each training virtual object, wherein the illegal action is an action violating a preset rule;
and 542, generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
As shown in fig. 7, in this embodiment the basic actions extracted from the first policy action are retained first, and higher-order actions are then generated from the basic actions according to environmental factors so as to accelerate exploration. During exploration, illegal actions are filtered out automatically to reduce the exploration of illegal actions, which improves data validity. The action index output by the network is converted into an action with actual meaning and returned in a format that the engine can parse, which facilitates interaction with the environment.
Specifically, a dynamic frame-skipping mechanism is added to the action conversion, and a valid-action mask mechanism is added at the network output: the mask value of an illegal action is preset to 0, and the action probabilities predicted by the network are multiplied by the action mask, so that the product for any illegal action is 0 and the illegal action is filtered out directly, yielding the fourth policy action of the virtual object. Taking the instant confrontation scene shown in fig. 2 as an example, when a virtual object attempts an action that requires a weapon while the weapon is cooling down or being switched, the physical characteristics of the environment make the action impossible for the current virtual object; the mask value of that action is therefore preset to 0, and multiplying the action probabilities predicted by the network by the mask filters the illegal action out directly.
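A minimal sketch of this valid-action mask follows: illegal actions receive mask value 0, the predicted probabilities are multiplied by the mask, and the result is renormalized before an action is chosen. The renormalization step and fallback are assumptions of this illustration.

```python
import numpy as np

def apply_action_mask(action_probs: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Multiply predicted action probabilities by a 0/1 mask so illegal actions become 0."""
    masked = action_probs * action_mask            # illegal actions now have probability 0
    total = masked.sum()
    if total > 0:
        return masked / total
    return action_mask / action_mask.sum()         # fallback: uniform over legal actions

# e.g. an action that needs a weapon is masked out while the weapon is cooling down or switching:
# probs = apply_action_mask(np.array([0.5, 0.3, 0.2]), np.array([1, 0, 1]))
```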
Fig. 8 is a flowchart of another action generation method for a virtual object according to an embodiment of the present invention. As shown in fig. 8, the instant confrontation scene shown in fig. 2 is used as the modeling example, but the method is not limited to a two-sided instant confrontation scene and can also be applied to a multi-sided instant confrontation scene.
First, state features are extracted from the interactive environment. The state features of each virtual object of a confronting side are encoded and then shared to form a unified situation feature;
Second, a centrally controlled neural network structure is designed, which can effectively achieve coordination and consistency among the virtual objects. The input is the shared state feature of one side; the length of the output equals the number of that side's virtual objects, and the length of the decision data for each virtual object equals the number of designed decision actions. To effectively filter out illegal actions, an action filtering module is added before the network output;
Third, an action conversion module is designed to convert the actions output by the network into the action format required by the environment;
Fourth, a reward module is designed. Because the environment has long delays, rewards are sparse; the per-step reward value is therefore designed according to the final winning condition of the whole scene, and all of one side's virtual objects share the reward;
Fifth, the data generated by interaction with the environment (such as state, action, reward, and next-moment state, but not limited to these) is stored in the data buffer;
Sixth, a deep reinforcement learning Actor-Critic network architecture is used to continuously obtain data from the buffer for policy training.
Based on the flow of the first to sixth steps, the policy can be learned iteratively and continuously, finally achieving policy optimization in the instant confrontation scene.
In the other action generation method for a virtual object shown in fig. 8, policy optimization in an instant confrontation scene is achieved by designing modules such as state-feature encoding and sharing, a central network structure, valid-action filtering, reward value design, a data buffer, and a training algorithm. This achieves goals such as situation sharing among agents, reward sharing, action coordination, and illegal-action filtering, and solves sequential decision-making problems in instant confrontation scenes such as decision coordination among virtual objects, decision action constraints, and unobservable states.
Fig. 9 shows an explanatory diagram of central control provided by an embodiment of the present invention, where the central control refers to a manner used by a neural network in an embodiment of the present invention to control each virtual object in a lower layer, and an understanding of the central control provided by an embodiment of the present invention is described below with reference to fig. 9 by taking the instant confrontation scenario shown in fig. 2 as an example:
1. the lower layer is controlled in a centralized way: in the countermeasure, an upper commander controls the forces of the lower layer, namely each virtual object, and the central mode is embodied as that the upper layer centrally controls each virtual object of the lower layer.
2. Sharing situation: after each virtual object carries out decision-making action on the environment, the environment returns to the state, and each virtual object shares the state of the environment as a uniform situation, which can be understood as all own situations observed by a command officer at the upper layer.
3. A central network: the unified network structure is adopted, so that the decision idea of a commander on the upper layer can be understood, the currently observed global situation is input, and the decision action of each force on the lower layer is output. Therefore, the unified network structure design can be understood as that the commander thinks the whole target of the game when deciding each force action.
4. Each virtual object awards the same, and the states and actions are different: the reward is designed as one, i.e. the final target index of the game is the same for all virtual objects, so that the indexes are shared by all the lower-layer virtual objects, and the optimization direction is consistent in the interaction with the environment and the subsequent model optimization.
5. And (3) centralized training algorithm: after decision actions of each force are output through the same network, trajectory data (state, action, reward and next moment state) are continuously generated by interaction with the environment, the trajectory data are collected and then are uniformly trained, training targets are consistent, therefore, network optimization directions are consistent, and action cooperativity of each virtual object output by the network is stronger and stronger in continuous updating of new associations.
In the embodiment of the present invention, the global situation of the confrontation is formed by uniformly concatenating the feature codes of all of one side's virtual objects, so situation sharing among all of that side's virtual objects can be achieved. The same reward is designed for all virtual objects, and all virtual objects share the reward designed according to the final objective of the game, so the optimization directions remain consistent in subsequent model optimization. A centrally controlled network structure takes the shared situation features as input and outputs the decision action of each virtual object, thereby achieving action coordination among the virtual objects. Filtering out illegal actions before the network outputs decision actions improves model training efficiency. Storing the data generated by each virtual object's interaction with the environment in the data cache pool for policy learning allows the neural network to make fuller use of richer and more effective data, continuously optimize the policy, and generate better policy actions.
Fig. 10 is a schematic structural diagram illustrating an apparatus 100 for generating a virtual object according to an embodiment of the present invention. As shown in fig. 10, the apparatus includes:
an obtaining module 101, configured to obtain feature information of a plurality of virtual objects, where the plurality of virtual objects belong to a same group;
a processing module 102, configured to map feature information of the plurality of virtual objects into feature information of a total virtual object; obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
a control module 103, configured to control each virtual object to execute the corresponding second policy action.
Optionally, the processing module 102 is further configured to extract initial feature information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the characteristic information of the plurality of virtual objects to obtain the characteristic information of the total virtual object, wherein the characteristic information of the total virtual object represents the characteristics of the total virtual object from three dimensions.
Optionally, the processing module 102 is further configured to obtain an overall policy according to the feature information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
Optionally, the processing module 102 is further configured to extract at least one basic action included in each of the first policy actions, where the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
Optionally, the processing module 102 is further configured to input feature information of the plurality of virtual objects into a trained neural network;
and storing the operation data generated by the process of obtaining the second strategy action.
Optionally, the processing module 102 is further configured to use pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
Optionally, the processing module 102 is further configured to generate, by the neural network to be optimized, at least two actions of each training virtual object and a probability of each action of the at least two actions based on the feature information of the training sample;
and corresponding to each training virtual object, taking the action with the highest probability in at least two actions corresponding to the corresponding training virtual object as a third strategy action of the training virtual object.
Optionally, the processing module 102 is further configured to filter an illegal action in the third policy action of each training virtual object, so as to obtain a filtered action of each training virtual object;
and generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
It should be noted that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and all the implementations in the above method embodiment are applicable to this apparatus embodiment, and the same technical effects can be achieved.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute the method for generating the action of the virtual object in any method embodiment described above.
Fig. 11 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 11, the computing device may include: a processor (processor), a Communications Interface (Communications Interface), a memory (memory), and a Communications bus.
Wherein: the processor, the communication interface, and the memory communicate with each other via the communication bus. The communication interface is used for communicating with network elements of other devices, such as clients or other servers. The processor is used for executing a program, and in particular can execute the relevant steps in the above embodiments of the action generation method for a virtual object as applied to the computing device.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program may specifically be configured to cause a processor to execute the method for generating an action of a virtual object in any of the method embodiments described above. For specific implementation of each step in the program, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing embodiment of the method for generating an action of a virtual object, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (9)

1. A method for generating a motion of a virtual object, the method comprising:
acquiring characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object;
generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
controlling each virtual object to execute the corresponding second policy action;
after acquiring the feature information of the plurality of virtual objects, the method further comprises the following steps:
inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after generating a second policy action for the corresponding virtual object according to the first policy action for each virtual object, the method further includes:
storing operation data generated in the process of obtaining the second policy action;
the neural network is obtained by training the following method:
taking pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
2. The method for generating motion of a virtual object according to claim 1, wherein mapping feature information of the plurality of virtual objects to feature information of one total virtual object includes:
extracting initial characteristic information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the plurality of normalized feature information to obtain feature information of the total virtual object, wherein the feature information of the total virtual object represents the features of the total virtual object from three dimensions.
3. The method according to claim 1, wherein obtaining the first policy action of the plurality of virtual objects according to the feature information of the total virtual object includes:
obtaining an overall strategy according to the characteristic information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
4. The method according to claim 1, wherein generating the second policy action of each virtual object according to the first policy action of the virtual object comprises:
extracting at least one basic action contained in each first strategy action, wherein the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
5. The method according to claim 1, wherein the neural network to be optimized outputting a third policy action of each training virtual object comprises:
generating, by the neural network to be optimized, at least two actions of each training virtual object and a probability of each of the at least two actions based on the feature information of the training samples;
and for each training virtual object, taking the action with the highest probability among the at least two actions corresponding to that training virtual object as the third policy action of the training virtual object.
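A compact sketch of the selection rule in claim 5: raw network scores are turned into per-action probabilities (softmax is an assumption) and the highest-probability action becomes each training virtual object's third policy action.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def third_policy_actions(raw_scores):
    """raw_scores: (n_objects, n_candidate_actions) network outputs.
    Convert to probabilities and pick the most probable action per object."""
    probs = softmax(np.asarray(raw_scores, dtype=np.float64))
    return probs.argmax(axis=1), probs

actions, _ = third_policy_actions([[2.0, 0.5, 0.1], [0.2, 1.5, 1.4]])
print(actions)  # [0 1]
```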
6. The method according to claim 1, wherein converting the third policy action of each training virtual object to obtain the fourth policy action of each training virtual object comprises:
filtering out illegal actions from the third policy action of each training virtual object to obtain a filtered action of each training virtual object;
and generating the fourth policy action of the corresponding training virtual object according to the filtered action of each training virtual object.
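Claim 6's filtering step resembles the common action-masking trick; the sketch below assumes a binary legality mask supplied by the environment and a no-op fallback when every candidate is masked out — both assumptions beyond what the claim states.

```python
import numpy as np

def fourth_policy_action(action_probs, legal_mask, noop=0):
    """Zero the probabilities of illegal actions and pick the best of the
    remaining (filtered) actions as the fourth policy action."""
    masked = np.asarray(action_probs, dtype=np.float64) * np.asarray(legal_mask)
    if masked.sum() == 0.0:
        return noop                     # every candidate was illegal
    return int(masked.argmax())

print(fourth_policy_action([0.6, 0.3, 0.1], legal_mask=[0, 1, 1]))  # 1
```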
7. An apparatus for generating an action of a virtual object, the apparatus comprising an acquisition module, a processing module and a control module, wherein:
the acquisition module is configured to acquire feature information of a plurality of virtual objects, the plurality of virtual objects belonging to the same group;
the processing module is configured to map the feature information of the plurality of virtual objects to feature information of one total virtual object; obtain a first policy action of each virtual object in the plurality of virtual objects according to the feature information of the total virtual object; generate a second policy action of the corresponding virtual object according to the first policy action of each virtual object; input the feature information of the plurality of virtual objects into a trained neural network; acquire and store operation data generated while the second policy actions are executed; take pre-stored operation data as training samples; extract feature information of the training samples; input the feature information of the training samples into a neural network to be optimized, wherein the neural network to be optimized outputs a third policy action of each training virtual object; convert the third policy action of each training virtual object to obtain a fourth policy action of each training virtual object; assign a reward to each fourth policy action to obtain a reward value of each fourth policy action; share the reward values of the fourth policy actions to obtain a shared reward value, wherein the shared reward value represents how effective the corresponding fourth policy action is; and adjust parameters of the neural network to be optimized according to the shared reward value to obtain the trained neural network;
and the control module is configured to control each virtual object to execute the corresponding second policy action.
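Purely as an illustration of the module split named in claim 7 (acquisition, processing, control), not of the patented logic, the three modules might be wired together as follows; the object and action representations are placeholders.

```python
class AcquisitionModule:
    def acquire(self, objects):
        # each virtual object is assumed to expose a 'features' entry
        return [o["features"] for o in objects]

class ProcessingModule:
    def process(self, feature_list):
        # placeholder for mapping features -> first -> second policy actions
        return ["hold" for _ in feature_list]

class ControlModule:
    def control(self, objects, actions):
        for obj, act in zip(objects, actions):
            obj["last_action"] = act    # stand-in for executing the action
        return objects

group = [{"features": [0.1, 0.2]}, {"features": [0.3, 0.4]}]
acts = ProcessingModule().process(AcquisitionModule().acquire(group))
print(ControlModule().control(group, acts))
```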
8. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction which, when executed, causes the processor to perform the method for generating an action of a virtual object according to any one of claims 1 to 6.
9. A computer storage medium having stored therein at least one executable instruction which, when executed, causes a computing device to perform the method for generating an action of a virtual object according to any one of claims 1 to 6.
CN202210048175.4A 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object Active CN114053712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210048175.4A CN114053712B (en) 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object

Publications (2)

Publication Number Publication Date
CN114053712A (en) 2022-02-18
CN114053712B (en) 2022-04-22

Family

ID=80231091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210048175.4A Active CN114053712B (en) 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object

Country Status (1)

Country Link
CN (1) CN114053712B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808172B (en) * 2024-02-29 2024-05-07 佛山慧谷科技股份有限公司 Automatic stone material discharging method and device, electronic equipment and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749041A (en) * 2019-10-29 2021-05-04 中国移动通信集团浙江有限公司 Virtualized network function backup strategy self-decision method and device and computing equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110404264B (en) * 2019-07-25 2022-11-01 哈尔滨工业大学(深圳) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
CN111589166A (en) * 2020-05-15 2020-08-28 深圳海普参数科技有限公司 Interactive task control, intelligent decision model training methods, apparatus, and media
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113926181A (en) * 2021-10-21 2022-01-14 腾讯科技(深圳)有限公司 Object control method and device of virtual scene and electronic equipment
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent

Also Published As

Publication number Publication date
CN114053712A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
KR20210028728A (en) Method, apparatus, and device for scheduling virtual objects in a virtual environment
CN112801290B (en) Multi-agent deep reinforcement learning method, system and application
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
Ma et al. Contrastive variational reinforcement learning for complex observations
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114053712B (en) Action generation method, device and equipment of virtual object
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN114741886A (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115860107A (en) Multi-machine search method and system based on multi-agent deep reinforcement learning
CN113947022B (en) Near-end strategy optimization method based on model
CN113313209A (en) Multi-agent reinforcement learning training method with high sample efficiency
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Waris et al. Evolving deep neural networks with cultural algorithms for real-time industrial applications
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
CN117648585B (en) Intelligent decision model generalization method and device based on task similarity
CN114037048B (en) Belief-consistent multi-agent reinforcement learning method based on variational circulation network model
CN116842761B (en) Self-game-based blue army intelligent body model construction method and device
Zhu et al. Deep residual attention reinforcement learning
CN116663803A (en) Multi-unmanned aerial vehicle task allocation method and system considering uncertainty information
CN116360435A (en) Training method and system for multi-agent collaborative strategy based on plot memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant