CN114053712B - Action generation method, device and equipment of virtual object - Google Patents

Action generation method, device and equipment of virtual object

Info

Publication number
CN114053712B
CN114053712B (application CN202210048175.4A)
Authority
CN
China
Prior art keywords
virtual object
action
training
characteristic information
virtual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210048175.4A
Other languages
Chinese (zh)
Other versions
CN114053712A (en)
Inventor
徐博
王燕娜
张鸿铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210048175.4A priority Critical patent/CN114053712B/en
Publication of CN114053712A publication Critical patent/CN114053712A/en
Application granted granted Critical
Publication of CN114053712B publication Critical patent/CN114053712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55: Controlling game characters or game objects based on the game progress
    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80: Special adaptations for executing a specific game genre or game mode
    • A63F13/837: Shooting of targets
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, an apparatus, and a device for generating actions of virtual objects. The method comprises the following steps: acquiring feature information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group; mapping the feature information of the plurality of virtual objects into the feature information of one total virtual object; obtaining a first policy action of each of the plurality of virtual objects according to the feature information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object; and controlling each virtual object to execute the corresponding second policy action. In this way, the method improves training efficiency and simplifies the operation flow while achieving coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.

Description

Action generation method, device and equipment of virtual object
Technical Field
The invention relates to the technical field of reinforcement learning, and in particular to a method, an apparatus, and a device for generating actions of a virtual object.
Background
In the field of reinforcement learning, instant confrontation scenarios require individuals to make continuous decisions within a limited time, and require coordination among the decision-making virtual objects of the same group. Because instant confrontation scenarios involve long delays, action coordination, action constraints, partial observation, and similar problems, reinforcement learning is needed to coordinate the actions of the virtual objects.
Existing reinforcement learning algorithms are divided into single-agent and multi-agent algorithms. A single-agent reinforcement learning algorithm learns the actions and policy of a single virtual object, and therefore cannot fully capture the action characteristics of multiple virtual objects in a group. A multi-agent reinforcement learning algorithm can learn the coordination among multiple virtual objects, but its model is complex and difficult to train. Since existing reinforcement learning algorithms cannot coordinate multiple virtual objects in a group while maintaining training efficiency, an algorithm is needed that trains efficiently, is simple to operate, and coordinately controls the actions of multiple virtual objects.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method, an apparatus, and a device for generating actions of a virtual object that overcome, or at least partially solve, the above problems.
According to an aspect of the embodiments of the present invention, there is provided a method for generating an action of a virtual object, including:
acquiring characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object;
generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
and controlling each virtual object to execute the corresponding second policy action.
Optionally, mapping the feature information of the plurality of virtual objects to feature information of one total virtual object includes:
extracting initial characteristic information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the characteristic information of the plurality of virtual objects to obtain the characteristic information of the total virtual object, wherein the characteristic information of the total virtual object represents the characteristics of the total virtual object from three dimensions.
Optionally, obtaining a first policy action of the plurality of virtual objects according to the feature information of the total virtual object includes:
obtaining an overall strategy according to the characteristic information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
Optionally, generating a second policy action for each virtual object according to the first policy action for the virtual object includes:
extracting at least one basic action contained in each first strategy action, wherein the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
Optionally, after obtaining the feature information of the plurality of virtual objects, the method further includes:
inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after generating a second policy action for the corresponding virtual object according to the first policy action for each virtual object, the method further includes:
and storing the operation data generated by the process of obtaining the second strategy action.
Optionally, the neural network is obtained by training through the following method:
taking pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
Optionally, the outputting, by the to-be-optimized neural network, a third policy action of each training virtual object includes:
the neural network to be optimized generates at least two actions of each training virtual object and the probability of each action in the at least two actions based on the characteristic information of the training sample;
and corresponding to each training virtual object, taking the action with the highest probability in at least two actions corresponding to the corresponding training virtual object as a third strategy action of the training virtual object.
Optionally, converting the third policy action of each training virtual object to obtain a fourth policy action of each training virtual object, including:
filtering illegal actions in the third strategy actions of each training virtual object to obtain filtered actions of each training virtual object;
and generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
According to another aspect of the embodiments of the present invention, there is provided an action generation apparatus for a virtual object, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring characteristic information of a plurality of virtual objects, and the virtual objects belong to the same group;
the processing module is used for mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object; obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
and the control module is used for controlling each virtual object to execute the corresponding second strategy action.
According to still another aspect of an embodiment of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the action generation method of the virtual object.
According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium, in which at least one executable instruction is stored, and the executable instruction causes a processor to execute operations corresponding to the method for generating an action of a virtual object.
According to the solution provided by the above embodiments of the present invention, feature information of a plurality of virtual objects belonging to the same group is acquired; the feature information of the plurality of virtual objects is mapped into the feature information of one total virtual object; a first policy action of each of the plurality of virtual objects is obtained according to the feature information of the total virtual object; a second policy action of the corresponding virtual object is generated according to the first policy action of each virtual object; and each virtual object is controlled to execute the corresponding second policy action. The method improves training efficiency, simplifies the operation flow, and at the same time achieves coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments may be understood more clearly and implemented according to this description, and in order to make the above and other objects, features, and advantages of the embodiments more apparent, specific embodiments of the invention are described in detail below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for generating actions of virtual objects according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a specific instant confrontation scenario provided by an embodiment of the invention;
FIG. 3 is a diagram illustrating a neural network model provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating a data buffer according to an embodiment of the present invention;
FIG. 5 is a flow chart of a neural network training method provided by an embodiment of the present invention;
FIG. 6 illustrates an overall flow diagram of neural network model optimization provided by an embodiment of the present invention;
FIG. 7 is a flow chart illustrating a specific action transformation provided by an embodiment of the present invention;
FIG. 8 is a flowchart of another method for generating actions of virtual objects according to an embodiment of the present invention;
FIG. 9 shows a central control explanatory diagram provided by an embodiment of the invention;
fig. 10 is a schematic structural diagram illustrating an apparatus for generating a motion of a virtual object according to an embodiment of the present invention;
fig. 11 shows a schematic structural diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of an action generation method for a virtual object according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
step 11, obtaining characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
step 12, mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
step 13, obtaining a first policy action of each virtual object in the plurality of virtual objects according to the feature information of the total virtual object;
step 14, generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
step 15, controlling each virtual object to execute the corresponding second policy action.
In this embodiment, feature information of a plurality of virtual objects belonging to the same group is acquired; the feature information of the plurality of virtual objects is mapped into the feature information of one total virtual object; a first policy action of each of the plurality of virtual objects is obtained according to the feature information of the total virtual object; a second policy action of the corresponding virtual object is generated according to the first policy action of each virtual object; and each virtual object is controlled to execute the corresponding second policy action. This improves training efficiency, simplifies the operation flow, and at the same time achieves coordinated, intelligent control of the actions of the plurality of virtual objects, so that the virtual objects of one group exhibit coordination among their actions when confronting opponents, and the game result of the plurality of virtual objects is continuously optimized toward a preset target in the virtual scene.
In an alternative embodiment of the present invention, step 12 may include:
step 121, extracting initial characteristic information of the plurality of virtual objects;
step 122, respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and 123, synthesizing the feature information of the plurality of virtual objects to obtain the feature information of the total virtual object, wherein the feature information of the total virtual object represents the features of the total virtual object from three dimensions.
In this embodiment, the initial feature information of a virtual object includes, but is not limited to, position information, unit information, and global information. Fig. 2 shows a schematic diagram of a specific instant confrontation scene in which an agent is one of the virtual objects. The scene has the following characteristics: 1. the game is divided into an own side and an opponent side; 2. the agent types of the two confronting sides are different; 3. different agent types lead to different sets of valid actions; 4. the operators of the two sides differ in initial position, distance to the target point, terrain, and so on; 5. because of the terrain, a single operator cannot fully observe the situation of the other operators; 6. one side wins either by eliminating all of the opponent's operators or by occupying all of the target points. The scene requires an algorithm that controls the agents to make effective decisions, including path planning, eliminating the opponent's agents, cooperating with the own side's agents, and occupying target points. Taking the instant confrontation scene shown in fig. 2 as an example, the position information may be fixed-size map information centered on a contested control point, including the map information of the current step and of the previous n steps; the unit information includes the current agent's health, type, speed, guidance capability, coordinate position, weapon cooldown time, fatigue, and so on; and the global information includes the current deduction time, whether the contested point is occupied, and so on.
After the initial feature information of all virtual objects is extracted, all of the initial feature information is normalized using its minimum and maximum values. The processed feature information is then concatenated and reshaped into a three-dimensional feature, so that information can be observed to the greatest extent, that is, the information seen by all agents can be observed from a single viewpoint.
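The following is a minimal sketch, under stated assumptions, of how this feature pre-processing could look. The helper name min_max_normalize, the feature names, and the choice of tiling scalar attributes into constant planes are illustrative assumptions and are not specified in the patent.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale an array to [0, 1] using its own minimum and maximum values."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x, dtype=np.float32)
    return ((x - lo) / (hi - lo)).astype(np.float32)

def build_total_feature(agent_maps, agent_units, global_info):
    """Map per-agent features into the feature of one 'total virtual object'.

    agent_maps:  list of (H, W) local map observations, one per agent
    agent_units: list of 1-D unit-attribute vectors (health, type, speed, ...)
    global_info: 1-D vector (deduction time, control-point occupancy, ...)
    Returns a (C, H, W) three-dimensional feature seen from a single shared viewpoint.
    """
    h, w = agent_maps[0].shape
    channels = [min_max_normalize(m) for m in agent_maps]          # spatial channels
    for vec in agent_units + [global_info]:
        for value in min_max_normalize(np.asarray(vec, dtype=np.float32)):
            channels.append(np.full((h, w), float(value), dtype=np.float32))  # one constant plane per scalar
    return np.stack(channels, axis=0)
```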
In yet another alternative embodiment of the present invention, step 13 may comprise:
step 131, obtaining an overall strategy according to the characteristic information of the total virtual object;
step 132, splitting the overall policy based on the relationship between the plurality of virtual objects and the total virtual object, to obtain a first policy action for each virtual object.
In yet another alternative embodiment of the present invention, step 14 may comprise:
step 141, extracting at least one basic action included in each first policy action, where the basic action is a preset action;
and 142, generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
In this embodiment, the basic actions extracted from the first policy action are first retained. A basic action is a preset action; for example, in the instant confrontation scene shown in fig. 2, the basic actions may be set as movement, shooting, point control, and so on. Higher-order actions are then generated from the basic actions according to environmental factors, and these higher-order actions are the second policy actions. For example, "move" becomes "move forward", "shoot" becomes "shoot backward", and "capture" becomes "capture the first target point".
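A minimal sketch of how a preset basic action might be expanded into a higher-order (second policy) action is shown below. The action names, the direction and target_point parameters, and the dictionary layout are illustrative assumptions, not details taken from the patent.

```python
BASIC_ACTIONS = ("move", "shoot", "capture")  # preset basic actions

def to_second_policy_action(basic_action: str, env_context: dict) -> dict:
    """Turn a preset basic action into a higher-order action using environment factors."""
    if basic_action == "move":
        return {"type": "move", "direction": env_context.get("advance_direction", "forward")}
    if basic_action == "shoot":
        return {"type": "shoot", "direction": env_context.get("threat_direction", "backward")}
    if basic_action == "capture":
        return {"type": "capture", "target_point": env_context.get("nearest_point", 0)}
    raise ValueError(f"unknown basic action: {basic_action}")

# e.g. to_second_policy_action("capture", {"nearest_point": 1}) yields capturing the first target point
```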
As shown in fig. 3, in the above embodiment the feature information of the plurality of virtual objects is first input into the neural network model shown in fig. 3; the neural network synthesizes the feature information of the plurality of virtual objects into the feature information of one total virtual object, and then obtains the first policy action of each virtual object through continuous reinforcement learning. The network structure uses three convolutional layers followed by a fully connected layer, and controls all virtual objects in a centralized manner, thereby achieving action coordination among all virtual objects. Finally, the second policy action of each virtual object is output according to its first policy action; each virtual object outputs an independent action through its own branch, so that every agent produces an action output at every step of interaction with the environment.
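The sketch below illustrates one possible shape of the centralized network just described: three convolutional layers, a shared fully connected layer, and one output branch per virtual object. The layer sizes, the critic head, and the choice of PyTorch are assumptions of this illustration rather than specifics from the patent.

```python
import torch
import torch.nn as nn

class CentralizedPolicyNet(nn.Module):
    """Centrally controls all virtual objects: one shared trunk, one action branch per agent."""

    def __init__(self, in_channels: int, n_agents: int, n_actions: int):
        super().__init__()
        self.trunk = nn.Sequential(                               # three convolutional layers
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64, 256), nn.ReLU())    # shared fully connected layer
        self.heads = nn.ModuleList(                               # one branch per virtual object
            [nn.Linear(256, n_actions) for _ in range(n_agents)]
        )
        self.value = nn.Linear(256, 1)                            # critic head used during training

    def forward(self, total_feature: torch.Tensor):
        """total_feature: (batch, C, H, W) feature of the total virtual object."""
        hidden = self.fc(self.trunk(total_feature))
        logits = torch.stack([head(hidden) for head in self.heads], dim=1)  # (batch, agents, actions)
        return logits, self.value(hidden)
```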
In another optional embodiment of the present invention, after step 11, further comprising:
step 111, inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after step 14, further comprising:
step 143, storing the operation data generated by the process of obtaining the second policy action.
As shown in fig. 4, in this embodiment the operation data is stored in the data buffer shown in fig. 4. The data buffer supports parallel data storage, and supports data storage, computation, and sampling in a parallel environment; data is stored in matrix form to increase computation speed.
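A minimal sketch of a matrix-backed data buffer consistent with this description (parallel environments, matrix storage, random sampling) is given below; the field names and array shapes are assumptions of this illustration.

```python
import numpy as np

class DataBuffer:
    """Stores (state, action, reward, next_state) transitions as preallocated matrices."""

    def __init__(self, capacity: int, n_envs: int, obs_shape, n_agents: int):
        self.states      = np.zeros((capacity, n_envs, *obs_shape), dtype=np.float32)
        self.actions     = np.zeros((capacity, n_envs, n_agents),   dtype=np.int64)
        self.rewards     = np.zeros((capacity, n_envs),             dtype=np.float32)
        self.next_states = np.zeros_like(self.states)
        self.size, self.ptr, self.capacity = 0, 0, capacity

    def add(self, state, action, reward, next_state):
        """Write one step from all parallel environments at once (a single matrix write)."""
        i = self.ptr
        self.states[i], self.actions[i] = state, action
        self.rewards[i], self.next_states[i] = reward, next_state
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        """Randomly sample a batch of stored transitions for training."""
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```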
Fig. 5 is a flowchart illustrating a neural network training method according to an embodiment of the present invention. As shown in fig. 5, the neural network is trained by the following method:
step 51, using pre-stored operation data as a training sample;
step 52, extracting characteristic information of the training sample;
step 53, inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
step 54, converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
step 55, rewarding each fourth policy action to obtain a reward value of each fourth policy action;
step 56, sharing the reward value of each fourth policy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth policy action;
and 57, adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
In this embodiment, a reward mechanism is added to the training of the neural network. The reward is the sum of all scores of all virtual objects of one side, so this design ensures that a reward value is returned after every decision and that all virtual objects share that reward value. Taking the instant confrontation scene shown in fig. 2 as an example, the reward may be set as the sum of the remaining score, the attack score, and the point-control score of all virtual objects of one side. The remaining score is the sum of the scores of all of one side's virtual objects that still have health, where different types of virtual objects have different scores. The attack score is the score obtained when one side's virtual object eliminates an opponent's virtual object, equal to the score of the eliminated object's type. The point-control score is determined by the value of a target point: if a virtual object occupies the corresponding target point, the point-control score is the value of that target point.
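A minimal sketch of this shared reward (remaining score plus attack score plus point-control score, shared by the whole side) follows; the data fields and the per-type score table are illustrative assumptions.

```python
TYPE_SCORE = {"infantry": 1.0, "vehicle": 3.0, "commander": 5.0}  # assumed per-type scores

def shared_reward(own_units, eliminated_enemy_types, occupied_point_values):
    """Compute one reward value shared by every virtual object of the own side.

    own_units: list of dicts with 'type' and 'health' for each own virtual object
    eliminated_enemy_types: types of enemy objects eliminated this step
    occupied_point_values: values of target points currently occupied by the own side
    """
    remaining = sum(TYPE_SCORE[u["type"]] for u in own_units if u["health"] > 0)
    attack    = sum(TYPE_SCORE[t] for t in eliminated_enemy_types)
    control   = sum(occupied_point_values)
    return remaining + attack + control   # every own virtual object receives this same value
```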
Fig. 6 shows the overall flow of neural network model optimization provided by an embodiment of the present invention. As shown in fig. 6, pre-stored operation data is first obtained from the data buffer and used as training samples. An Actor-Critic framework is then established to train the model, with both the Actor and Critic networks modeled as neural networks. The PPO-Clip update is taken as an example below, but the method is not limited to the PPO-Clip algorithm.
In the first step, the Policy network π_θ and the Value network V_φ are initialized, where π is the policy, V is the value, and θ and φ are the network parameters; the initialized parameters of the policy network and of the value network may be different.
In the second step, the agent interacts continuously with the environment: at the current time t, in state s_t, an action is obtained through the policy network, a_t ∼ π_θ(a_t | s_t), where π_θ(a_t | s_t) denotes the output of the policy network, the action is selected according to this probability distribution, and s_t is the state feature returned by the environment.
In the third step, action a_t is executed and the corresponding reward value r_t returned by the environment is obtained; the reward value is calculated by the environment, and the generated trajectory data (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, where s_t is the environment state at time t, a_t is the action at time t, r_t is the reward function at time t, i.e. it defines the reward that a virtual object can obtain from the environment after performing an action, and s_{t+1} is the environment state at the time after t.
In the fourth step, when the number of samples in the experience pool reaches a certain amount, the model is trained: a batch of samples is randomly selected from the experience pool to train the model.
In the fifth step, the Policy network parameters are updated by optimizing the objective function of the accumulated expected return:
L_CLIP(θ) = E_t[ min( ρ_t(θ) · A_t, clip(ρ_t(θ), 1 - ε, 1 + ε) · A_t ) ]
where π_θ is the new policy, π_{θ_old} is the old policy, and ρ_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the importance weight between the new policy and the old policy; min denotes taking the minimum; ε is a fixed hyperparameter whose value usually lies in the range [0.1, 0.3]; clip is the clipping operation, which constrains ρ_t(θ) between 1 - ε and 1 + ε; and A_t is the advantage function estimate.
The advantage function estimate A_t is the difference between the score obtained by the interaction of the current policy with the environment and a baseline. If A_t > 0, the current state-action pair obtains a higher return than the baseline; if A_t < 0, the current state-action pair does not obtain a higher return than the baseline. Specifically, the advantage function estimate A_t is obtained as:
A_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1},  with  δ_t = r_t + γ V(s_{t+1}) - V(s_t)
where δ_t is the temporal difference at time t, δ_{t+1} is the temporal difference at the time after t, γ and λ are constants, t is the time index in [0, T], V(s_t) is the value function at time t, and V(s_{t+1}) is the value function at the time after t.
In the sixth step, the Value network parameters are updated according to the formula
L_V(φ) = ( V_φ(s_t) - R_t )²
where R_t is the return target, which is a constant with respect to φ.
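The sketch below shows, under stated assumptions, how the advantage estimation and the PPO-Clip update described above could be implemented. All hyperparameter values (γ, λ, ε, the value-loss weight) and the PyTorch framework choice are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates A_t = delta_t + (gamma*lam)*delta_{t+1} + ...; 'values' has length T+1."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # temporal-difference error delta_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_update(new_log_probs, old_log_probs, advantages, values, returns,
                    optimizer, eps=0.2, value_coef=0.5):
    """One PPO-Clip step: clipped policy objective plus squared-error value loss."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # importance weight rho_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = F.mse_loss(values, returns)                      # (V_phi(s_t) - R_t)^2
    loss = policy_loss + value_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```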
in another optional embodiment of the present invention, in step 53, the outputting, by the neural network to be optimized, a third policy action for each training virtual object includes:
step 531, the neural network to be optimized generates at least two actions of each training virtual object and a probability of each action of the at least two actions based on the feature information of the training samples;
step 532, corresponding to each training virtual object, taking the action with the highest probability in the at least two actions corresponding to the corresponding training virtual object as the third strategy action of the training virtual object.
In this embodiment, if several of the at least two actions of a training virtual object have the same highest probability, one of them is selected at random as the third policy action of that training virtual object, and the third policy action of each training virtual object is output.
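A small sketch of this selection rule, choosing the highest-probability action and breaking ties uniformly at random, is given below; it is purely illustrative.

```python
import numpy as np

def pick_third_policy_action(action_probs: np.ndarray) -> int:
    """Return the index of the most probable action; break ties uniformly at random."""
    best = np.flatnonzero(action_probs == action_probs.max())
    return int(np.random.choice(best))
```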
In yet another alternative embodiment of the present invention, step 54 may comprise:
step 541, filtering an illegal action in the third policy action of each training virtual object to obtain a filtered action of each training virtual object, wherein the illegal action is an action violating a preset rule;
and 542, generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
As shown in fig. 7, in this embodiment the basic actions extracted from the first policy action are retained first, and higher-order actions are then generated from the basic actions according to environmental factors so as to accelerate exploration. During exploration, illegal actions are filtered out automatically to reduce the exploration of illegal actions, which improves data validity. The action index output by the network is converted into an action with actual meaning and returned in a format that the engine can parse, which facilitates interaction with the environment.
Specifically, a dynamic frame-skipping mechanism is added to the action conversion, and a valid-action mask mechanism is added at the network output: the mask value of an illegal action is preset to 0, and the action probabilities predicted by the network are multiplied by the action mask, so that the product for any illegal action is 0 and the illegal action is filtered out directly, yielding the fourth policy action of the virtual object. Taking the instant confrontation scene shown in fig. 2 as an example, when a virtual object attempts an action that requires a weapon while the weapon is cooling down or being switched, the physical characteristics of the environment make the action impossible for the current virtual object; the mask value of that action is therefore preset to 0, and multiplying the action probabilities predicted by the network by the mask filters the illegal action out directly.
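A minimal sketch of this valid-action mask follows: illegal actions receive mask value 0, the predicted probabilities are multiplied by the mask, and the result is renormalized before an action is chosen. The renormalization step and fallback are assumptions of this illustration.

```python
import numpy as np

def apply_action_mask(action_probs: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Multiply predicted action probabilities by a 0/1 mask so illegal actions become 0."""
    masked = action_probs * action_mask            # illegal actions now have probability 0
    total = masked.sum()
    if total > 0:
        return masked / total
    return action_mask / action_mask.sum()         # fallback: uniform over legal actions

# e.g. an action that needs a weapon is masked out while the weapon is cooling down or switching:
# probs = apply_action_mask(np.array([0.5, 0.3, 0.2]), np.array([1, 0, 1]))
```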
Fig. 8 is a flowchart of another action generation method for a virtual object according to an embodiment of the present invention. As shown in fig. 8, the instant confrontation scene shown in fig. 2 is used as the modeling example, but the method is not limited to a two-sided instant confrontation scene and can also be applied to a multi-sided instant confrontation scene.
First, state features are extracted from the interactive environment. The state features of each virtual object of a confronting side are encoded and then shared to form a unified situation feature;
Second, a centrally controlled neural network structure is designed, which can effectively achieve coordination and consistency among the virtual objects. The input is the shared state feature of one side; the length of the output equals the number of that side's virtual objects, and the length of the decision data for each virtual object equals the number of designed decision actions. To effectively filter out illegal actions, an action filtering module is added before the network output;
Third, an action conversion module is designed to convert the actions output by the network into the action format required by the environment;
Fourth, a reward module is designed. Because the environment has long delays, rewards are sparse; the per-step reward value is therefore designed according to the final winning condition of the whole scene, and all of one side's virtual objects share the reward;
Fifth, the data generated by interaction with the environment (such as state, action, reward, and next-moment state, but not limited to these) is stored in the data buffer;
Sixth, a deep reinforcement learning Actor-Critic network architecture is used to continuously obtain data from the buffer for policy training.
Based on the flow of the first to sixth steps, the policy can be learned iteratively and continuously, finally achieving policy optimization in the instant confrontation scene.
In the other action generation method for a virtual object shown in fig. 8, policy optimization in an instant confrontation scene is achieved by designing modules such as state-feature encoding and sharing, a central network structure, valid-action filtering, reward value design, a data buffer, and a training algorithm. This achieves goals such as situation sharing among agents, reward sharing, action coordination, and illegal-action filtering, and solves sequential decision-making problems in instant confrontation scenes such as decision coordination among virtual objects, decision action constraints, and unobservable states.
Fig. 9 shows an explanatory diagram of central control provided by an embodiment of the present invention, where the central control refers to a manner used by a neural network in an embodiment of the present invention to control each virtual object in a lower layer, and an understanding of the central control provided by an embodiment of the present invention is described below with reference to fig. 9 by taking the instant confrontation scenario shown in fig. 2 as an example:
1. the lower layer is controlled in a centralized way: in the countermeasure, an upper commander controls the forces of the lower layer, namely each virtual object, and the central mode is embodied as that the upper layer centrally controls each virtual object of the lower layer.
2. Sharing situation: after each virtual object carries out decision-making action on the environment, the environment returns to the state, and each virtual object shares the state of the environment as a uniform situation, which can be understood as all own situations observed by a command officer at the upper layer.
3. A central network: the unified network structure is adopted, so that the decision idea of a commander on the upper layer can be understood, the currently observed global situation is input, and the decision action of each force on the lower layer is output. Therefore, the unified network structure design can be understood as that the commander thinks the whole target of the game when deciding each force action.
4. Each virtual object awards the same, and the states and actions are different: the reward is designed as one, i.e. the final target index of the game is the same for all virtual objects, so that the indexes are shared by all the lower-layer virtual objects, and the optimization direction is consistent in the interaction with the environment and the subsequent model optimization.
5. And (3) centralized training algorithm: after decision actions of each force are output through the same network, trajectory data (state, action, reward and next moment state) are continuously generated by interaction with the environment, the trajectory data are collected and then are uniformly trained, training targets are consistent, therefore, network optimization directions are consistent, and action cooperativity of each virtual object output by the network is stronger and stronger in continuous updating of new associations.
In the embodiment of the present invention, the global situation of the confrontation is formed by uniformly concatenating the feature codes of all of one side's virtual objects, so situation sharing among all of that side's virtual objects can be achieved. The same reward is designed for all virtual objects, and all virtual objects share the reward designed according to the final objective of the game, so the optimization directions remain consistent in subsequent model optimization. A centrally controlled network structure takes the shared situation features as input and outputs the decision action of each virtual object, thereby achieving action coordination among the virtual objects. Filtering out illegal actions before the network outputs decision actions improves model training efficiency. Storing the data generated by each virtual object's interaction with the environment in the data cache pool for policy learning allows the neural network to make fuller use of richer and more effective data, continuously optimize the policy, and generate better policy actions.
Fig. 10 is a schematic structural diagram illustrating an apparatus 100 for generating a virtual object according to an embodiment of the present invention. As shown in fig. 10, the apparatus includes:
an obtaining module 101, configured to obtain feature information of a plurality of virtual objects, where the plurality of virtual objects belong to a same group;
a processing module 102, configured to map feature information of the plurality of virtual objects into feature information of a total virtual object; obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object; generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
a control module 103, configured to control each virtual object to execute the corresponding second policy action.
Optionally, the processing module 102 is further configured to extract initial feature information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the characteristic information of the plurality of virtual objects to obtain the characteristic information of the total virtual object, wherein the characteristic information of the total virtual object represents the characteristics of the total virtual object from three dimensions.
Optionally, the processing module 102 is further configured to obtain an overall policy according to the feature information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
Optionally, the processing module 102 is further configured to extract at least one basic action included in each of the first policy actions, where the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
Optionally, the processing module 102 is further configured to input feature information of the plurality of virtual objects into a trained neural network;
and storing the operation data generated by the process of obtaining the second strategy action.
Optionally, the processing module 102 is further configured to use pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
Optionally, the processing module 102 is further configured to generate, by the neural network to be optimized, at least two actions of each training virtual object and a probability of each action of the at least two actions based on the feature information of the training sample;
and corresponding to each training virtual object, taking the action with the highest probability in at least two actions corresponding to the corresponding training virtual object as a third strategy action of the training virtual object.
Optionally, the processing module 102 is further configured to filter an illegal action in the third policy action of each training virtual object, so as to obtain a filtered action of each training virtual object;
and generating a fourth strategy action of the corresponding training virtual object according to the filtered action of each training virtual object.
It should be noted that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and all the implementations in the above method embodiment are applicable to this apparatus embodiment, and the same technical effects can be achieved.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute the method for generating the action of the virtual object in any method embodiment described above.
Fig. 11 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 11, the computing device may include: a processor (processor), a Communications Interface (Communications Interface), a memory (memory), and a Communications bus.
Wherein: the processor, the communication interface, and the memory communicate with each other via the communication bus. The communication interface is used for communicating with network elements of other devices, such as clients or other servers. The processor is used for executing a program, and in particular can execute the relevant steps in the above embodiments of the action generation method for a virtual object as applied to the computing device.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program may specifically be configured to cause a processor to execute the method for generating an action of a virtual object in any of the method embodiments described above. For specific implementation of each step in the program, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing embodiment of the method for generating an action of a virtual object, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (9)

1. A method for generating a motion of a virtual object, the method comprising:
acquiring characteristic information of a plurality of virtual objects, wherein the plurality of virtual objects belong to the same group;
mapping the characteristic information of the plurality of virtual objects into the characteristic information of a total virtual object;
obtaining a first policy action of each virtual object in the plurality of virtual objects according to the characteristic information of the total virtual object;
generating a second policy action of the corresponding virtual object according to the first policy action of each virtual object;
controlling each virtual object to execute the corresponding second policy action;
after acquiring the feature information of the plurality of virtual objects, the method further comprises the following steps:
inputting the characteristic information of the plurality of virtual objects into a trained neural network;
after generating a second policy action for the corresponding virtual object according to the first policy action for each virtual object, the method further includes:
storing operation data generated in the process of obtaining the second policy action;
the neural network is obtained by training the following method:
taking pre-stored operation data as a training sample;
extracting characteristic information of the training sample;
inputting the characteristic information of the training sample into a neural network to be optimized, wherein the neural network to be optimized outputs a third strategy action of each training virtual object;
converting the third strategy action of each training virtual object to obtain a fourth strategy action of each training virtual object;
rewarding each fourth policy action to obtain a reward value of each fourth policy action;
sharing the reward value of each fourth strategy action to obtain a shared reward value, wherein the shared reward value is used for representing the effectiveness degree of the corresponding fourth strategy action;
and adjusting the parameters of the neural network to be optimized according to the shared reward value to obtain the neural network.
2. The method for generating motion of a virtual object according to claim 1, wherein mapping feature information of the plurality of virtual objects to feature information of one total virtual object includes:
extracting initial characteristic information of the plurality of virtual objects;
respectively carrying out normalization processing on the initial characteristic information of the plurality of virtual objects to obtain the characteristic information of the plurality of virtual objects;
and synthesizing the plurality of normalized feature information to obtain feature information of the total virtual object, wherein the feature information of the total virtual object represents the features of the total virtual object from three dimensions.
3. The method according to claim 1, wherein obtaining the first policy action of the plurality of virtual objects according to the feature information of the total virtual object includes:
obtaining an overall strategy according to the characteristic information of the total virtual object;
and splitting the whole strategy based on the relationship between the plurality of virtual objects and the total virtual object to obtain a first strategy action of each virtual object.
4. The method according to claim 1, wherein generating the second policy action of each virtual object according to the first policy action of the virtual object comprises:
extracting at least one basic action contained in each first strategy action, wherein the basic action is a preset action;
and generating a second policy action corresponding to each virtual object according to the basic action corresponding to each virtual object, wherein the second policy action comprises a corresponding basic action.
5. The method according to claim 1, wherein the neural network to be optimized outputting a third policy action of each training virtual object comprises:
generating, by the neural network to be optimized, at least two actions of each training virtual object and a probability of each of the at least two actions based on the feature information of the training samples;
and for each training virtual object, taking the action with the highest probability among the at least two actions corresponding to that training virtual object as the third policy action of the training virtual object.
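A compact sketch of the selection rule in claim 5: raw network scores are turned into per-action probabilities (softmax is an assumption) and the highest-probability action becomes each training virtual object's third policy action.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def third_policy_actions(raw_scores):
    """raw_scores: (n_objects, n_candidate_actions) network outputs.
    Convert to probabilities and pick the most probable action per object."""
    probs = softmax(np.asarray(raw_scores, dtype=np.float64))
    return probs.argmax(axis=1), probs

actions, _ = third_policy_actions([[2.0, 0.5, 0.1], [0.2, 1.5, 1.4]])
print(actions)  # [0 1]
```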
6. The method according to claim 1, wherein converting the third policy action of each training virtual object to obtain the fourth policy action of each training virtual object comprises:
filtering out illegal actions from the third policy action of each training virtual object to obtain a filtered action of each training virtual object;
and generating the fourth policy action of the corresponding training virtual object according to the filtered action of each training virtual object.
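Claim 6's filtering step resembles the common action-masking trick; the sketch below assumes a binary legality mask supplied by the environment and a no-op fallback when every candidate is masked out — both assumptions beyond what the claim states.

```python
import numpy as np

def fourth_policy_action(action_probs, legal_mask, noop=0):
    """Zero the probabilities of illegal actions and pick the best of the
    remaining (filtered) actions as the fourth policy action."""
    masked = np.asarray(action_probs, dtype=np.float64) * np.asarray(legal_mask)
    if masked.sum() == 0.0:
        return noop                     # every candidate was illegal
    return int(masked.argmax())

print(fourth_policy_action([0.6, 0.3, 0.1], legal_mask=[0, 1, 1]))  # 1
```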
7. An apparatus for generating an action of a virtual object, the apparatus comprising an acquisition module, a processing module and a control module, wherein:
the acquisition module is configured to acquire feature information of a plurality of virtual objects, the plurality of virtual objects belonging to the same group;
the processing module is configured to map the feature information of the plurality of virtual objects to feature information of one total virtual object; obtain a first policy action of each virtual object in the plurality of virtual objects according to the feature information of the total virtual object; generate a second policy action of the corresponding virtual object according to the first policy action of each virtual object; input the feature information of the plurality of virtual objects into a trained neural network; acquire and store operation data generated while the second policy actions are executed; take pre-stored operation data as training samples; extract feature information of the training samples; input the feature information of the training samples into a neural network to be optimized, wherein the neural network to be optimized outputs a third policy action of each training virtual object; convert the third policy action of each training virtual object to obtain a fourth policy action of each training virtual object; assign a reward to each fourth policy action to obtain a reward value of each fourth policy action; share the reward values of the fourth policy actions to obtain a shared reward value, wherein the shared reward value represents how effective the corresponding fourth policy action is; and adjust parameters of the neural network to be optimized according to the shared reward value to obtain the trained neural network;
and the control module is configured to control each virtual object to execute the corresponding second policy action.
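Purely as an illustration of the module split named in claim 7 (acquisition, processing, control), not of the patented logic, the three modules might be wired together as follows; the object and action representations are placeholders.

```python
class AcquisitionModule:
    def acquire(self, objects):
        # each virtual object is assumed to expose a 'features' entry
        return [o["features"] for o in objects]

class ProcessingModule:
    def process(self, feature_list):
        # placeholder for mapping features -> first -> second policy actions
        return ["hold" for _ in feature_list]

class ControlModule:
    def control(self, objects, actions):
        for obj, act in zip(objects, actions):
            obj["last_action"] = act    # stand-in for executing the action
        return objects

group = [{"features": [0.1, 0.2]}, {"features": [0.3, 0.4]}]
acts = ProcessingModule().process(AcquisitionModule().acquire(group))
print(ControlModule().control(group, acts))
```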
8. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction which, when executed, causes the processor to perform the method for generating an action of a virtual object according to any one of claims 1 to 6.
9. A computer storage medium having stored therein at least one executable instruction which, when executed, causes a computing device to perform the method for generating an action of a virtual object according to any one of claims 1 to 6.
CN202210048175.4A 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object Active CN114053712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210048175.4A CN114053712B (en) 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object

Publications (2)

Publication Number Publication Date
CN114053712A (en) 2022-02-18
CN114053712B (en) 2022-04-22

Family

ID=80231091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210048175.4A Active CN114053712B (en) 2022-01-17 2022-01-17 Action generation method, device and equipment of virtual object

Country Status (1)

Country Link
CN (1) CN114053712B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808172B (en) * 2024-02-29 2024-05-07 佛山慧谷科技股份有限公司 Automatic stone material discharging method and device, electronic equipment and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749041A (en) * 2019-10-29 2021-05-04 中国移动通信集团浙江有限公司 Virtualized network function backup strategy self-decision method and device and computing equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110404264B (en) * 2019-07-25 2022-11-01 哈尔滨工业大学(深圳) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
CN111589166A (en) * 2020-05-15 2020-08-28 深圳海普参数科技有限公司 Interactive task control, intelligent decision model training methods, apparatus, and media
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113926181A (en) * 2021-10-21 2022-01-14 腾讯科技(深圳)有限公司 Object control method and device of virtual scene and electronic equipment
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent

Also Published As

Publication number Publication date
CN114053712A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
KR20210028728A (en) Method, apparatus, and device for scheduling virtual objects in a virtual environment
CN112801290B (en) Multi-agent deep reinforcement learning method, system and application
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
Ma et al. Contrastive variational reinforcement learning for complex observations
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114053712B (en) Action generation method, device and equipment of virtual object
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN114741886A (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115860107A (en) Multi-machine search method and system based on multi-agent deep reinforcement learning
CN113947022B (en) Near-end strategy optimization method based on model
CN113313209A (en) Multi-agent reinforcement learning training method with high sample efficiency
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
Waris et al. Evolving deep neural networks with cultural algorithms for real-time industrial applications
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
CN117648585B (en) Intelligent decision model generalization method and device based on task similarity
CN114037048B (en) Belief-consistent multi-agent reinforcement learning method based on variational circulation network model
CN116842761B (en) Self-game-based blue army intelligent body model construction method and device
Zhu et al. Deep residual attention reinforcement learning
CN116663803A (en) Multi-unmanned aerial vehicle task allocation method and system considering uncertainty information
CN116360435A (en) Training method and system for multi-agent collaborative strategy based on plot memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant