CN111260040B - Video game decision method based on intrinsic rewards - Google Patents
Video game decision method based on intrinsic rewards
- Publication number
- CN111260040B (application CN202010370070.1A)
- Authority
- CN
- China
- Prior art keywords
- intrinsic
- reward
- network
- video game
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/045—Combinations of networks; G06N3/08—Learning methods
- A—HUMAN NECESSITIES; A63—SPORTS, GAMES, AMUSEMENTS; A63F—Card, board, or roulette games; video games; A63F13/00—Video games; A63F13/45—Controlling the progress of the video game; A63F13/46—Computing the game score
Abstract
The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps: S1, acquire a video game simulation environment; S2, construct a neural network model; S3, design an intrinsic reward model; S4, combine the intrinsic reward model with the constructed neural network model structure; S5, obtain game records through the simulation environment; S6, update the neural network model with the acquired game records; and S7, cyclically train the neural network model until convergence. Beneficial effect of the invention: the common lack of environment-feedback reward values in three-dimensional scenes is effectively addressed.
Description
Technical Field
The invention relates to a video game decision method, in particular to a video game decision method based on intrinsic rewards.
Background
Since video games first appeared in the early 1970s, using artificial intelligence to realize automatic decision making by an agent inside a video game has been a research hotspot in both industry and academia, and it carries great commercial value. In recent years, the rapid development of deep reinforcement learning has provided an effective way to realize this technology. Generally speaking, the quality of a game decision-making technique is judged by the score achieved in the game or by whether a match can be won, and this holds for video games as well.
Deep reinforcement learning algorithms applied to complex game scenes have the advantage of being end-to-end: the agent's action policy is learned so that the mapping from the input game state to the output action is completed directly, which provides a general algorithmic framework for solving a variety of game tasks. The Actor-Critic algorithm is a representative example. In deep reinforcement learning algorithms built on the Actor-Critic framework, the common approach for training game agents is to first extract features from the game state with a convolutional network, then learn the agent's action policy with the Actor network, evaluate and improve the policy with the Critic network, and iterate the training until convergence. However, in a number of Atari video game scenes, agents built on this framework struggle to learn a policy that efficiently collects environmental rewards. What these scenes have in common is that the environment is complex and reward feedback is hard to obtain directly; the agent often needs a long sequence of decisions, or extensive historical information, before it performs an action that yields a positive reward value. The underlying reason is that the Actor-Critic algorithm essentially combines value iteration with the policy gradient method, and the policy gradient method must sample trajectories from the agent's interaction to update the policy. If sufficient sampled trajectories are lacking, or their quality is poor, the policy gradient optimization is impaired and the agent cannot learn a correct, efficient policy. In the three-dimensional video game Vizdoom, the agent can only perceive the small part of the environment within its line of sight, while mazes, traps and other design mechanisms in the scene hinder exploration and reward collection. Because reward feedback is sparse, high-value actions make up only a small fraction of the sampled trajectories, positive-reward trajectories are missing during policy-gradient training, and the variance over the whole training process is high. By introducing the value model of the value-iteration method and estimating trajectory values with a value network, the Actor-Critic algorithm can in theory alleviate the high variance of the policy gradient method; in practice, however, training in Vizdoom scenes still suffers from oscillating, insufficiently stable policy updates. In Vizdoom scenes where environmental reward feedback is sparse, the lack of reward signals either prevents the algorithm from updating the policy at all or prevents it from converging because of large oscillations during training. Therefore, when deep reinforcement learning is applied to the three-dimensional video game Vizdoom, the common lack of environment-feedback reward values in three-dimensional scenes remains a problem.
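For reference, the advantage-based actor-critic update that this framework relies on can be written in its standard textbook form (not reproduced from the patent's figures; here π_θ is the Actor's policy, V_φ the Critic's value estimate and γ the discount factor):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}
    \big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t,a_t)\big],
\qquad
\hat{A}(s_t,a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
```

When the environment reward r_t is almost always zero, as in the sparse-reward Vizdoom scenes described above, the advantage estimate carries little signal, which is exactly the failure mode the intrinsic reward introduced in the following sections is designed to compensate for.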
Disclosure of Invention
To solve the problems in the prior art, the present invention provides a video game decision method based on intrinsic rewards.
The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the constructed neural network model structure;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
Beneficial effects of the invention: through this scheme, the common lack of environment-feedback reward values in three-dimensional scenes is effectively addressed.
Drawings
FIG. 1 is the overall flow chart of the intrinsic-reward-based video game decision method of the invention.
FIG. 2 is a diagram of the Vizdoom simulation environment used by the method.
FIG. 3 is a block diagram of the neural network used to solve the video game with deep reinforcement learning.
FIG. 4 is a diagram of the intrinsically motivated reinforcement learning model.
FIG. 5 is a block diagram of the intrinsic reward generation module.
FIG. 6 is the network architecture of the target mapping network and the prediction network.
FIG. 7 is a flow chart of the intrinsic reward generation mechanism.
FIG. 8 is a schematic diagram of the differentiated reward fusion method.
FIG. 9 shows the modified value network structure.
FIG. 10 is a flow chart of the intrinsic reward policy optimization algorithm.
FIG. 11 is a diagram of the path-finding scenario on the Vizdoom platform.
FIG. 12 compares the training effect of the IBPO algorithm.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
The method applies deep reinforcement learning combined with an intrinsic reward mechanism to build a decision model with a certain level of intelligence, so that a game agent obtains high scores in a video game; this is the core content of the invention.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. (1) To address the common lack of environment-feedback reward values in three-dimensional scenes, the invention proposes an intrinsic reward model. (2) By fusing intrinsic and external rewards in a differentiated way, an intrinsic reward policy optimization algorithm is proposed.
As shown in FIG. 1, the intrinsic-reward-based video game decision method comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the constructed neural network model structure;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. To address the high-dimensional state space and incomplete information perception in video game play, a deep reinforcement learning method based on an intrinsic reward policy optimization algorithm is proposed. First, for the common lack of environment-feedback reward values in three-dimensional scenes, the invention proposes an intrinsic reward model: the intrinsic reward values produced by a target mapping network and a prediction network make up for the missing environment-feedback rewards and assist the agent in updating its policy. Second, considering the structural differences between the intrinsic reward model and the conventional policy optimization algorithm, the value network structure is adjusted so that the two can be fused, yielding the intrinsic reward policy optimization algorithm and improving the agent's behaviour in sparse-reward three-dimensional scenes. FIG. 1 shows the overall flow of the intrinsic-reward-based video game decision method.
The specific process of the intrinsic-reward-based video game decision method provided by the invention is as follows:
1. Acquiring and installing the video game simulation environment;
In recent years, deep reinforcement learning (DRL) has grown increasingly popular along with deep learning. New reinforcement learning research platforms have consequently sprung up, gradually expanding from simple toy scenes to 3D mazes, first-person shooting games, real-time strategy games, complex robot-control scenes and more. For example, Vizdoom supports developing AI bots that play DOOM using only visual information (the screen buffer). It is mainly used for machine visual learning, and in particular for deep reinforcement learning research. The Vizdoom simulated game environment is obtained and installed from the official Vizdoom website, as shown in FIG. 2.
2. Constructing a neural network;
FIG. 3 shows the network architecture for solving a video game with deep reinforcement learning. The input of the model is each frame image of the video game, the output is the corresponding in-game operation, and the parameters of the intermediate layers constitute the policy that must be trained with deep reinforcement learning. The invention collects data by letting the agent make decisions in the simulation environment, and optimizes the agent's policy with deep learning algorithms based on the collected state-action pairs. How to train a good model based on the characteristics of the video game is the key to the agent's performance and is the core innovation of the invention.
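The following is a minimal PyTorch sketch of such a network, assuming 84x84 single-channel frames and a discrete action set; the layer sizes and the action count are illustrative assumptions, and the authoritative architecture is the one shown in FIG. 3.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared convolutional encoder with policy (actor) and value (critic) heads.

    The 84x84 grayscale input, layer sizes and action count are assumptions for
    illustration only; the actual architecture is defined by the patent's FIG. 3.
    """
    def __init__(self, n_actions: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.actor = nn.Linear(512, n_actions)   # action logits
        self.critic = nn.Linear(512, 1)          # state-value estimate

    def forward(self, frames: torch.Tensor):
        h = self.encoder(frames)                 # frames: (B, 1, 84, 84)
        return self.actor(h), self.critic(h)
```

A forward pass maps a batch of frames to action logits (the Actor output) and a state-value estimate (the Critic output), matching the state-to-action mapping described above.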
3. Designing the intrinsic reward generation mechanism for the video game;
A general reinforcement learning model contains only two kinds of entity, the environment and the agent, and all reward signals originate from the environment the agent is in, so it is not easy to generate additional reward signals within this model. To address this limitation, the invention uses the concept of intrinsic rewards: a corresponding intrinsic reward mechanism generates auxiliary reward information that helps the agent update its policy when environmental reward information is missing. A reinforcement learning model that includes intrinsic reward benefits is defined as an intrinsically motivated reinforcement learning (IMRL) model. A general model of intrinsically motivated reinforcement learning is shown in FIG. 4.
The invention designs the following intrinsic reward generation module: a Target Mapping Network and a Prediction Network with identical structures are defined; both perform feature extraction and state mapping on the input three-dimensional state picture to obtain corresponding embedding vectors, and the similarity between the two vectors gives the value of the intrinsic reward. The target mapping network is defined by formula (3-1) as the mapping from states to target embedding vectors, and the prediction network is defined by formula (3-2) as the mapping from states to prediction embedding vectors.
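The patent gives formulas (3-1) and (3-2) as images; a plausible reconstruction based on the surrounding description (the symbols f-hat, f, θ_T, θ_P and the embedding dimension k are assumed notation, not taken from the original) is:

```latex
\hat{f}(s_t;\theta_T)\colon \mathcal{S}\rightarrow\mathbb{R}^{k},
\qquad e_T = \hat{f}(s_t;\theta_T) \tag{3-1}

f(s_t;\theta_P)\colon \mathcal{S}\rightarrow\mathbb{R}^{k},
\qquad e_P = f(s_t;\theta_P) \tag{3-2}
```

where s_t is the input state picture, θ_T and θ_P are the parameters of the target mapping network and the prediction network, and e_T, e_P are the target and prediction embedding vectors of the same dimension k.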
The designed intrinsic reward generation module is shown in fig. 5.
In the intrinsic reward generation module, the embedding vectors output by the target mapping network and the prediction network have the same dimension. The target mapping network can be initialized either by policy pre-training or by random initialization; the invention initializes it by letting a deep-reinforcement-learning agent with a random policy explore the environment for a limited number of steps. The prediction network is then trained on samples generated by the interaction between the policy-optimization agent and the environment, minimizing the mean squared error between the prediction network and the target mapping network on the generated sample data. The main aim of this approach is twofold: fixing the optimization target reduces the error caused by fitting randomness, and using a simpler target mapping network initialized with a random policy reduces the error caused by the complexity of the objective function. The network structure of the target mapping network and the prediction network is shown in FIG. 6.
Both the target mapping network and the prediction network of the intrinsic reward generation module have the network structure shown in FIG. 6: the input state picture from the three-dimensional scene passes through a three-layer convolutional neural network for feature extraction and is finally output as a fixed-dimension vector representation. Using the same network structure for both reduces the error that structural differences would otherwise introduce into the vector-similarity computation. The loss function of the intrinsic reward generation module is defined as the mean squared error between the two output vectors on the generated sample data.
the similarity of the output vectors of the target mapping network and the prediction network is used as the magnitude of the generated intrinsic reward value in the intrinsic reward generation module, and the consideration is that the intelligent agent tends to perform exploration behaviors in a sparse reward environment. In the initial stage of the training of the intelligent agent, the activity range of the intelligent agent in the three-dimensional scene is smaller, the number of states is larger, the similarity of the output vectors of the two networks is smaller, the calculated internal reward value is larger, and the intelligent agent mainly plays back the reward source in the data tuple by taking the internal reward signal as the reward source of the experience when the self action strategy is updated. In addition, because the environmental information faced by the agent in different three-dimensional scenes or different exploration moments of the same scene is not very same, if the normalization processing is not performed on the internal reward value and the input environmental state information, the variation range of the corresponding numerical value is too large, which is not beneficial to the super-parameter selection and the input information characterization in the training process. Therefore, the intrinsic reward value and the environmental status information need to be normalized. In addition, training of the predictive network requires the collection of agents and environmentsThe interaction sample, and the action output of the agent in the environment, is derived from the policy network, so the generation of the intrinsic reward signal during the training process is also related to the policy optimization algorithm. To be provided withAs an action strategy for the strategy optimization algorithm agent, an intrinsic reward generation algorithm for a single training round is given, as shown in fig. 7.
The intrinsic reward generation algorithm is as follows:
Input:
Output:
1) Initialize the parameters;
7) End the loop;
For j ∈ [1, E], execute the following steps in a loop:
End the loop;
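A sketch of one training round of this algorithm is shown below, reusing the IntrinsicRewardModule and ActorCriticNet sketched earlier. The Gym-style environment interface, the frame preprocessing into (1, 1, 84, 84) tensors and the argument names are assumptions, and the numbered steps elided above are represented only at this level of detail.

```python
import torch

def intrinsic_reward_round(env, policy, irm, n_steps: int, n_epochs: int):
    """One round: act with the policy-optimization policy, attach an intrinsic
    reward to every visited state, then fit the prediction network for E epochs
    on the collected states (irm is the IntrinsicRewardModule sketched above)."""
    states, records = [], []
    s = env.reset()                                  # preprocessed (1, 1, 84, 84) tensor
    for _ in range(n_steps):
        logits, _ = policy(s)                        # actor-critic net from section 2
        a = torch.distributions.Categorical(logits=logits).sample().item()
        s_next, r_ext, done, _ = env.step(a)         # Gym-style interface (assumption)
        r_int = float(irm.intrinsic_reward(s))       # auxiliary reward for sparse scenes
        records.append((s, a, r_ext, r_int, s_next, done))
        states.append(s)
        s = env.reset() if done else s_next
    batch = torch.cat(states, dim=0)
    for _ in range(n_epochs):                        # j in [1, E]: train the prediction network
        irm.update(batch)
    return records
```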
4. Combining the intrinsic reward with the established neural network and applying to the video game;
The training of the intrinsic reward mechanism's prediction network must be based on the agent's action data samples, and the agent's behaviour in the environment is driven by the policy optimization algorithm, so a complete intrinsic reward algorithm has to be combined with a policy gradient or policy optimization algorithm. The invention combines an improved policy optimization algorithm with the intrinsic reward mechanism and proposes Intrinsic Based Policy Optimization (IBPO). Because the policy optimization algorithm is built on the general reinforcement learning model, which differs from the intrinsically motivated reinforcement learning model that introduces an intrinsic reward mechanism, their structural differences must be analysed so that the two can be adapted to each other. In the policy optimization algorithm, the policy update depends on the action-policy probability ratio and the estimation function in the objective function, and the objective function computes both from the reward value, so the key to adapting the policy optimization algorithm to the intrinsic reward mechanism is how the external and intrinsic rewards are handled. In the conventional policy optimization algorithm, the reward used for the policy update contains only the external reward; after the intrinsic reward mechanism is introduced, the way the intrinsic reward is combined with the policy optimization algorithm, that is, the way intrinsic and external rewards are combined, must be considered. The invention adopts a combination of a long-term intrinsic reward and a round-based external reward, as shown in FIG. 8.
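The probability-ratio objective referred to above can be written, in its standard clipped-surrogate form, as follows (a textbook formulation; the patent does not reproduce the formula, and the additive combination of the two advantage terms is an assumption in the spirit of the differentiated fusion of FIG. 8):

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
\hat{A}_t = \hat{A}_t^{\,ext} + \hat{A}_t^{\,int}
```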
The original purpose of the intrinsic reward is to increase the agent's capacity to explore the environment. If the intrinsic reward value were reset at the beginning of each training round of the policy optimization algorithm, i.e. accumulated on a per-round basis, then as soon as the agent took an action that ended the current round the intrinsic reward would drop to zero, policy iteration and updating could no longer proceed from the intrinsic reward, and the agent would lean towards a conservative policy rather than exploration, contrary to the purpose of introducing intrinsic rewards in the first place. The intrinsic reward is therefore accumulated continuously across training rounds. The external reward, by contrast, is cleared at the end of each training round in the usual way, which prevents the agent from continuing to accumulate external reward after taking an action that ends the round; the negative feedback of the environment then cannot cause a negative update of the policy gradient and distort the agent's action policy. In summary, the invention combines the long-term intrinsic reward with the round-based external reward.
After adopting the combination of a long-term intrinsic reward and a round-based external reward, the specific value network structure must be considered. The conventional policy optimization algorithm contains only external rewards and usually needs just a single Critic value network to process the input reward information; the newly introduced intrinsic reward clearly carries a different meaning from the conventional external reward, so the two kinds of reward information need to be distinguished at the level of the network model. Processing the two reward streams with fully separate value networks would require two independent neural networks to fit the Critic, doubling the parameter count and model complexity, lengthening training, and possibly even costing model accuracy. To stay as efficient as possible while fully respecting the meanings of the two reward streams, the invention uses a value network model with two heads to process the intrinsic and the external reward separately. With this design the two rewards can be given different discount factors, the cost function of the policy optimization algorithm gains additional supervisory information, and the difference between the intrinsic reward signal and the external reward information is made explicit. The resulting changes to the policy and value network structure are shown in FIG. 9.
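A sketch of the two-head value network and of the differentiated advantage fusion is given below; the discount factors and fusion weights are illustrative assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn

class TwoHeadCritic(nn.Module):
    """Value network with separate heads for the round-based external return and
    the long-term intrinsic return, so each reward stream can use its own
    discount factor (head sizes and gammas are assumptions)."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.v_ext = nn.Linear(feat_dim, 1)      # external reward, cleared every round
        self.v_int = nn.Linear(feat_dim, 1)      # intrinsic reward, accumulated across rounds
        self.gamma_ext, self.gamma_int = 0.999, 0.99

    def forward(self, features: torch.Tensor):
        return self.v_ext(features), self.v_int(features)

def combined_advantage(adv_ext: torch.Tensor, adv_int: torch.Tensor,
                       w_ext: float = 2.0, w_int: float = 1.0) -> torch.Tensor:
    """Differentiated fusion of the two advantage signals before the policy update."""
    return w_ext * adv_ext + w_int * adv_int
```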
With the reward information and the value network structure handled in this way, the value network of the policy optimization algorithm is adjusted to a structure similar to the value network in the intrinsically motivated reinforcement learning model, and the external and intrinsic reward signals in the experience replay pool are added through differentiated fusion into an overall reward signal, thereby combining the policy optimization algorithm with the intrinsic reward generation module. The overall model structure of the intrinsic reward policy optimization algorithm resembles that of the traditional actor-critic model: a shared convolutional network, a value network and a policy network form the main body of the model, respectively extracting feature information, estimating state values and outputting the agent's action probability distribution. The intrinsic reward policy optimization model differs from the traditional model in two main respects: first, its value network is structurally adjusted so that it can accommodate the intrinsically motivated reinforcement learning model's combination of external environmental reward information and intrinsic reward signals; second, it contains an additional intrinsic reward generation module that extracts features from the input state picture and computes the intrinsic reward value.
For an end-to-end intrinsic-reward policy-optimization agent, the main flow from perceiving the three-dimensional scene information to making an action decision is as follows: the input three-dimensional state picture is processed by the shared convolutional network and the intrinsic reward generation network, a stage whose main purpose is to perceive the environmental state information; it is then processed by the value network and the policy network respectively, a stage whose main purpose is to estimate the state value and output the action probability distribution. Updating the policy network's parameters corresponds to updating the action policy of the whole algorithm model, i.e. of the agent, and the parameter change is determined jointly by the intrinsic reward value, the external environmental reward value and the state value output by the value network. The intrinsic reward policy optimization algorithm effectively combines the policy optimization algorithm with the intrinsic reward generation module; its initialization is similar to that of the intrinsic reward generation algorithm, and the overall flow is shown in FIG. 10.
The intrinsic reward policy optimization algorithm is as follows:
Input:
Initialize the total number of iteration rounds, the policy update step size and the round step size;
Output:
Initialize the parameters and the target mapping network;
9) End the loop;
15) End the loop;
End the loop;
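The overall flow of FIG. 10 can be sketched as below, reusing the networks and helpers from the earlier code sketches. For brevity the policy is updated with a plain advantage-weighted log-likelihood step rather than the clipped surrogate given above, and all interfaces, hyper-parameters and the way the optimizer is assembled over the actor and critic parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def discounted_returns(rewards, dones, gamma: float) -> torch.Tensor:
    """Backward pass over one rollout; `dones` cuts the round-based external
    stream at round boundaries, while the long-term intrinsic stream passes an
    all-False list so it accumulates across rounds."""
    g, out = 0.0, []
    for r, d in zip(reversed(rewards), reversed(dones)):
        g = r + gamma * g * (0.0 if d else 1.0)
        out.append(g)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def train_ibpo(env, net, critic, irm, optimizer, n_rounds: int, rollout_len: int):
    """High-level IBPO loop: collect a rollout with intrinsic rewards, fit the
    prediction network, then update the policy and both value heads.
    The two-head critic replaces the single value head of the section-2 sketch."""
    for _ in range(n_rounds):
        recs = intrinsic_reward_round(env, net, irm, rollout_len, n_epochs=1)
        states = torch.cat([r[0] for r in recs], dim=0)
        actions = torch.tensor([r[1] for r in recs])
        ret_ext = discounted_returns([r[2] for r in recs], [r[5] for r in recs], critic.gamma_ext)
        ret_int = discounted_returns([r[3] for r in recs], [False] * len(recs), critic.gamma_int)
        feats = net.encoder(states)                      # shared convolutional network
        logits = net.actor(feats)
        v_ext, v_int = critic(feats)
        adv = combined_advantage(ret_ext - v_ext.squeeze(1).detach(),
                                 ret_int - v_int.squeeze(1).detach())
        logp = F.log_softmax(logits, dim=1).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = (-(logp * adv).mean()
                + F.mse_loss(v_ext.squeeze(1), ret_ext)
                + F.mse_loss(v_int.squeeze(1), ret_int))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```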
5. Using the constructed neural network and the obtained simulated game environment to interactively obtain game records;
Game images are generated by the video game simulation environment and fed into the neural network; the neural network produces a legal action and returns it to the simulation environment, while the value network produces the external value and the intrinsic reward generation network produces the intrinsic value; finally, the external and intrinsic values are fused in a differentiated way. The simulation environment, in turn, returns a score and the next image based on the action produced by the neural network. The variables generated above are merged into a game record.
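A game record of the kind described above might be represented as follows; the field names are assumptions, since the patent only states which quantities are merged into the record.

```python
from dataclasses import dataclass
import torch

@dataclass
class GameRecord:
    """One interaction step as stored in the experience replay pool."""
    state: torch.Tensor        # current game image after preprocessing
    action: int                # legal action returned to the simulation environment
    reward_ext: float          # score fed back by the simulation environment
    reward_int: float          # value produced by the intrinsic reward generation network
    value_ext: float           # external-value head output
    value_int: float           # intrinsic-value head output
    next_state: torch.Tensor   # next image returned by the simulation environment
    done: bool                 # whether the current round ended
```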
6. Updating the network according to the corresponding reinforcement learning algorithm by using the acquired game record;
Using the acquired game records, the neural network is updated according to the intrinsic reward policy optimization algorithm, and the training loop is repeated until convergence.
The video game decision method based on intrinsic rewards has the following beneficial effects:
1. Experimental setup
The three-dimensional video game Vizdoom under incomplete information is taken as the research object and test platform, on which the intrinsic reward policy optimization algorithm IBPO is implemented.
1.1 Introduction to the Vizdoom scenario
The Vizdoom platform is a first-person-perspective three-dimensional video game; an agent in its scenes acts much like an entity in the real world, receiving visual signals and then making action decisions. As a test platform for current mainstream deep reinforcement learning algorithms, Vizdoom provides interfaces for receiving action inputs and feeding back reward signals, simulating the environment of a reinforcement learning model. The platform offers comprehensive testing capability for training agents to explore a three-dimensional environment, and the experiments of the invention are based on its path-finding scenario.
The path-finding scenario is a sparse-reward scenario on the Vizdoom test platform. The whole map consists of several opaque rooms with different textures, and a fixed target position exists only in one particular room. The agent can move freely in the scenario, but since reward feedback exists only at the target object it receives no reward anywhere else; its starting position is far from the target position, and it must pass through rooms with different contents on the way. The path-finding scenario is shown in FIG. 11.
1.2 The experimental development environment is shown in Table 1
TABLE 1
1.3 Comparison with existing methods
(1) DFP: uses a high-dimensional sensory stream and a low-dimensional measurement stream for sensorimotor control in an immersive environment.
(2) DRQN: uses a modular architecture to handle the three-dimensional environment of a first-person shooter game, with game information provided by the simulator.
2. Results of the experiment
Deep reinforcement learning algorithms generally take the score output by the game simulation environment as the measure of agent performance; this differs slightly between game scenes but is an equivalent representation of the deep reinforcement learning reward value. In the path-finding scenario, the average reward value and the average number of action steps are used as evaluation indices. The average reward value is defined as the ratio, at the current training step, of the number of agents that can reach the target position within the prescribed number of steps to the number of all trained agents, i.e. the agents' path-finding success rate. The average number of action steps is defined as the mean number of action steps over 100 validation interactions in the path-finding scenario after the algorithm converges, and represents the stability of the agent's action policy. The performance of agents trained with the different algorithms in the path-finding scenario is shown in FIG. 12.
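The two evaluation indices can be computed as follows (a trivial sketch; the step limit and the 100-interaction validation protocol are those stated above):

```python
def average_reward_value(n_successful_agents: int, n_agents: int) -> float:
    """Path-finding success rate at the current training step: fraction of trained
    agents that reach the target position within the prescribed number of steps."""
    return n_successful_agents / n_agents

def average_action_steps(step_counts: list) -> float:
    """Mean number of action steps over 100 validation interactions after convergence;
    smaller values indicate a more stable action policy."""
    return sum(step_counts) / len(step_counts)
```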
In FIG. 12 the vertical axis is the average reward value obtained by the reinforcement learning agent in the path-finding scenario and the horizontal axis is the training time step. After training, the IBPO algorithm finally reaches an average reward value of 0.92, while the DRQN and DFP algorithms finally reach 0.79 and 0.86 respectively. The closer the average reward value is to 1, the more effectively the algorithm trains the agent to learn an action policy and the greater the chance of reaching the target position. The main reason for this result is the lack of reward feedback in the path-finding scenario: the intrinsic reward signal of the IBPO algorithm assists the agent's policy updates in such scenes and makes up for the missing positive reward values in the experience replay pool, so an exploration policy is learned faster. The comparative test demonstrates that the IBPO algorithm can train a reinforcement learning agent with comparatively strong exploration performance in the three-dimensional first-person path-finding scenario.
The average reward values and average numbers of action steps for the path-finding experiments are shown in Table 2. The average reward values correspond to FIG. 12, and the average number of action steps is the number of steps the agents trained by the different algorithms need to reach the target position. The IBPO algorithm performs best of the three with an average of 61.8 action steps; the smallest average number of action steps indicates that, on average, IBPO finds a path to the target position more quickly, i.e. it has the more stable action policy.
TABLE 2 Comparison of IBPO algorithm experimental data
Evaluation index | IBPO | DFP | DRQN |
---|---|---|---|
Average reward value | 0.92 | 0.86 | 0.79 |
Average number of action steps | 61.8 | 69.3 | 75.7 |
Combining the above analysis: the intrinsic reward value generated by the intrinsic reward generation module provides an auxiliary reward signal for the agent's policy updates during training, so that the agent can still learn an effective exploration policy in a sparse-reward scene. Compared with the DRQN and DFP algorithms, the IBPO algorithm is superior on both evaluation indices, the average reward value and the average number of action steps, and shows better overall performance.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (8)
1. A video game decision method based on intrinsic rewards, comprising the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence;
wherein,
step S3 includes an intrinsic reward generation algorithm as follows:
Input:
Output:
1) initializing the parameters;
7) i is increased by 1;
8) when j ∈ [1, E], executing steps 9) to 13) in a loop, otherwise executing step 14);
13) j is increased by 1;
2. The intrinsic-reward-based video game decision method as recited in claim 1, wherein step S3 includes: designing an intrinsic reward generation module, defining a target mapping network and a prediction network with identical structures, performing feature extraction and state mapping on the input three-dimensional state picture with the target mapping network and the prediction network to obtain corresponding embedding vectors respectively, and calculating the similarity of the two embedding vectors to obtain the value of the intrinsic reward.
3. The intrinsic-reward-based video game decision method as recited in claim 2, wherein in step S3 the target mapping network and the prediction network are defined by formula (3-1) and formula (3-2), respectively: the target mapping network is defined as the mapping from states to target embedding vectors, and the prediction network is defined as the mapping from states to prediction embedding vectors.
4. The intrinsic-reward-based video game decision method as recited in claim 3, wherein in step S3 the loss function of the intrinsic reward generation module is defined as the mean squared error between the output embedding vectors of the prediction network and the target mapping network.
5. The intrinsic-reward-based video game decision method as recited in claim 1, wherein in step S4 a combination of a long-term intrinsic reward and a round-based external reward is used.
6. The intrinsic-reward-based video game decision method as recited in claim 1, wherein step S4 includes an intrinsic reward policy optimization algorithm as follows:
Input:
Output:
(1) initializing the parameters and a target mapping network;
(2) when j' ∈ [1, J]:
(3) if the loop condition holds, executing steps (4) to (9) in a loop, otherwise executing step (10);
(9) increasing the value of k by 1;
(12) if the loop condition holds, executing steps (13) to (15) in a loop, otherwise executing step (16);
(15) increasing the value of l by 1;
7. The intrinsic-reward-based video game decision method as recited in claim 6, wherein step S5 includes: using the constructed neural network model and the obtained simulated game environment to interact and obtain game records; game images are generated by the video game simulation environment and input into the neural network model, the neural network model generates legal actions and returns them to the simulation environment, the value network generates external values and the intrinsic reward generation network generates intrinsic values, and finally the external values and intrinsic values are fused; meanwhile the simulation environment gives a score and the next image based on the actions generated by the neural network model.
8. The intrinsic-reward-based video game decision method as recited in claim 6, wherein step S6 includes: updating the neural network model according to the intrinsic reward policy optimization algorithm using the acquired game records.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010370070.1A CN111260040B (en) | 2020-05-06 | 2020-05-06 | Video game decision method based on intrinsic rewards |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010370070.1A CN111260040B (en) | 2020-05-06 | 2020-05-06 | Video game decision method based on intrinsic rewards |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111260040A CN111260040A (en) | 2020-06-09 |
CN111260040B (en) | 2020-11-06
Family
ID=70955207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010370070.1A Active CN111260040B (en) | 2020-05-06 | 2020-05-06 | Video game decision method based on intrinsic rewards |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111260040B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112221140B (en) * | 2020-11-04 | 2024-03-22 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for training action determination model of virtual object |
CN112818672A (en) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Reinforced learning emotion analysis system based on text game |
CN113392971B (en) * | 2021-06-11 | 2022-09-02 | 武汉大学 | Strategy network training method, device, equipment and readable storage medium |
CN113704979B (en) * | 2021-08-07 | 2024-05-10 | 中国航空工业集团公司沈阳飞机设计研究所 | Air countermeasure maneuvering control method based on random neural network |
CN113947022B (en) * | 2021-10-20 | 2022-07-12 | 哈尔滨工业大学(深圳) | Near-end strategy optimization method based on model |
CN116384469B (en) * | 2023-06-05 | 2023-08-08 | 中国人民解放军国防科技大学 | Agent policy generation method and device, computer equipment and storage medium |
CN117953351B (en) * | 2024-03-27 | 2024-07-23 | 之江实验室 | Decision method based on model reinforcement learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017218699A1 (en) * | 2016-06-17 | 2017-12-21 | Graham Leslie Fyffe | System and methods for intrinsic reward reinforcement learning |
US10423861B2 (en) * | 2017-10-16 | 2019-09-24 | Illumina, Inc. | Deep learning-based techniques for training deep convolutional neural networks |
CN109978133A (en) * | 2018-12-29 | 2019-07-05 | 南京大学 | A kind of intensified learning moving method based on action mode |
- 2020-05-06: CN application CN202010370070.1A granted as patent CN111260040B (en), status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109540150A (en) * | 2018-12-26 | 2019-03-29 | 北京化工大学 | One kind being applied to multi-robots Path Planning Method under harmful influence environment |
CN111062491A (en) * | 2019-12-13 | 2020-04-24 | 周世海 | Intelligent agent unknown environment exploration method based on reinforcement learning |
Non-Patent Citations (2)
Title |
---|
Intrinsically motivated model learning for developing curious robots; Todd Hester et al.; Artificial Intelligence; 2015-12-31; pp. 1-17 *
Online autonomous learning of robots in unknown environments driven by intrinsic motivation; Hu Qixiang et al.; Computer Engineering and Applications; 2014-12-31; Vol. 50, No. 4; pp. 110-113 *
Also Published As
Publication number | Publication date |
---|---|
CN111260040A (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111260040B (en) | Video game decision method based on intrinsic rewards | |
CN111282267B (en) | Information processing method, information processing apparatus, information processing medium, and electronic device | |
CN111111220B (en) | Self-chess-playing model training method and device for multiplayer battle game and computer equipment | |
CN112791394B (en) | Game model training method and device, electronic equipment and storage medium | |
CN111291890A (en) | Game strategy optimization method, system and storage medium | |
CN111260026B (en) | Navigation migration method based on meta reinforcement learning | |
CN113688977B (en) | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium | |
CN113947022B (en) | Near-end strategy optimization method based on model | |
CN105637540A (en) | Methods and apparatus for reinforcement learning | |
CN114757351B (en) | Defense method for resisting attack by deep reinforcement learning model | |
CN114330651A (en) | Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control | |
CN113894780B (en) | Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium | |
Chen et al. | Visual hide and seek | |
CN115185294B (en) | QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method | |
CN111282272A (en) | Information processing method, computer readable medium and electronic device | |
CN115964898A (en) | Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN115826621A (en) | Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning | |
CN114371729B (en) | Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback | |
CN118416460A (en) | Interpretable chess prediction method and system based on heterogeneous graph neural network | |
CN117539241A (en) | Path planning method integrating global artificial potential field and local reinforcement learning | |
CN115944921B (en) | Game data processing method, device, equipment and medium | |
CN111437605A (en) | Method for determining virtual object behaviors and hosting virtual object behaviors | |
CN116736729A (en) | Method for generating perception error-resistant maneuvering strategy of air combat in line of sight | |
CN116796844A (en) | M2 GPI-based unmanned aerial vehicle one-to-one chase game method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |