CN111260040B - Video game decision method based on intrinsic rewards - Google Patents

Video game decision method based on intrinsic rewards

Info

Publication number
CN111260040B
Authority
CN
China
Prior art keywords
intrinsic
reward
network
video game
value
Prior art date
Legal status
Active
Application number
CN202010370070.1A
Other languages
Chinese (zh)
Other versions
CN111260040A (en)
Inventor
王轩
漆舒汉
张加佳
曹睿
何志坤
刘洋
蒋琳
廖清
夏文
李化乐
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010370070.1A priority Critical patent/CN111260040B/en
Publication of CN111260040A publication Critical patent/CN111260040A/en
Application granted granted Critical
Publication of CN111260040B publication Critical patent/CN111260040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • A63F13/46Computing the game score
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps: S1, acquiring a video game simulation environment; S2, constructing a neural network model; S3, designing an intrinsic reward model; S4, combining the intrinsic reward model with the structure of the constructed neural network model; S5, obtaining game records through the simulation environment; S6, updating the neural network model with the acquired game records; and S7, cyclically training the neural network model until convergence. The beneficial effect of the invention is that the common lack of environment-feedback reward values in three-dimensional scenes is handled well.

Description

Video game decision method based on intrinsic rewards
Technical Field
The invention relates to a video game decision method, in particular to a video game decision method based on intrinsic rewards.
Background
Since video games first appeared in the early 1970s, realizing automatic decision making by agents in video games through artificial intelligence has been a research hotspot in both industry and academia, and it has great commercial value. In recent years, the rapid development of deep reinforcement learning methods has provided an effective way to realize this technology. Generally speaking, the quality of a game decision-making technique is judged entirely by the score obtained in the game or by whether the game can be won, and this is also the case for video games.
Deep reinforcement learning algorithms applied to complex game scenes have the advantage of being end to end: the agent's action policy is learned by the deep reinforcement learning algorithm, so the mapping from the input game state to the output action can be completed directly, which provides a universal algorithmic framework for solving various game tasks; the Actor-Critic algorithm is a representative example. In deep reinforcement learning algorithms that take the Actor-Critic algorithm as the basic framework, the common approach for training game-playing agents is to first extract features from the game state with a convolutional network, then learn the agent's action policy with the Actor network, evaluate and improve the policy with the Critic network, and iterate the training until convergence. However, in a few Atari video game scenes, agents built on this framework find it difficult to learn a policy that efficiently acquires the environment reward. What these scenes have in common is that the environment is complex and reward feedback is hard to obtain directly; the agent often needs a long series of action decisions, or must refer to more historical information, before it can take an action that yields a positive reward value. The reason is that the Actor-Critic algorithm essentially combines the value iteration method and the policy gradient method, and the policy gradient method must sample trajectories from the agent's interaction and update the policy from them; if sufficient sampled trajectories are lacking, or their quality is poor, the optimization of the policy gradient is affected and the agent cannot learn a correct and efficient policy. In the three-dimensional video game Vizdoom, the agent can only see the small part of the game scene within its line of sight, while a large number of mazes, traps, and other design mechanisms hinder exploration and reward acquisition. Because reward feedback is sparse, the proportion of high-return actions in the sampled trajectories is small, positive-reward trajectories are lacking during training of the policy gradient algorithm, and the variance over the whole training process is high. The Actor-Critic algorithm introduces the value model of the value iteration method and estimates trajectory values with a value network, which in theory alleviates the high variance of the policy gradient method; in practice, however, training with this algorithm in Vizdoom scenes still suffers from oscillating, insufficiently stable updates of the agent's action policy. In Vizdoom scenes where environment reward feedback is sparse, the lack of reward signals means the algorithm either cannot update the policy or fails to converge because of large oscillations during training. Therefore, when deep reinforcement learning algorithms are applied to the three-dimensional video game Vizdoom, the common lack of environment-feedback reward values in three-dimensional scenes is a real problem.
Disclosure of Invention
To solve the problems in the prior art, the present invention provides a video game decision method based on intrinsic rewards.
The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the structure of the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
The beneficial effect of the invention is that, through the above scheme, the common lack of environment-feedback reward values in three-dimensional scenes is handled well.
Drawings
FIG. 1 is the overall flow chart of the video game decision method based on intrinsic rewards of the invention.
FIG. 2 is a diagram of the Vizdoom simulation environment of the intrinsic reward based video game decision method of the present invention.
FIG. 3 is a block diagram of the neural network used for the deep reinforcement learning solution in the video game decision method according to the present invention.
FIG. 4 is a diagram of the intrinsically motivated reinforcement learning model of the video game decision method based on intrinsic rewards according to the invention.
FIG. 5 is a block diagram of the intrinsic reward generation module of the video game decision method based on intrinsic rewards of the present invention.
FIG. 6 is a network architecture diagram of the target mapping network and the prediction network of the intrinsic reward based video game decision method of the present invention.
FIG. 7 is a flow chart of the intrinsic reward generation mechanism of the video game decision method based on intrinsic rewards of the present invention.
FIG. 8 is a schematic diagram of the differentiated reward fusion method of the intrinsic reward based video game decision method according to the present invention.
FIG. 9 shows the changed value network structure of the video game decision method based on intrinsic rewards of the invention.
FIG. 10 is a flow chart of the intrinsic reward policy optimization algorithm of the intrinsic reward based video game decision method of the present invention.
FIG. 11 is a diagram of the path-finding scenario of the Vizdoom platform in the video game decision method based on intrinsic rewards of the invention.
FIG. 12 is a comparison graph of the training effect of the IBPO algorithm of the video game decision method based on intrinsic rewards of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and embodiments.
The method applies deep reinforcement learning combined with an intrinsic reward mechanism to form a decision model with a certain level of intelligence, so that a game agent obtains high scores in a video game; this is the core content of the invention.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. (1) Aiming at the common lack of environment-feedback reward values in three-dimensional scenes, the invention provides an intrinsic reward model. (2) An intrinsic reward policy optimization algorithm is proposed by fusing intrinsic rewards and external rewards in a differentiated way.
As shown in fig. 1, a video game decision method based on intrinsic rewards comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the structure of the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. Aiming at the high-dimensional state space and incomplete information perception in video game play, a deep reinforcement learning method based on an intrinsic reward policy optimization algorithm is proposed. First, aiming at the common lack of environment-feedback reward values in three-dimensional scenes, the invention provides an intrinsic reward model: the intrinsic reward value generated by the designed target mapping network and prediction network compensates for the missing environment-feedback reward values and assists the agent in updating its policy. Second, considering the structural difference between the intrinsic reward model and the traditional policy optimization algorithm, the structure of the value network is adjusted to fuse the two, yielding the intrinsic reward policy optimization algorithm and improving the agent's performance in sparse-reward three-dimensional scenes. Fig. 1 is the overall flow chart of the video game decision method based on intrinsic rewards.
The specific process of the video game decision method based on intrinsic rewards provided by the invention is as follows:
1. acquiring and installing a video game simulation environment;
in recent years, DRL (deep reinforcement learning) is also hot with the increasing popularity of deep learning. Therefore, various new reinforcement learning research platforms emerge from the eyerings after rain, and the trend is to slowly expand the 3D maze from a simple toy scene, a first-person shooting game, an instant strategy game, a complex robot control scene and the like. For example, Vizdoom allows for the development of AI robots that play DOOM using visual information (screen buffers). The method is mainly used for machine vision learning, in particular to the research of deep reinforcement learning. The Vizdoom simulated game environment is acquired and installed through the Vizdoom official website, as shown in FIG. 2.
2. Constructing a neural network;
Fig. 3 shows the network architecture for solving a video game with deep reinforcement learning. The input of the model is each frame image of the video game, the output of the model is the corresponding video game operation, and the parameters of the intermediate network layers are the corresponding policy that needs to be trained by deep reinforcement learning. The invention uses the agent to make decisions in the simulation environment to collect data, and optimizes the agent's policy with a deep learning algorithm based on the collected state-action pairs. How to train a good model based on the characteristics of the video game is the key to the agent's performance and is the core innovation of the invention.
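By way of illustration only, a minimal PyTorch sketch of such a network is given below: a shared convolutional feature extractor over game frames, an Actor head outputting the action distribution, and a Critic head outputting the state value. The layer sizes, 84x84 frame resolution, and action count are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared conv feature extractor with Actor (policy) and Critic (value) heads."""
    def __init__(self, in_channels=4, n_actions=3):
        super().__init__()
        self.features = nn.Sequential(            # shared convolutional network
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumes 84x84 input frames
        )
        self.policy_head = nn.Linear(512, n_actions)  # Actor: action logits
        self.value_head = nn.Linear(512, 1)           # Critic: state-value estimate

    def forward(self, frames):
        h = self.features(frames)
        return self.policy_head(h), self.value_head(h)

# Example: a batch of one stacked 84x84 observation.
logits, value = ActorCriticNet()(torch.zeros(1, 4, 84, 84))
```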
3. Designing the video game intrinsic reward generation mechanism;
the general reinforcement learning model only has two types of entities, namely an environment and an agent, and the reward signals are all originated from the environment where the agent is located, so that additional reward signals are not easy to generate based on the reinforcement learning model. Aiming at the limitation and the deficiency of a general reinforcement learning model, the invention utilizes the concept of the internal reward and generates auxiliary reward information by designing a corresponding internal reward mechanism to help the intelligent agent to carry out strategy updating according to the internal reward when lacking the environmental reward information. Such a reinforcement Learning model including intrinsic reward benefits is defined as an intrinsic incentive reinforcement Learning model (IMRL). A general model of intrinsic incentive reinforcement learning is shown in fig. 4.
The invention designs the following intrinsic reward generation module: a target mapping network (Target Mapping Network) and a prediction network (Prediction Network) with the same structure are defined; the target mapping network and the prediction network perform feature extraction and state mapping on the input three-dimensional state picture to obtain the corresponding embedded vectors, and the similarity of the two vectors is calculated to obtain the intrinsic reward value. The target mapping network and the prediction network are defined as shown in formula (3-1) and formula (3-2), respectively:
the target mapping network is defined as the mapping of states to target embedded vectors:
Figure DEST_PATH_IMAGE001
in the formula (I), the compound is shown in the specification,
Figure DEST_PATH_IMAGE002
-a target mapping network;
Figure DEST_PATH_IMAGE003
-status;
Figure DEST_PATH_IMAGE004
-target embedding vector;
the prediction network is defined as the mapping of states to prediction embedding vectors:
Figure DEST_PATH_IMAGE005
in the formula (I), the compound is shown in the specification,
Figure DEST_PATH_IMAGE006
-prediction network;
Figure 985777DEST_PATH_IMAGE003
-status;
Figure 355578DEST_PATH_IMAGE004
target embedding vector.
The designed intrinsic reward generation module is shown in fig. 5.
In the intrinsic reward generation module, the embedding vectors output by the target mapping network and the prediction network have the same dimension. The target mapping network can be initialized by policy pre-training or by random initialization; the initialization used in the invention employs a deep reinforcement learning agent with a random policy, which explores the environment for a limited number of steps to initialize the target mapping network. The prediction network is then trained on the samples generated by the interaction between the policy optimization agent and the environment, minimizing the mean square error between the prediction network and the target mapping network on the generated sample data. The main purpose of this approach is to reduce the error caused by fitting randomness by setting a fixed optimization target, and to reduce the error caused by the complexity of the objective function by setting a simpler target mapping network initialized with a random policy. The network structure of the target mapping network and the prediction network is shown in fig. 6.
Both the target mapping network and the prediction network of the intrinsic reward generation module have the network structure shown in fig. 6: the input state picture of the three-dimensional scene is passed through a three-layer convolutional neural network for feature extraction, and a vector representation of fixed dimension is finally output. The same network structure is used for both so as to reduce the error that structural differences would introduce into the calculation of the vector similarity. The loss function L_I of the intrinsic reward generation module is defined as:
L_I = ||e_P - e_T||^2 + λΩ(θ)
where
e_P — prediction vector;
e_T — target vector;
Ω(θ) — parameter regularization term;
λ — regularization term penalty factor.
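By way of illustration only, a minimal PyTorch sketch of the intrinsic reward generation module described above is given below: a fixed target mapping network and a prediction network of identical structure, the embedding prediction error used as the intrinsic reward value, and the prediction network trained with the mean-square-error loss plus an L2 regularization term. The layer sizes, 84x84 input resolution, learning rate, and penalty factor value are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

def embedding_net(in_channels=1, embed_dim=128):
    # three convolutional layers followed by a fixed-dimension embedding (assumes 84x84 input)
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 7 * 7, embed_dim),
    )

target_net = embedding_net()       # f_T: fixed after (random or pre-trained) initialization
predict_net = embedding_net()      # f_P: trained on interaction samples
for p in target_net.parameters():
    p.requires_grad_(False)

def intrinsic_reward(state):
    # dissimilarity (prediction error) of the two embeddings -> intrinsic reward value
    with torch.no_grad():
        e_t = target_net(state)
    e_p = predict_net(state)
    return ((e_p - e_t) ** 2).mean(dim=1)

lam = 1e-4                         # regularization penalty factor (assumed value)
optimizer = torch.optim.Adam(predict_net.parameters(), lr=1e-4, weight_decay=lam)

def update_prediction_net(states):
    # MSE between prediction and target embeddings; weight_decay supplies the L2 regularization
    loss = intrinsic_reward(states).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```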
the similarity of the output vectors of the target mapping network and the prediction network is used as the magnitude of the generated intrinsic reward value in the intrinsic reward generation module, and the consideration is that the intelligent agent tends to perform exploration behaviors in a sparse reward environment. In the initial stage of the training of the intelligent agent, the activity range of the intelligent agent in the three-dimensional scene is smaller, the number of states is larger, the similarity of the output vectors of the two networks is smaller, the calculated internal reward value is larger, and the intelligent agent mainly plays back the reward source in the data tuple by taking the internal reward signal as the reward source of the experience when the self action strategy is updated. In addition, because the environmental information faced by the agent in different three-dimensional scenes or different exploration moments of the same scene is not very same, if the normalization processing is not performed on the internal reward value and the input environmental state information, the variation range of the corresponding numerical value is too large, which is not beneficial to the super-parameter selection and the input information characterization in the training process. Therefore, the intrinsic reward value and the environmental status information need to be normalized. In addition, training of the predictive network requires the collection of agents and environmentsThe interaction sample, and the action output of the agent in the environment, is derived from the policy network, so the generation of the intrinsic reward signal during the training process is also related to the policy optimization algorithm. To be provided with
Figure DEST_PATH_IMAGE013
As an action strategy for the strategy optimization algorithm agent, an intrinsic reward generation algorithm for a single training round is given, as shown in fig. 7.
The intrinsic reward generation algorithm is as follows:
Input: random initialization step number N, training round termination step E, random policy π_r, attenuation factor γ, time step t.
Output: intrinsic reward value r^I.
1) Initialize the parameters;
2) when i ∈ [1, N], circularly execute steps 3) to 7):
3) sample the current time-step action a_t according to the random policy π_r;
4) execute the action a_t to obtain the next state s_{t+1};
5) normalize the environment information s_{t+1};
6) update the time step t ← t + 1;
7) end the loop;
8) when j ∈ [1, E], circularly execute steps 9) to 13):
9) sample the current time-step action a_t according to the agent's action policy π_θ;
10) execute the action a_t to obtain the next state s_{t+1};
11) calculate the intrinsic reward value r_t^I;
12) update the time step t ← t + 1;
13) end the loop;
14) return the intrinsic reward r^I.
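By way of illustration only, a minimal sketch of the single-round procedure above is given below, assuming a Gym-style environment interface (reset/step returning state, reward, done) and a simple running mean/std estimate for the normalization step; the helper names random_policy, agent_policy, and intrinsic_reward are assumptions supplied by the surrounding training code.

```python
import numpy as np

class RunningNorm:
    """Running mean/std estimate used to normalize observations and intrinsic rewards."""
    def __init__(self, eps=1e-8):
        self.mean, self.m2, self.count, self.eps = 0.0, 0.0, 0, eps
    def update(self, x):
        x = float(np.mean(x))                 # track a scalar summary of the signal
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += (x - self.mean) * delta    # Welford-style update
    def normalize(self, x):
        std = (self.m2 / max(self.count, 1)) ** 0.5 + self.eps
        return (x - self.mean) / std

def generate_intrinsic_rewards(env, random_policy, agent_policy,
                               intrinsic_reward, warmup_steps, round_steps):
    obs_norm = RunningNorm()
    state = env.reset()
    for _ in range(warmup_steps):             # steps 2)-7): random-policy warm-up
        state, _, done, _ = env.step(random_policy(state))
        obs_norm.update(state)
        if done:
            state = env.reset()
    rewards = []
    for _ in range(round_steps):              # steps 8)-13): act with the agent policy
        state, _, done, _ = env.step(agent_policy(obs_norm.normalize(state)))
        rewards.append(intrinsic_reward(obs_norm.normalize(state)))
        if done:
            break
    return rewards                            # step 14): intrinsic reward values of the round
```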
4. Combining the intrinsic reward with the constructed neural network and applying it to the video game;
the training process of the prediction network of the intrinsic reward mechanism needs to be based on the action data samples of the agents, and the action process of the agents in the environment is driven by the strategy optimization algorithm, so that the complete intrinsic reward algorithm needs to be combined with the strategy gradient or strategy optimization algorithm. The invention combines an improved Policy optimization algorithm with an Intrinsic reward mechanism to propose an Intrinsic Based Policy Optimization (IBPO). Because the model basis of the strategy optimization algorithm is a general reinforcement learning model and has a certain difference with an internal incentive reinforcement learning model introduced with an internal reward mechanism, the structural difference of the strategy optimization algorithm and the internal incentive reinforcement learning model needs to be analyzed so as to adapt to the strategy optimization algorithm and the internal incentive reinforcement learning model. In the strategy optimization algorithm, the strategy is updated depending on the action strategy probability ratio and the estimation function in the objective function, and the objective function calculates the action strategy probability ratio and the estimation function based on the reward value, so the key point of adapting the strategy optimization algorithm and the internal reward mechanism is the processing of external reward and internal reward. In the conventional policy optimization algorithm, the reward value of the policy update is derived from only including the external reward, so after the internal reward mechanism is introduced, the combination mode of the internal reward and the policy optimization algorithm, namely the combination mode of the internal reward and the external reward, needs to be considered. The present invention employs a combination of long-term intrinsic awards and round-based external awards, as shown in fig. 8.
The original intention of the intrinsic reward is to increase the agent's capacity to explore the environment. If the intrinsic reward value were reset at the beginning of every training round of the policy optimization algorithm, i.e. if intrinsic rewards were accumulated per round, then as soon as the agent took an action that ended the current round the intrinsic reward value would immediately become zero, policy iteration and updating could not continue based on the intrinsic reward, and the agent would tend toward a conservative policy rather than exploration, contrary to the purpose of introducing intrinsic rewards. Therefore, intrinsic rewards are accumulated continuously across training rounds. For external rewards, the usual approach is to clear them to zero at the end of a single training round; this prevents the agent from continuing to accumulate external rewards after taking an action that ends the round, so that negative feedback from the environment does not produce a negative update of the policy gradient and distort the agent's action policy. In summary, the invention combines long-term intrinsic rewards with round-based external rewards.
After adopting the combination of long-term intrinsic rewards and round-based external rewards, the concrete value network structure must be considered. The traditional policy optimization algorithm contains only external rewards, so usually a single Critic value network suffices to process the input reward information; the newly introduced intrinsic reward obviously has a different meaning from the traditional external reward, so the two kinds of reward information need to be distinguished at the level of the network model. If each kind of reward information were processed by its own independent value network, two separate neural networks would be needed to fit the Critic value, doubling the number of parameters and the model complexity, increasing training time, and possibly even losing model accuracy. To remain as efficient as possible while fully respecting the meanings of the two kinds of reward information, the invention uses a value network model with two heads to process the intrinsic reward and the external reward respectively. Based on this design, different discount factors can be assigned to the two kinds of rewards, additional supervision information is provided to the cost function of the policy optimization algorithm, and the difference between the intrinsic reward signal and the external reward information is highlighted. The changed policy and value network structure is shown in fig. 9.
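By way of illustration only, a minimal PyTorch sketch of such a two-head value network is given below: one head estimates the return of the round-based external reward and the other the return of the long-term intrinsic reward, so the two reward streams can use different discount factors. The layer sizes and discount values are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

class TwoHeadValueNet(nn.Module):
    """Single Critic body with separate heads for external and intrinsic reward values."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU())
        self.extrinsic_head = nn.Linear(256, 1)   # value of round-based external rewards
        self.intrinsic_head = nn.Linear(256, 1)   # value of long-term intrinsic rewards

    def forward(self, features):
        h = self.body(features)
        return self.extrinsic_head(h), self.intrinsic_head(h)

# Different discount factors for the two reward streams (assumed values).
GAMMA_EXT, GAMMA_INT = 0.99, 0.999
v_ext, v_int = TwoHeadValueNet()(torch.zeros(1, 512))
```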
With the reward information and the value network structure handled in this way, the value network of the policy optimization algorithm is adjusted into a structure similar to the value network in the intrinsically motivated reinforcement learning model, and the external reward signal and the intrinsic reward signal are combined by differentiated fusion into the overall reward signal in the experience replay pool, thereby combining the policy optimization algorithm with the intrinsic reward generation module. The overall model structure of the intrinsic reward policy optimization algorithm is similar to that of the traditional Actor-Critic model: a shared convolutional network, a value network, and a policy network form the main part of the model, and these three parts respectively extract feature information, estimate state values, and output the agent's action probability distribution. The intrinsic reward policy optimization model differs from the traditional model mainly in two points: first, the value network in the model is structurally adjusted so that it can accommodate both the external environment reward information and the intrinsic reward signal of the intrinsically motivated reinforcement learning model; second, the model contains an additional intrinsic reward generation module that extracts features from the input state picture and calculates the intrinsic reward value.
For an end-to-end intrinsic reward policy optimization agent, the main flow from sensing the three-dimensional scene environment information to making an action decision is as follows: the input three-dimensional state picture is first processed by the shared convolutional network and by the network of the intrinsic reward generation module, the purpose of which is mainly to perceive the environment state information; it is then processed by the value network and the policy network respectively, the purpose of which is mainly to estimate state values and output the action probability distribution. The parameter update of the policy network represents the update of the action policy of the whole algorithm model, i.e. of the agent, and the parameter change is jointly determined by the intrinsic reward value, the external environment reward value, and the state values output by the value network. The intrinsic reward policy optimization algorithm effectively combines the policy optimization algorithm with the intrinsic reward generation module; its initialization is similar to that of the intrinsic reward generation algorithm, and the overall flow is shown in fig. 10.
The intrinsic reward policy optimization algorithm is as follows:
Input: total iteration rounds J, policy update step number K, round step number T, attenuation factor γ, time step t.
Output: policy network parameters θ, prediction network parameters φ.
1) Initialize the parameters and the target mapping network;
2) when j' ∈ [1, J], execute the following steps:
3) when k ∈ [1, T], circularly execute steps 4) to 9):
4) sample the current time-step action a_t according to the action policy π_θ;
5) execute the action a_t to obtain the next state s_{t+1} and the external reward r_t^E;
6) calculate the intrinsic reward value r_t^I;
7) store the agent information quintuple into the experience replay pool D;
8) update the time step t ← t + 1;
9) end the loop;
10) calculate the immediate reward value R and the reward advantage estimate A;
11) normalize the environment information according to the experience replay samples;
12) when l ∈ [1, K], circularly execute steps 13) to 15):
13) update the policy network parameters θ using the experience replay samples and the reward advantage estimate A;
14) update the prediction network parameters φ using the experience replay samples;
15) end the loop;
16) end the outer loop;
17) return the policy network parameters θ and the prediction network parameters φ.
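By way of illustration only, a minimal sketch of one IBPO update is given below, assuming a clipped-surrogate (PPO-style) objective as the underlying policy optimization algorithm, since the text above specifies only a probability-ratio-based objective with an advantage estimate; the fusion weights and the clip range are assumptions and are not fixed by the invention.

```python
import torch

def ibpo_policy_loss(new_log_probs, old_log_probs, adv_ext, adv_int,
                     ext_coef=2.0, int_coef=1.0, clip_eps=0.2):
    # Differentiated fusion of the two advantage streams (weights are assumed values).
    advantage = ext_coef * adv_ext + int_coef * adv_int
    ratio = torch.exp(new_log_probs - old_log_probs)      # action-policy probability ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

def ibpo_value_loss(v_ext_pred, v_int_pred, ret_ext, ret_int):
    # Each value head is regressed onto the return of its own reward stream.
    return ((v_ext_pred - ret_ext) ** 2).mean() + ((v_int_pred - ret_int) ** 2).mean()
```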
5. Using the constructed neural network and the obtained simulated game environment to interactively obtain game records;
Game images are generated by the video game simulation environment and input into the neural network; the neural network generates a legal action and returns it to the simulation environment. At the same time the value network generates the external value, the intrinsic reward generation network generates the intrinsic value, and finally the external value and the intrinsic value are fused in a differentiated way. Meanwhile, the simulation environment gives a score and the next image based on the action generated by the neural network. The variables generated above are combined into a game record.
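By way of illustration only, a minimal sketch of collecting such a game record is given below, assuming a Gym-style environment wrapper and network/intrinsic-reward helpers analogous to the earlier sketches; each step stores the quintuple later used to update the networks.

```python
import torch

def collect_game_record(env, policy_net, intrinsic_reward, max_steps=2048):
    """Roll out interaction for up to max_steps and store transitions as a game record."""
    record, state = [], env.reset()
    for _ in range(max_steps):
        obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        logits, _ = policy_net(obs)                      # legal-action distribution
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_state, ext_reward, done, _ = env.step(action)
        int_reward = float(intrinsic_reward(obs))        # intrinsic value for this state
        record.append((state, action, ext_reward, int_reward, next_state))
        state = env.reset() if done else next_state
    return record
```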
6. Updating the network according to the corresponding reinforcement learning algorithm by using the acquired game record;
The neural network model is updated according to the intrinsic reward policy optimization algorithm using the acquired game records, and the neural network is trained cyclically until convergence.
The video game decision method based on intrinsic rewards has the following beneficial effects:
1. experimental setup
The method takes the three-dimensional video game Vizdoom under the incomplete-information condition as the research object and test platform, and implements the intrinsic reward policy optimization algorithm IBPO on this platform.
1.1 introduction to Vizdoom scene
The Vizdoom platform is a three-dimensional, first-person-perspective video game; an agent acting in its scenes behaves much like an entity in the real world, i.e. it receives a visual signal and then makes an action decision. As a test platform for current mainstream deep reinforcement learning algorithms, the Vizdoom platform provides interfaces for receiving action inputs and feeding back reward signals, thereby simulating the environment of a reinforcement learning model. The Vizdoom platform currently provides comprehensive testing capability for training agents to explore three-dimensional environments, and the invention conducts experiments based on the platform's path-finding scenario.
The path-finding scenario is a sparse-reward scenario in the Vizdoom test platform. The whole map consists of several opaque rooms with different pictures, and a fixed target position exists only in one specific room. The agent can move freely in the path-finding scenario, but because reward feedback exists only at the target object, the agent obtains no reward feedback at any other position. The agent's starting position in the path-finding scenario is far from the target position, and the agent must pass through rooms with different contents along the way. The path-finding scenario is shown in fig. 11.
1.2 The experimental development environment is shown in Table 1.
TABLE 1
1.3 comparison of existing methods
(1) DFP: high dimensional sensory flow and low dimensional measurement flow are utilized for sensorimotor control in an immersive environment.
(2) DRQN: modular architecture is used to address the three-dimensional environment in a first person shooter game with game information provided by a simulator.
2. Results of the experiment
A deep reinforcement learning algorithm generally takes the score output by the game simulation environment as the measure of agent performance; this differs slightly across game scenarios, but it is an equivalent representation of the deep reinforcement learning reward value. In the path-finding scenario, the average reward value and the average number of action steps are used as evaluation indices. The average reward value is defined as follows: when training has reached the current step count, the ratio of the number of agents that reach the target position within the specified number of steps to the number of all trained agents, which represents the agents' path-finding success rate. The average number of action steps is defined as follows: the average number of action steps over 100 verification interactions in the path-finding scenario after the algorithm has converged, which represents the stability of the agent's action policy. The performance of agents trained by the different algorithms in the path-finding scenario is shown in fig. 12.
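By way of illustration only, a minimal sketch of computing these two evaluation indices from per-episode verification results is given below; the (reached, steps) pair format of the input is an assumption.

```python
def average_reward_value(episodes):
    # fraction of verification episodes in which the target is reached within the step limit
    return sum(1 for reached, _ in episodes if reached) / len(episodes)

def average_action_steps(episodes):
    # mean number of action steps over all verification interactions
    return sum(steps for _, steps in episodes) / len(episodes)
```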
The vertical axis represents the average reward value obtained by the reinforcement learning agent in the path-finding scenario, and the horizontal axis represents the time step during training. The IBPO algorithm finally reaches an average reward value of 0.92 after training, while the DRQN algorithm and the DFP algorithm finally reach average reward values of 0.79 and 0.86, respectively. The closer the average reward value is to 1, the more effectively the algorithm trains the agent to learn an action policy and the greater the chance of reaching the desired location. The main reason for this result is the lack of reward feedback in the path-finding scenario; the intrinsic reward signal in the IBPO algorithm assists the agent's action policy updates in such a scenario and makes up for the lack of positive reward values in the experience replay pool, so an exploration policy can be learned faster. The comparison experiment proves that the IBPO algorithm can train a reinforcement learning agent with relatively strong exploration performance in the three-dimensional-perspective path-finding scenario.
The average reward values and average numbers of action steps for the path-finding scenario experiments are shown in table 2. The average reward value corresponds to fig. 12, and the average number of action steps represents the number of action steps the agents trained by the different algorithms need to reach the target position. The IBPO algorithm performs best among the three algorithms with an average of 61.8 action steps; the smallest average number of action steps indicates that, on average, the IBPO algorithm finds a path to the target position more quickly, i.e. it has a more stable action policy.
TABLE 2 Comparison of IBPO algorithm experimental data

Evaluation index                  IBPO    DFP     DRQN
Average reward value              0.92    0.86    0.75
Average number of action steps    61.8    69.3    75.7
Combining the above analysis, the intrinsic reward value generated by the intrinsic reward generation module provides an auxiliary reward signal for the agent's policy updates during training, so that the agent can still learn an effective exploration policy in a sparse-reward scenario. Comparative analysis with the DRQN and DFP algorithms shows that the IBPO algorithm surpasses both on the two evaluation indices, average reward value and average number of action steps, demonstrating better overall performance.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (8)

1. A method for intrinsic reward based video game decision making, comprising the steps of:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence;
wherein step S3 includes an intrinsic reward generation algorithm as follows:
Input: random initialization step number N, training round termination step E, random policy π_r, attenuation factor γ, time step t;
Output: intrinsic reward value r^I;
1) initializing the parameters;
2) when i ∈ [1, N], circularly executing steps 3) to 7); otherwise, executing step 8);
3) sampling the current time-step action a_t according to the random policy π_r;
4) executing the action a_t to obtain the next state s_{t+1};
5) normalizing the environment information s_{t+1};
6) updating the time step t ← t + 1;
7) increasing i by 1;
8) when j ∈ [1, E], circularly executing steps 9) to 13); otherwise, executing step 14);
9) sampling the current time-step action a_t according to the agent action policy π_θ;
10) executing the action a_t to obtain the next state s_{t+1};
11) calculating the intrinsic reward value r_t^I;
12) updating the time step t ← t + 1;
13) increasing j by 1;
14) returning the intrinsic reward value r^I.
2. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: step S3 includes: designing an intrinsic reward generation module, defining a target mapping network and a prediction network with the same structure, performing feature extraction and state mapping on the input three-dimensional state picture with the target mapping network and the prediction network to obtain the corresponding embedded vectors respectively, and calculating the similarity of the two embedded vectors to obtain the intrinsic reward value.
3. A method of intrinsic reward based video game decision making as recited in claim 2, wherein: in step S3, the target mapping network and the prediction network are defined as shown in formula (3-1) and formula (3-2), respectively:
the target mapping network is defined as the mapping from a state to a target embedding vector:
f_T(s) = e_T    (3-1)
where
f_T — target mapping network;
s — state;
e_T — target embedding vector;
the prediction network is defined as the mapping from a state to a prediction embedding vector:
f_P(s) = e_P    (3-2)
where
f_P — prediction network;
s — state;
e_P — prediction embedding vector.
4. A method of intrinsic reward based video game decision making as recited in claim 3, wherein: in step S3, the loss function L_I of the intrinsic reward generation module is defined as:
L_I = ||e_P - e_T||^2 + λΩ(θ)
where
e_P — prediction vector;
e_T — target vector;
Ω(θ) — parameter regularization term;
λ — regularization term penalty factor.
5. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: in step S4, a combination of long-term intrinsic rewards and round-based external rewards is adopted.
6. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: step S4 includes an intrinsic reward policy optimization algorithm, which is as follows:
Input: total iteration rounds J, policy update step number K, round step number T, attenuation factor γ, time step t;
Output: policy network parameters θ, prediction network parameters φ;
(1) initializing the parameters and the target mapping network;
(2) when j' ∈ [1, J]:
(3) when k ∈ [1, T], circularly executing steps (4) to (9); otherwise, executing step (10);
(4) sampling the current time-step action a_t according to the action policy π_θ;
(5) executing the action a_t to obtain the next state s_{t+1} and the external reward r_t^E;
(6) calculating the intrinsic reward value r_t^I;
(7) storing the agent information quintuple into the experience replay pool D;
(8) updating the time step t ← t + 1, and increasing the value of j' by 1;
(9) increasing the value of k by 1;
(10) calculating the immediate reward value R and the reward advantage estimate A;
(11) normalizing the environment information according to the experience replay samples;
(12) when l ∈ [1, K], circularly executing steps (13) to (15); otherwise, executing step (16);
(13) updating the policy network parameters θ using the experience replay samples and the reward advantage estimate A;
(14) updating the prediction network parameters φ using the experience replay samples;
(15) increasing the value of l by 1;
(16) returning the policy network parameters θ and the prediction network parameters φ.
7. The intrinsic reward based video game decision method of claim 6, wherein step S5 includes: using the constructed neural network model and the acquired simulated game environment to interact and obtain game records; game images are generated by the video game simulation environment and input into the neural network model, the neural network model generates legal actions and returns them to the simulation environment, meanwhile the value network generates external values and the intrinsic reward generation network generates intrinsic values, and finally the external values and the intrinsic values are fused; at the same time the simulation environment gives a score and the next image based on the actions generated by the neural network model.
8. The intrinsic reward based video game decision method of claim 6, wherein step S6 includes: updating the neural network model according to the intrinsic reward policy optimization algorithm using the acquired game records.
CN202010370070.1A 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards Active CN111260040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010370070.1A CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010370070.1A CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Publications (2)

Publication Number Publication Date
CN111260040A CN111260040A (en) 2020-06-09
CN111260040B true CN111260040B (en) 2020-11-06

Family

ID=70955207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010370070.1A Active CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Country Status (1)

Country Link
CN (1) CN111260040B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112221140B (en) * 2020-11-04 2024-03-22 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training action determination model of virtual object
CN112818672A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Reinforced learning emotion analysis system based on text game
CN113392971B (en) * 2021-06-11 2022-09-02 武汉大学 Strategy network training method, device, equipment and readable storage medium
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113947022B (en) * 2021-10-20 2022-07-12 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model
CN116384469B (en) * 2023-06-05 2023-08-08 中国人民解放军国防科技大学 Agent policy generation method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218699A1 (en) * 2016-06-17 2017-12-21 Graham Leslie Fyffe System and methods for intrinsic reward reinforcement learning
NZ759818A (en) * 2017-10-16 2022-04-29 Illumina Inc Semi-supervised learning for training an ensemble of deep convolutional neural networks
CN109978133A (en) * 2018-12-29 2019-07-05 南京大学 A kind of intensified learning moving method based on action mode

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Intrinsically motivated model learning for developing curious robots; Todd Hester, et al.; Artificial Intelligence; 2015-12-31; pp. 1-17 *
Online autonomous learning in unknown environments for robots driven by internal motivation; Hu Qixiang et al.; Computer Engineering and Applications; 2014-12-31; Vol. 50, No. 4; pp. 110-113 *

Also Published As

Publication number Publication date
CN111260040A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111260040B (en) Video game decision method based on intrinsic rewards
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111111220B (en) Self-chess-playing model training method and device for multiplayer battle game and computer equipment
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN105637540A (en) Methods and apparatus for reinforcement learning
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN114330651A (en) Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN113947022B (en) Near-end strategy optimization method based on model
Chen et al. Visual hide and seek
CN113894780B (en) Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN116796844A (en) M2 GPI-based unmanned aerial vehicle one-to-one chase game method
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
Adamsson Curriculum learning for increasing the performance of a reinforcement learning agent in a static first-person shooter game
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN111437605B (en) Method for determining virtual object behaviors and hosting virtual object behaviors
Rajabi et al. A dynamic balanced level generator for video games based on deep convolutional generative adversarial networks
CN115944921B (en) Game data processing method, device, equipment and medium
Liu et al. Soft-Actor-Attention-Critic Based on Unknown Agent Action Prediction for Multi-Agent Collaborative Confrontation
Liu Meta-Reinforcement Learning: Algorithms and Applications
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant