CN111260040B - Video game decision method based on intrinsic rewards - Google Patents

Video game decision method based on intrinsic rewards

Info

Publication number
CN111260040B
Authority
CN
China
Prior art keywords
intrinsic
reward
network
video game
value
Prior art date
Legal status
Active
Application number
CN202010370070.1A
Other languages
Chinese (zh)
Other versions
CN111260040A (en)
Inventor
王轩
漆舒汉
张加佳
曹睿
何志坤
刘洋
蒋琳
廖清
夏文
李化乐
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010370070.1A priority Critical patent/CN111260040B/en
Publication of CN111260040A publication Critical patent/CN111260040A/en
Application granted granted Critical
Publication of CN111260040B publication Critical patent/CN111260040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • A63F13/46Computing the game score
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps: S1, acquiring a video game simulation environment; S2, constructing a neural network model; S3, designing an intrinsic reward model; S4, combining the intrinsic reward model with the structure of the constructed neural network model; S5, obtaining game records through the simulation environment; S6, updating the neural network model with the acquired game records; and S7, cyclically training the neural network model until convergence. The beneficial effect of the invention is that the common lack of environment-feedback reward values in three-dimensional scenes is handled well.

Description

Video game decision method based on intrinsic rewards
Technical Field
The invention relates to a video game decision method, in particular to a video game decision method based on intrinsic rewards.
Background
Since video games first appeared in the early 1970s, realizing automatic decision making by agents in video games through artificial intelligence has been a research hotspot in both industry and academia, and it has great commercial value. In recent years, the rapid development of deep reinforcement learning methods has provided an effective way to realize this technology. Generally speaking, the quality of a game decision-making technique is judged entirely by the score obtained in the game or by whether the game can be won, and this is also the case for video games.
Deep reinforcement learning algorithms applied to complex game scenes have the advantage of being end to end: the agent's action policy is learned by the deep reinforcement learning algorithm, so the mapping from the input game state to the output action can be completed directly, which provides a universal algorithmic framework for solving various game tasks; the Actor-Critic algorithm is a representative example. In deep reinforcement learning algorithms that take the Actor-Critic algorithm as the basic framework, the common approach for training game-playing agents is to first extract features from the game state with a convolutional network, then learn the agent's action policy with the Actor network, evaluate and improve the policy with the Critic network, and iterate the training until convergence. However, in a few Atari video game scenes, agents built on this framework find it difficult to learn a policy that efficiently acquires the environment reward. What these scenes have in common is that the environment is complex and reward feedback is hard to obtain directly; the agent often needs a long series of action decisions, or must refer to more historical information, before it can take an action that yields a positive reward value. The reason is that the Actor-Critic algorithm essentially combines the value iteration method and the policy gradient method, and the policy gradient method must sample trajectories from the agent's interaction and update the policy from them; if sufficient sampled trajectories are lacking, or their quality is poor, the optimization of the policy gradient is affected and the agent cannot learn a correct and efficient policy. In the three-dimensional video game Vizdoom, the agent can only see the small part of the game scene within its line of sight, while a large number of mazes, traps, and other design mechanisms hinder exploration and reward acquisition. Because reward feedback is sparse, the proportion of high-return actions in the sampled trajectories is small, positive-reward trajectories are lacking during training of the policy gradient algorithm, and the variance over the whole training process is high. The Actor-Critic algorithm introduces the value model of the value iteration method and estimates trajectory values with a value network, which in theory alleviates the high variance of the policy gradient method; in practice, however, training with this algorithm in Vizdoom scenes still suffers from oscillating, insufficiently stable updates of the agent's action policy. In Vizdoom scenes where environment reward feedback is sparse, the lack of reward signals means the algorithm either cannot update the policy or fails to converge because of large oscillations during training. Therefore, when deep reinforcement learning algorithms are applied to the three-dimensional video game Vizdoom, the common lack of environment-feedback reward values in three-dimensional scenes is a real problem.
Disclosure of Invention
To solve the problems in the prior art, the present invention provides a video game decision method based on intrinsic rewards.
The invention provides a video game decision method based on intrinsic rewards, which comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the structure of the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
The beneficial effect of the invention is that, through the above scheme, the common lack of environment-feedback reward values in three-dimensional scenes is handled well.
Drawings
FIG. 1 is the overall flow chart of the video game decision method based on intrinsic rewards of the invention.
FIG. 2 is a diagram of the Vizdoom simulation environment of the intrinsic reward based video game decision method of the present invention.
FIG. 3 is a block diagram of the neural network used for the deep reinforcement learning solution in the video game decision method according to the present invention.
FIG. 4 is a diagram of the intrinsically motivated reinforcement learning model of the video game decision method based on intrinsic rewards according to the invention.
FIG. 5 is a block diagram of the intrinsic reward generation module of the video game decision method based on intrinsic rewards of the present invention.
FIG. 6 is a network architecture diagram of the target mapping network and the prediction network of the intrinsic reward based video game decision method of the present invention.
FIG. 7 is a flow chart of the intrinsic reward generation mechanism of the video game decision method based on intrinsic rewards of the present invention.
FIG. 8 is a schematic diagram of the differentiated reward fusion method of the intrinsic reward based video game decision method according to the present invention.
FIG. 9 shows the changed value network structure of the video game decision method based on intrinsic rewards of the invention.
FIG. 10 is a flow chart of the intrinsic reward policy optimization algorithm of the intrinsic reward based video game decision method of the present invention.
FIG. 11 is a diagram of the path-finding scenario of the Vizdoom platform in the video game decision method based on intrinsic rewards of the invention.
FIG. 12 is a comparison graph of the training effect of the IBPO algorithm of the video game decision method based on intrinsic rewards of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and embodiments.
The method applies deep reinforcement learning combined with an intrinsic reward mechanism to form a decision model with a certain level of intelligence, so that a game agent obtains high scores in a video game; this is the core content of the invention.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. (1) Aiming at the common lack of environment-feedback reward values in three-dimensional scenes, the invention provides an intrinsic reward model. (2) An intrinsic reward policy optimization algorithm is proposed by fusing intrinsic rewards and external rewards in a differentiated way.
As shown in fig. 1, a video game decision method based on intrinsic rewards comprises the following steps:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the structure of the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence.
The invention mainly studies the policy-solving problem of three-dimensional video games under incomplete information. Aiming at the high-dimensional state space and incomplete information perception in video game play, a deep reinforcement learning method based on an intrinsic reward policy optimization algorithm is proposed. First, aiming at the common lack of environment-feedback reward values in three-dimensional scenes, the invention provides an intrinsic reward model: the intrinsic reward value generated by the designed target mapping network and prediction network compensates for the missing environment-feedback reward values and assists the agent in updating its policy. Second, considering the structural difference between the intrinsic reward model and the traditional policy optimization algorithm, the structure of the value network is adjusted to fuse the two, yielding the intrinsic reward policy optimization algorithm and improving the agent's performance in sparse-reward three-dimensional scenes. Fig. 1 is the overall flow chart of the video game decision method based on intrinsic rewards.
The specific process of the video game decision method based on intrinsic rewards provided by the invention is as follows:
1. acquiring and installing a video game simulation environment;
in recent years, DRL (deep reinforcement learning) is also hot with the increasing popularity of deep learning. Therefore, various new reinforcement learning research platforms emerge from the eyerings after rain, and the trend is to slowly expand the 3D maze from a simple toy scene, a first-person shooting game, an instant strategy game, a complex robot control scene and the like. For example, Vizdoom allows for the development of AI robots that play DOOM using visual information (screen buffers). The method is mainly used for machine vision learning, in particular to the research of deep reinforcement learning. The Vizdoom simulated game environment is acquired and installed through the Vizdoom official website, as shown in FIG. 2.
2. Constructing a neural network;
Fig. 3 shows the network architecture for solving a video game with deep reinforcement learning. The input of the model is each frame image of the video game, the output of the model is the corresponding video game operation, and the parameters of the intermediate network layers are the corresponding policy that needs to be trained by deep reinforcement learning. The invention uses the agent to make decisions in the simulation environment to collect data, and optimizes the agent's policy with a deep learning algorithm based on the collected state-action pairs. How to train a good model based on the characteristics of the video game is the key to the agent's performance and is the core innovation of the invention.
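By way of illustration only, a minimal PyTorch sketch of such a network is given below: a shared convolutional feature extractor over game frames, an Actor head outputting the action distribution, and a Critic head outputting the state value. The layer sizes, 84x84 frame resolution, and action count are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared conv feature extractor with Actor (policy) and Critic (value) heads."""
    def __init__(self, in_channels=4, n_actions=3):
        super().__init__()
        self.features = nn.Sequential(            # shared convolutional network
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumes 84x84 input frames
        )
        self.policy_head = nn.Linear(512, n_actions)  # Actor: action logits
        self.value_head = nn.Linear(512, 1)           # Critic: state-value estimate

    def forward(self, frames):
        h = self.features(frames)
        return self.policy_head(h), self.value_head(h)

# Example: a batch of one stacked 84x84 observation.
logits, value = ActorCriticNet()(torch.zeros(1, 4, 84, 84))
```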
3. Designing the video game intrinsic reward generation mechanism;
the general reinforcement learning model only has two types of entities, namely an environment and an agent, and the reward signals are all originated from the environment where the agent is located, so that additional reward signals are not easy to generate based on the reinforcement learning model. Aiming at the limitation and the deficiency of a general reinforcement learning model, the invention utilizes the concept of the internal reward and generates auxiliary reward information by designing a corresponding internal reward mechanism to help the intelligent agent to carry out strategy updating according to the internal reward when lacking the environmental reward information. Such a reinforcement Learning model including intrinsic reward benefits is defined as an intrinsic incentive reinforcement Learning model (IMRL). A general model of intrinsic incentive reinforcement learning is shown in fig. 4.
The invention designs the following intrinsic reward generation module: a target mapping network (Target Mapping Network) and a prediction network (Prediction Network) with the same structure are defined; the target mapping network and the prediction network perform feature extraction and state mapping on the input three-dimensional state picture to obtain the corresponding embedded vectors, and the similarity of the two vectors is calculated to obtain the intrinsic reward value. The target mapping network and the prediction network are defined as shown in formula (3-1) and formula (3-2), respectively:
the target mapping network is defined as the mapping of states to target embedded vectors:
Figure DEST_PATH_IMAGE001
in the formula (I), the compound is shown in the specification,
Figure DEST_PATH_IMAGE002
-a target mapping network;
Figure DEST_PATH_IMAGE003
-status;
Figure DEST_PATH_IMAGE004
-target embedding vector;
the prediction network is defined as the mapping of states to prediction embedding vectors:
Figure DEST_PATH_IMAGE005
in the formula (I), the compound is shown in the specification,
Figure DEST_PATH_IMAGE006
-prediction network;
Figure 985777DEST_PATH_IMAGE003
-status;
Figure 355578DEST_PATH_IMAGE004
target embedding vector.
The designed intrinsic reward generation module is shown in fig. 5.
In the intrinsic reward generation module, the embedding vectors output by the target mapping network and the prediction network have the same dimension. The target mapping network can be initialized by policy pre-training or by random initialization; the initialization used in the invention employs a deep reinforcement learning agent with a random policy, which explores the environment for a limited number of steps to initialize the target mapping network. The prediction network is then trained on the samples generated by the interaction between the policy optimization agent and the environment, minimizing the mean square error between the prediction network and the target mapping network on the generated sample data. The main purpose of this approach is to reduce the error caused by fitting randomness by setting a fixed optimization target, and to reduce the error caused by the complexity of the objective function by setting a simpler target mapping network initialized with a random policy. The network structure of the target mapping network and the prediction network is shown in fig. 6.
Both the target mapping network and the prediction network of the intrinsic reward generation module have the network structure shown in fig. 6: the input state picture of the three-dimensional scene is passed through a three-layer convolutional neural network for feature extraction, and a vector representation of fixed dimension is finally output. The same network structure is used for both so as to reduce the error that structural differences would introduce into the calculation of the vector similarity. The loss function L_I of the intrinsic reward generation module is defined as:
L_I = ||e_P - e_T||^2 + λΩ(θ)
where
e_P — prediction vector;
e_T — target vector;
Ω(θ) — parameter regularization term;
λ — regularization term penalty factor.
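By way of illustration only, a minimal PyTorch sketch of the intrinsic reward generation module described above is given below: a fixed target mapping network and a prediction network of identical structure, the embedding prediction error used as the intrinsic reward value, and the prediction network trained with the mean-square-error loss plus an L2 regularization term. The layer sizes, 84x84 input resolution, learning rate, and penalty factor value are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

def embedding_net(in_channels=1, embed_dim=128):
    # three convolutional layers followed by a fixed-dimension embedding (assumes 84x84 input)
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 7 * 7, embed_dim),
    )

target_net = embedding_net()       # f_T: fixed after (random or pre-trained) initialization
predict_net = embedding_net()      # f_P: trained on interaction samples
for p in target_net.parameters():
    p.requires_grad_(False)

def intrinsic_reward(state):
    # dissimilarity (prediction error) of the two embeddings -> intrinsic reward value
    with torch.no_grad():
        e_t = target_net(state)
    e_p = predict_net(state)
    return ((e_p - e_t) ** 2).mean(dim=1)

lam = 1e-4                         # regularization penalty factor (assumed value)
optimizer = torch.optim.Adam(predict_net.parameters(), lr=1e-4, weight_decay=lam)

def update_prediction_net(states):
    # MSE between prediction and target embeddings; weight_decay supplies the L2 regularization
    loss = intrinsic_reward(states).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```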
the similarity of the output vectors of the target mapping network and the prediction network is used as the magnitude of the generated intrinsic reward value in the intrinsic reward generation module, and the consideration is that the intelligent agent tends to perform exploration behaviors in a sparse reward environment. In the initial stage of the training of the intelligent agent, the activity range of the intelligent agent in the three-dimensional scene is smaller, the number of states is larger, the similarity of the output vectors of the two networks is smaller, the calculated internal reward value is larger, and the intelligent agent mainly plays back the reward source in the data tuple by taking the internal reward signal as the reward source of the experience when the self action strategy is updated. In addition, because the environmental information faced by the agent in different three-dimensional scenes or different exploration moments of the same scene is not very same, if the normalization processing is not performed on the internal reward value and the input environmental state information, the variation range of the corresponding numerical value is too large, which is not beneficial to the super-parameter selection and the input information characterization in the training process. Therefore, the intrinsic reward value and the environmental status information need to be normalized. In addition, training of the predictive network requires the collection of agents and environmentsThe interaction sample, and the action output of the agent in the environment, is derived from the policy network, so the generation of the intrinsic reward signal during the training process is also related to the policy optimization algorithm. To be provided with
Figure DEST_PATH_IMAGE013
As an action strategy for the strategy optimization algorithm agent, an intrinsic reward generation algorithm for a single training round is given, as shown in fig. 7.
The intrinsic reward generation algorithm is as follows:
Input: random initialization step number N, training round termination step E, random policy π_r, attenuation factor γ, time step t.
Output: intrinsic reward value r^I.
1) Initialize the parameters;
2) when i ∈ [1, N], circularly execute steps 3) to 7):
3) sample the current time-step action a_t according to the random policy π_r;
4) execute the action a_t to obtain the next state s_{t+1};
5) normalize the environment information s_{t+1};
6) update the time step t ← t + 1;
7) end the loop;
8) when j ∈ [1, E], circularly execute steps 9) to 13):
9) sample the current time-step action a_t according to the agent's action policy π_θ;
10) execute the action a_t to obtain the next state s_{t+1};
11) calculate the intrinsic reward value r_t^I;
12) update the time step t ← t + 1;
13) end the loop;
14) return the intrinsic reward r^I.
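By way of illustration only, a minimal sketch of the single-round procedure above is given below, assuming a Gym-style environment interface (reset/step returning state, reward, done) and a simple running mean/std estimate for the normalization step; the helper names random_policy, agent_policy, and intrinsic_reward are assumptions supplied by the surrounding training code.

```python
import numpy as np

class RunningNorm:
    """Running mean/std estimate used to normalize observations and intrinsic rewards."""
    def __init__(self, eps=1e-8):
        self.mean, self.m2, self.count, self.eps = 0.0, 0.0, 0, eps
    def update(self, x):
        x = float(np.mean(x))                 # track a scalar summary of the signal
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += (x - self.mean) * delta    # Welford-style update
    def normalize(self, x):
        std = (self.m2 / max(self.count, 1)) ** 0.5 + self.eps
        return (x - self.mean) / std

def generate_intrinsic_rewards(env, random_policy, agent_policy,
                               intrinsic_reward, warmup_steps, round_steps):
    obs_norm = RunningNorm()
    state = env.reset()
    for _ in range(warmup_steps):             # steps 2)-7): random-policy warm-up
        state, _, done, _ = env.step(random_policy(state))
        obs_norm.update(state)
        if done:
            state = env.reset()
    rewards = []
    for _ in range(round_steps):              # steps 8)-13): act with the agent policy
        state, _, done, _ = env.step(agent_policy(obs_norm.normalize(state)))
        rewards.append(intrinsic_reward(obs_norm.normalize(state)))
        if done:
            break
    return rewards                            # step 14): intrinsic reward values of the round
```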
4. Combining the intrinsic reward with the constructed neural network and applying it to the video game;
the training process of the prediction network of the intrinsic reward mechanism needs to be based on the action data samples of the agents, and the action process of the agents in the environment is driven by the strategy optimization algorithm, so that the complete intrinsic reward algorithm needs to be combined with the strategy gradient or strategy optimization algorithm. The invention combines an improved Policy optimization algorithm with an Intrinsic reward mechanism to propose an Intrinsic Based Policy Optimization (IBPO). Because the model basis of the strategy optimization algorithm is a general reinforcement learning model and has a certain difference with an internal incentive reinforcement learning model introduced with an internal reward mechanism, the structural difference of the strategy optimization algorithm and the internal incentive reinforcement learning model needs to be analyzed so as to adapt to the strategy optimization algorithm and the internal incentive reinforcement learning model. In the strategy optimization algorithm, the strategy is updated depending on the action strategy probability ratio and the estimation function in the objective function, and the objective function calculates the action strategy probability ratio and the estimation function based on the reward value, so the key point of adapting the strategy optimization algorithm and the internal reward mechanism is the processing of external reward and internal reward. In the conventional policy optimization algorithm, the reward value of the policy update is derived from only including the external reward, so after the internal reward mechanism is introduced, the combination mode of the internal reward and the policy optimization algorithm, namely the combination mode of the internal reward and the external reward, needs to be considered. The present invention employs a combination of long-term intrinsic awards and round-based external awards, as shown in fig. 8.
The original intention of the intrinsic reward is to increase the agent's capacity to explore the environment. If the intrinsic reward value were reset at the beginning of every training round of the policy optimization algorithm, i.e. if intrinsic rewards were accumulated per round, then as soon as the agent took an action that ended the current round the intrinsic reward value would immediately become zero, policy iteration and updating could not continue based on the intrinsic reward, and the agent would tend toward a conservative policy rather than exploration, contrary to the purpose of introducing intrinsic rewards. Therefore, intrinsic rewards are accumulated continuously across training rounds. For external rewards, the usual approach is to clear them to zero at the end of a single training round; this prevents the agent from continuing to accumulate external rewards after taking an action that ends the round, so that negative feedback from the environment does not produce a negative update of the policy gradient and distort the agent's action policy. In summary, the invention combines long-term intrinsic rewards with round-based external rewards.
After adopting the combination of long-term intrinsic rewards and round-based external rewards, the concrete value network structure must be considered. The traditional policy optimization algorithm contains only external rewards, so usually a single Critic value network suffices to process the input reward information; the newly introduced intrinsic reward obviously has a different meaning from the traditional external reward, so the two kinds of reward information need to be distinguished at the level of the network model. If each kind of reward information were processed by its own independent value network, two separate neural networks would be needed to fit the Critic value, doubling the number of parameters and the model complexity, increasing training time, and possibly even losing model accuracy. To remain as efficient as possible while fully respecting the meanings of the two kinds of reward information, the invention uses a value network model with two heads to process the intrinsic reward and the external reward respectively. Based on this design, different discount factors can be assigned to the two kinds of rewards, additional supervision information is provided to the cost function of the policy optimization algorithm, and the difference between the intrinsic reward signal and the external reward information is highlighted. The changed policy and value network structure is shown in fig. 9.
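By way of illustration only, a minimal PyTorch sketch of such a two-head value network is given below: one head estimates the return of the round-based external reward and the other the return of the long-term intrinsic reward, so the two reward streams can use different discount factors. The layer sizes and discount values are assumptions and are not fixed by the invention.

```python
import torch
import torch.nn as nn

class TwoHeadValueNet(nn.Module):
    """Single Critic body with separate heads for external and intrinsic reward values."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU())
        self.extrinsic_head = nn.Linear(256, 1)   # value of round-based external rewards
        self.intrinsic_head = nn.Linear(256, 1)   # value of long-term intrinsic rewards

    def forward(self, features):
        h = self.body(features)
        return self.extrinsic_head(h), self.intrinsic_head(h)

# Different discount factors for the two reward streams (assumed values).
GAMMA_EXT, GAMMA_INT = 0.99, 0.999
v_ext, v_int = TwoHeadValueNet()(torch.zeros(1, 512))
```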
With the reward information and the value network structure handled in this way, the value network of the policy optimization algorithm is adjusted into a structure similar to the value network in the intrinsically motivated reinforcement learning model, and the external reward signal and the intrinsic reward signal are combined by differentiated fusion into the overall reward signal in the experience replay pool, thereby combining the policy optimization algorithm with the intrinsic reward generation module. The overall model structure of the intrinsic reward policy optimization algorithm is similar to that of the traditional Actor-Critic model: a shared convolutional network, a value network, and a policy network form the main part of the model, and these three parts respectively extract feature information, estimate state values, and output the agent's action probability distribution. The intrinsic reward policy optimization model differs from the traditional model mainly in two points: first, the value network in the model is structurally adjusted so that it can accommodate both the external environment reward information and the intrinsic reward signal of the intrinsically motivated reinforcement learning model; second, the model contains an additional intrinsic reward generation module that extracts features from the input state picture and calculates the intrinsic reward value.
For an end-to-end intrinsic reward policy optimization agent, the main flow from sensing the three-dimensional scene environment information to making an action decision is as follows: the input three-dimensional state picture is first processed by the shared convolutional network and by the network of the intrinsic reward generation module, the purpose of which is mainly to perceive the environment state information; it is then processed by the value network and the policy network respectively, the purpose of which is mainly to estimate state values and output the action probability distribution. The parameter update of the policy network represents the update of the action policy of the whole algorithm model, i.e. of the agent, and the parameter change is jointly determined by the intrinsic reward value, the external environment reward value, and the state values output by the value network. The intrinsic reward policy optimization algorithm effectively combines the policy optimization algorithm with the intrinsic reward generation module; its initialization is similar to that of the intrinsic reward generation algorithm, and the overall flow is shown in fig. 10.
The intrinsic reward policy optimization algorithm is as follows:
Input: total iteration rounds J, policy update step number K, round step number T, attenuation factor γ, time step t.
Output: policy network parameters θ, prediction network parameters φ.
1) Initialize the parameters and the target mapping network;
2) when j' ∈ [1, J], execute the following steps:
3) when k ∈ [1, T], circularly execute steps 4) to 9):
4) sample the current time-step action a_t according to the action policy π_θ;
5) execute the action a_t to obtain the next state s_{t+1} and the external reward r_t^E;
6) calculate the intrinsic reward value r_t^I;
7) store the agent information quintuple into the experience replay pool D;
8) update the time step t ← t + 1;
9) end the loop;
10) calculate the immediate reward value R and the reward advantage estimate A;
11) normalize the environment information according to the experience replay samples;
12) when l ∈ [1, K], circularly execute steps 13) to 15):
13) update the policy network parameters θ using the experience replay samples and the reward advantage estimate A;
14) update the prediction network parameters φ using the experience replay samples;
15) end the loop;
16) end the outer loop;
17) return the policy network parameters θ and the prediction network parameters φ.
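By way of illustration only, a minimal sketch of one IBPO update is given below, assuming a clipped-surrogate (PPO-style) objective as the underlying policy optimization algorithm, since the text above specifies only a probability-ratio-based objective with an advantage estimate; the fusion weights and the clip range are assumptions and are not fixed by the invention.

```python
import torch

def ibpo_policy_loss(new_log_probs, old_log_probs, adv_ext, adv_int,
                     ext_coef=2.0, int_coef=1.0, clip_eps=0.2):
    # Differentiated fusion of the two advantage streams (weights are assumed values).
    advantage = ext_coef * adv_ext + int_coef * adv_int
    ratio = torch.exp(new_log_probs - old_log_probs)      # action-policy probability ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

def ibpo_value_loss(v_ext_pred, v_int_pred, ret_ext, ret_int):
    # Each value head is regressed onto the return of its own reward stream.
    return ((v_ext_pred - ret_ext) ** 2).mean() + ((v_int_pred - ret_int) ** 2).mean()
```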
5. Using the constructed neural network and the obtained simulated game environment to interactively obtain game records;
Game images are generated by the video game simulation environment and input into the neural network; the neural network generates a legal action and returns it to the simulation environment. At the same time the value network generates the external value, the intrinsic reward generation network generates the intrinsic value, and finally the external value and the intrinsic value are fused in a differentiated way. Meanwhile, the simulation environment gives a score and the next image based on the action generated by the neural network. The variables generated above are combined into a game record.
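By way of illustration only, a minimal sketch of collecting such a game record is given below, assuming a Gym-style environment wrapper and network/intrinsic-reward helpers analogous to the earlier sketches; each step stores the quintuple later used to update the networks.

```python
import torch

def collect_game_record(env, policy_net, intrinsic_reward, max_steps=2048):
    """Roll out interaction for up to max_steps and store transitions as a game record."""
    record, state = [], env.reset()
    for _ in range(max_steps):
        obs = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        logits, _ = policy_net(obs)                      # legal-action distribution
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_state, ext_reward, done, _ = env.step(action)
        int_reward = float(intrinsic_reward(obs))        # intrinsic value for this state
        record.append((state, action, ext_reward, int_reward, next_state))
        state = env.reset() if done else next_state
    return record
```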
6. Updating the network according to the corresponding reinforcement learning algorithm by using the acquired game record;
The neural network model is updated according to the intrinsic reward policy optimization algorithm using the acquired game records, and the neural network is trained cyclically until convergence.
The video game decision method based on intrinsic rewards has the following beneficial effects:
1. experimental setup
The method takes the three-dimensional video game Vizdoom under the incomplete-information condition as the research object and test platform, and implements the intrinsic reward policy optimization algorithm IBPO on this platform.
1.1 introduction to Vizdoom scene
The Vizdoom platform is a three-dimensional, first-person-perspective video game; an agent acting in its scenes behaves much like an entity in the real world, i.e. it receives a visual signal and then makes an action decision. As a test platform for current mainstream deep reinforcement learning algorithms, the Vizdoom platform provides interfaces for receiving action inputs and feeding back reward signals, thereby simulating the environment of a reinforcement learning model. The Vizdoom platform currently provides comprehensive testing capability for training agents to explore three-dimensional environments, and the invention conducts experiments based on the platform's path-finding scenario.
The path-finding scenario is a sparse-reward scenario in the Vizdoom test platform. The whole map consists of several opaque rooms with different pictures, and a fixed target position exists only in one specific room. The agent can move freely in the path-finding scenario, but because reward feedback exists only at the target object, the agent obtains no reward feedback at any other position. The agent's starting position in the path-finding scenario is far from the target position, and the agent must pass through rooms with different contents along the way. The path-finding scenario is shown in fig. 11.
1.2 The experimental development environment is shown in Table 1.
TABLE 1
1.3 comparison of existing methods
(1) DFP: high dimensional sensory flow and low dimensional measurement flow are utilized for sensorimotor control in an immersive environment.
(2) DRQN: modular architecture is used to address the three-dimensional environment in a first person shooter game with game information provided by a simulator.
2. Results of the experiment
A deep reinforcement learning algorithm generally takes the score output by the game simulation environment as the measure of agent performance; this differs slightly across game scenarios, but it is an equivalent representation of the deep reinforcement learning reward value. In the path-finding scenario, the average reward value and the average number of action steps are used as evaluation indices. The average reward value is defined as follows: when training has reached the current step count, the ratio of the number of agents that reach the target position within the specified number of steps to the number of all trained agents, which represents the agents' path-finding success rate. The average number of action steps is defined as follows: the average number of action steps over 100 verification interactions in the path-finding scenario after the algorithm has converged, which represents the stability of the agent's action policy. The performance of agents trained by the different algorithms in the path-finding scenario is shown in fig. 12.
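By way of illustration only, a minimal sketch of computing these two evaluation indices from per-episode verification results is given below; the (reached, steps) pair format of the input is an assumption.

```python
def average_reward_value(episodes):
    # fraction of verification episodes in which the target is reached within the step limit
    return sum(1 for reached, _ in episodes if reached) / len(episodes)

def average_action_steps(episodes):
    # mean number of action steps over all verification interactions
    return sum(steps for _, steps in episodes) / len(episodes)
```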
The vertical axis represents the average reward value obtained by the reinforcement learning agent in the path-finding scenario, and the horizontal axis represents the time step during training. The IBPO algorithm finally reaches an average reward value of 0.92 after training, while the DRQN algorithm and the DFP algorithm finally reach average reward values of 0.79 and 0.86, respectively. The closer the average reward value is to 1, the more effectively the algorithm trains the agent to learn an action policy and the greater the chance of reaching the desired location. The main reason for this result is the lack of reward feedback in the path-finding scenario; the intrinsic reward signal in the IBPO algorithm assists the agent's action policy updates in such a scenario and makes up for the lack of positive reward values in the experience replay pool, so an exploration policy can be learned faster. The comparison experiment proves that the IBPO algorithm can train a reinforcement learning agent with relatively strong exploration performance in the three-dimensional-perspective path-finding scenario.
The average reward values and average numbers of action steps for the path-finding scenario experiments are shown in table 2. The average reward value corresponds to fig. 12, and the average number of action steps represents the number of action steps the agents trained by the different algorithms need to reach the target position. The IBPO algorithm performs best among the three algorithms with an average of 61.8 action steps; the smallest average number of action steps indicates that, on average, the IBPO algorithm finds a path to the target position more quickly, i.e. it has a more stable action policy.
TABLE 2 Comparison of IBPO algorithm experimental data

Evaluation index                  IBPO    DFP     DRQN
Average reward value              0.92    0.86    0.75
Average number of action steps    61.8    69.3    75.7
Combining the above analysis, the intrinsic reward value generated by the intrinsic reward generation module provides an auxiliary reward signal for the agent's policy updates during training, so that the agent can still learn an effective exploration policy in a sparse-reward scenario. Comparative analysis with the DRQN and DFP algorithms shows that the IBPO algorithm surpasses both on the two evaluation indices, average reward value and average number of action steps, demonstrating better overall performance.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (8)

1. A method for intrinsic reward based video game decision making, comprising the steps of:
S1, acquiring a video game simulation environment;
S2, constructing a neural network model;
S3, designing an intrinsic reward model;
S4, combining the intrinsic reward model with the constructed neural network model;
S5, obtaining game records through the simulation environment;
S6, updating the neural network model with the acquired game records;
S7, cyclically training the neural network model until convergence;
wherein step S3 includes an intrinsic reward generation algorithm as follows:
Input: random initialization step number N, training round termination step E, random policy π_r, attenuation factor γ, time step t;
Output: intrinsic reward value r^I;
1) initializing the parameters;
2) when i ∈ [1, N], circularly executing steps 3) to 7); otherwise, executing step 8);
3) sampling the current time-step action a_t according to the random policy π_r;
4) executing the action a_t to obtain the next state s_{t+1};
5) normalizing the environment information s_{t+1};
6) updating the time step t ← t + 1;
7) increasing i by 1;
8) when j ∈ [1, E], circularly executing steps 9) to 13); otherwise, executing step 14);
9) sampling the current time-step action a_t according to the agent action policy π_θ;
10) executing the action a_t to obtain the next state s_{t+1};
11) calculating the intrinsic reward value r_t^I;
12) updating the time step t ← t + 1;
13) increasing j by 1;
14) returning the intrinsic reward value r^I.
2. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: step S3 includes: designing an intrinsic reward generation module, defining a target mapping network and a prediction network with the same structure, performing feature extraction and state mapping on the input three-dimensional state picture with the target mapping network and the prediction network to obtain the corresponding embedded vectors respectively, and calculating the similarity of the two embedded vectors to obtain the intrinsic reward value.
3. A method of intrinsic reward based video game decision making as recited in claim 2, wherein: in step S3, the target mapping network and the prediction network are defined as shown in formula (3-1) and formula (3-2), respectively:
the target mapping network is defined as the mapping from a state to a target embedding vector:
f_T(s) = e_T    (3-1)
where
f_T — target mapping network;
s — state;
e_T — target embedding vector;
the prediction network is defined as the mapping from a state to a prediction embedding vector:
f_P(s) = e_P    (3-2)
where
f_P — prediction network;
s — state;
e_P — prediction embedding vector.
4. A method of intrinsic reward based video game decision making as recited in claim 3, wherein: in step S3, the loss function L_I of the intrinsic reward generation module is defined as:
L_I = ||e_P - e_T||^2 + λΩ(θ)
where
e_P — prediction vector;
e_T — target vector;
Ω(θ) — parameter regularization term;
λ — regularization term penalty factor.
5. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: in step S4, a combination of long-term intrinsic rewards and round-based external rewards is adopted.
6. A method of intrinsic reward based video game decision making as recited in claim 1, wherein: step S4 includes an intrinsic reward policy optimization algorithm, which is as follows:
Input: total iteration rounds J, policy update step number K, round step number T, attenuation factor γ, time step t;
Output: policy network parameters θ, prediction network parameters φ;
(1) initializing the parameters and the target mapping network;
(2) when j' ∈ [1, J]:
(3) when k ∈ [1, T], circularly executing steps (4) to (9); otherwise, executing step (10);
(4) sampling the current time-step action a_t according to the action policy π_θ;
(5) executing the action a_t to obtain the next state s_{t+1} and the external reward r_t^E;
(6) calculating the intrinsic reward value r_t^I;
(7) storing the agent information quintuple into the experience replay pool D;
(8) updating the time step t ← t + 1, and increasing the value of j' by 1;
(9) increasing the value of k by 1;
(10) calculating the immediate reward value R and the reward advantage estimate A;
(11) normalizing the environment information according to the experience replay samples;
(12) when l ∈ [1, K], circularly executing steps (13) to (15); otherwise, executing step (16);
(13) updating the policy network parameters θ using the experience replay samples and the reward advantage estimate A;
(14) updating the prediction network parameters φ using the experience replay samples;
(15) increasing the value of l by 1;
(16) returning the policy network parameters θ and the prediction network parameters φ.
7. The intrinsic reward based video game decision method of claim 6, wherein step S5 includes: using the constructed neural network model and the acquired simulated game environment to interact and obtain game records; game images are generated by the video game simulation environment and input into the neural network model, the neural network model generates legal actions and returns them to the simulation environment, meanwhile the value network generates external values and the intrinsic reward generation network generates intrinsic values, and finally the external values and the intrinsic values are fused; at the same time the simulation environment gives a score and the next image based on the actions generated by the neural network model.
8. The intrinsic reward based video game decision method of claim 6, wherein step S6 includes: updating the neural network model according to the intrinsic reward policy optimization algorithm using the acquired game records.
CN202010370070.1A 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards Active CN111260040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010370070.1A CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010370070.1A CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Publications (2)

Publication Number Publication Date
CN111260040A CN111260040A (en) 2020-06-09
CN111260040B true CN111260040B (en) 2020-11-06

Family

ID=70955207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010370070.1A Active CN111260040B (en) 2020-05-06 2020-05-06 Video game decision method based on intrinsic rewards

Country Status (1)

Country Link
CN (1) CN111260040B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112221140B (en) * 2020-11-04 2024-03-22 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training action determination model of virtual object
CN112818672A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Reinforced learning emotion analysis system based on text game
CN113392971B (en) * 2021-06-11 2022-09-02 武汉大学 Strategy network training method, device, equipment and readable storage medium
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113947022B (en) * 2021-10-20 2022-07-12 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model
CN116384469B (en) * 2023-06-05 2023-08-08 中国人民解放军国防科技大学 Agent policy generation method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218699A1 (en) * 2016-06-17 2017-12-21 Graham Leslie Fyffe System and methods for intrinsic reward reinforcement learning
NZ759818A (en) * 2017-10-16 2022-04-29 Illumina Inc Semi-supervised learning for training an ensemble of deep convolutional neural networks
CN109978133A (en) * 2018-12-29 2019-07-05 南京大学 A kind of intensified learning moving method based on action mode

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Intrinsically motivated model learning for developing curious robots; Todd Hester, et al.; Artificial Intelligence; 2015-12-31; pp. 1-17 *
Online autonomous learning in unknown environments for robots driven by internal motivation; Hu Qixiang et al.; Computer Engineering and Applications; 2014-12-31; Vol. 50, No. 4; pp. 110-113 *

Also Published As

Publication number Publication date
CN111260040A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111260040B (en) Video game decision method based on intrinsic rewards
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111111220B (en) Self-chess-playing model training method and device for multiplayer battle game and computer equipment
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN105637540A (en) Methods and apparatus for reinforcement learning
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN114330651A (en) Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN113947022B (en) Near-end strategy optimization method based on model
Chen et al. Visual hide and seek
CN113894780B (en) Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN116796844A (en) M2 GPI-based unmanned aerial vehicle one-to-one chase game method
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
Adamsson Curriculum learning for increasing the performance of a reinforcement learning agent in a static first-person shooter game
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN111437605B (en) Method for determining virtual object behaviors and hosting virtual object behaviors
Rajabi et al. A dynamic balanced level generator for video games based on deep convolutional generative adversarial networks
CN115944921B (en) Game data processing method, device, equipment and medium
Liu et al. Soft-Actor-Attention-Critic Based on Unknown Agent Action Prediction for Multi-Agent Collaborative Confrontation
Liu Meta-Reinforcement Learning: Algorithms and Applications
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant