CN111260039A - Video game decision-making method based on auxiliary task learning - Google Patents

Video game decision-making method based on auxiliary task learning

Info

Publication number
CN111260039A
Authority
CN
China
Prior art keywords
reward
task
value
network
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010369831.1A
Other languages
Chinese (zh)
Other versions
CN111260039B (en)
Inventor
王轩
张加佳
漆舒汉
曹睿
杜明欣
刘洋
蒋琳
廖清
夏文
李化乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010369831.1A priority Critical patent/CN111260039B/en
Publication of CN111260039A publication Critical patent/CN111260039A/en
Application granted granted Critical
Publication of CN111260039B publication Critical patent/CN111260039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • A63F13/46Computing the game score
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video game decision-making method based on auxiliary task learning, which comprises the following steps: S1, constructing a neural network model; S2, starting a multi-process video game environment; S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6; S4, obtaining game experience and updating an experience pool; S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3; S6, saving the neural network model; S7, making decisions with the neural network model in the video game; and S8, ending. The invention has the beneficial effect that state values in a three-dimensional scene, and the agent actions that cause state changes, can be estimated more accurately.

Description

Video game decision-making method based on auxiliary task learning
Technical Field
The invention relates to a video game decision-making method, in particular to a video game decision-making method based on auxiliary task learning.
Background
Video games first appeared in the early 1970s. Since their birth, enabling agents in video games to make decisions automatically through artificial intelligence has been a research hotspot in both industry and academia, with great commercial value. In recent years, the rapid development of deep reinforcement learning methods has provided an effective way to realize this technology. Generally speaking, the quality of a game decision-making technique is judged by the score obtained in the game or by whether the game can be won, and this is also the case for video games.
Artificial intelligence technology is developing rapidly, and machine game playing, as one of its popular research fields, has received wide attention from researchers. In recent years, machine game methods typified by deep reinforcement learning algorithms have developed quickly. On the one hand, the success of Go agents such as AlphaGo marks a major breakthrough of deep reinforcement learning algorithms in the field of complete information machine games. On the other hand, incomplete information machine games have become a new research focus in artificial intelligence because of their high complexity, incomplete information perception and other characteristics.
Although a deep reinforcement learning method based on an intrinsic reward policy optimization algorithm can handle the high-dimensional state space and incomplete information perception of video games well, the intrinsic reward generation module works by quantitatively generating intrinsic reward values according to the uncertainty of unseen states. As the agent gradually explores the environment, its range of activity in the scene grows wider and wider, so the probability of encountering unseen states becomes smaller and smaller. Thus, as the exploration area expands, the intrinsic reward values gradually decrease and cease to be a primary source of reward signals in the experience replay pool. To obtain reward values useful for learning long-term actions from complex three-dimensional video game inputs, relying only on intrinsic rewards to enrich the sources of reward information is therefore not a complete solution. On the other hand, action decision-making by an agent in a three-dimensional video game is generally a step-by-step and rather long sequential process; the agent's action decisions are highly correlated with the current state picture, the actions made by the agent cause the state picture to change, and the state value is estimated by a neural network in deep reinforcement learning.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a video game decision-making method based on auxiliary task learning.
The invention provides a video game decision-making method based on auxiliary task learning, which comprises the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
The invention has the beneficial effect that, with this scheme, the state value in a three-dimensional scene and the agent actions that cause state changes can be estimated more accurately.
Drawings
FIG. 1 is a flow chart of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 2 is a model structure diagram of the reward prediction network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 3 is a model structure diagram of the state value auxiliary network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 4 is a model structure diagram of the action value auxiliary network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 5 is a schematic diagram of the AIBPO model of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 6 is a diagram of the AIBPO network architecture of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 7 is a diagram of the reinforcement learning model of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 8 is a diagram of the action decision process of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 9 is a diagram of a Vizdoom platform path-finding scene in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 10 is a diagram of a Vizdoom platform survival scene in the video game decision-making method based on auxiliary task learning according to the present invention.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
The method applies deep reinforcement learning combined with an advanced intrinsic reward mechanism to form a decision model and technique with a certain level of intelligence, so that the game agent obtains high scores in the video game; this is the core content of the invention.
As shown in fig. 1, a video game decision-making method based on auxiliary task learning includes the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
From the perspective of enhancing the agent's perception of environmental reward information and the estimation accuracy of reinforcement learning state information, the method proposes and designs three types of auxiliary learning tasks according to the auxiliary task learning mechanism in multi-task learning. Using experience replay, the agent's interaction data is sampled to train the auxiliary tasks, so that deep reinforcement learning and auxiliary task learning are effectively combined. On this basis, the auxiliary task learning mechanism is combined with the intrinsic reward policy optimization algorithm, further improving the performance in three-dimensional scenes of an agent trained with the original reinforcement learning algorithm.
The method mainly provides a feasible scheme for enhancing the agent's perception of environmental reward information and improving the estimation accuracy of reinforcement learning state information.
The auxiliary task learning mechanism is a class of approaches in the field of multi-task learning. Compared with a learning task that optimizes only a single objective, multi-task learning can exploit richer auxiliary information from multiple angles, and an auxiliary task learning mechanism improves the original main task model by means of related tasks. This brings advantages such as faster learning, a more robust algorithm and a better training effect, which match the requirements of training action strategies for agents in three-dimensional scenes. The performance of the agent in a three-dimensional video game scene is improved by designing auxiliary tasks related to its training process. For a deep reinforcement learning algorithm, the key part of agent training is learning and updating the parameters of the policy network that represents the agent's action strategy, with the goal of maximizing the obtained reward value. The goal of auxiliary task learning is therefore to let the agent learn a more efficient and robust action strategy by using the implicit decision information provided by the auxiliary tasks on top of the main action strategy; different auxiliary tasks correspond to different auxiliary strategies that provide decision support information for the agent. A reward feature enhancement framework based on the auxiliary task learning mechanism is proposed and defined as follows:
arg max_π ( E_π[R] + Σ_{c∈C} E_{π_c}[R_c] )   (3-1)
where:
π - the action policy of the agent;
C - the set of auxiliary tasks;
R - the discounted reward value (R_c denoting the discounted reward of auxiliary task c);
π_c - the policy corresponding to auxiliary task c;
Formula (3-1) defines the basic framework of the auxiliary task learning mechanism: the auxiliary tasks provide corresponding auxiliary strategies for the agent in the three-dimensional scene. The basic assumption for selecting auxiliary tasks is that they should be closely related to the main task, or be tasks that can benefit the learning process of the main task, and their design goal is to bias the agent toward learning a more effective action strategy. Based on this assumption, the reward prediction auxiliary task, the state value auxiliary task and the action value auxiliary task are proposed as related tasks of the intrinsic reward policy optimization algorithm, improving the training effect of the deep reinforcement learning algorithm in three-dimensional video games by enhancing the agent's perception of environmental information and the estimation accuracy of state information.
The specific process of constructing the neural network is as follows:
1. A reward feature enhancement algorithm based on auxiliary task learning;
a. A feature enhancement method based on reward prediction;
the reward value is a driving factor for updating the strategy of the intelligent agent by a deep reinforcement learning algorithm, the intelligent agent can directly obtain reward feedback of the environment at the current state in a three-dimensional scene, but the final decision goal of the intelligent agent is to maximize the obtained accumulated reward value, which usually needs to be obtained through a series of continuous actions. Aiming at the characteristic of the deep reinforcement learning algorithm training process in the three-dimensional scene, a reward prediction auxiliary task is designed, and the auxiliary task is based on the assumption that: in order for the algorithm to learn the dynamic changes of the environmental information as a whole, the agent needs to identify the features of the input states that can bring higher reward feedback values at a future time, at which point the agent can tend to make action decisions that bring about long-term maximized rewards. Thus, if the secondary policy enables the agent to better predict the environmental reward value to be harvested at the next time, the secondary policy may help the agent learn a more efficient representation of the state value, and the information it extracts from the state may also facilitate the action policy learning of the primary task.
The method proposes the following reward prediction auxiliary task: the state information and reward information in the experience replay pool D are sampled, and the input frame pictures of consecutive time steps together with the reward information of the next state are used as data samples to train a reward prediction network. The reward prediction network is represented by a simple shallow convolutional neural network, and the model structure of this task is shown in fig. 2.
In the reward prediction auxiliary task, input frame pictures at three consecutive moments are sampled as the input of the network, and a classification category is output after processing by the convolutional layers and the fully connected layer. The classification categories are defined as the categories of the reward value obtained by the agent: positive reward, negative reward and zero reward. The label of the classification task is the one-hot encoding corresponding to the reward information at the next moment sampled from the experience replay pool. Following the multi-class cross entropy loss commonly used for classification tasks, the loss function L_rp of the reward prediction network is defined as:
L_rp = -Σ_i r'_i · log(ŷ_i)   (3-2)
where:
ŷ - the classification category output by the network;
r' - the one-hot encoding of the reward value at the next moment;
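As a concrete illustration, the following is a minimal PyTorch sketch of a reward prediction head of the kind described above, paired with the cross-entropy loss of equation (3-2). The layer sizes, the 84x84 input resolution and the names RewardPredictionNet and reward_prediction_loss are illustrative assumptions, not the exact architecture of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictionNet(nn.Module):
    """Shallow CNN that classifies the next-step reward of 3 stacked frames
    as zero, positive or negative (layer sizes are illustrative)."""
    def __init__(self, in_frames=3, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 9 * 9, n_classes)  # assumes 84x84 input frames

    def forward(self, frames):                      # frames: (B, 3, 84, 84)
        h = self.conv(frames)
        return self.fc(h.flatten(start_dim=1))      # unnormalised class scores

def reward_prediction_loss(logits, reward_class):
    """Multi-class cross entropy of equation (3-2); reward_class holds the
    index of the one-hot label (illustrative mapping: 0 zero, 1 positive, 2 negative)."""
    return F.cross_entropy(logits, reward_class)
```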
Because the reward values in the experience replay pool of the deep reinforcement learning algorithm include positive, negative and zero rewards in different proportions, and in a three-dimensional video game scene most of the experience replay samples generated by the agent's interaction carry zero reward, directly sampling from the replay pool to train the reward prediction network would make a large fraction of the training samples zero-reward ones. Therefore, a skewed sampling strategy is used when sampling data samples from the experience replay pool: non-zero-reward and zero-reward transitions are given the same sampling probability, so that the proportions of the different types of reward values in the training samples do not differ too markedly. By designing the reward prediction auxiliary task, the algorithm pays attention during training to the relation between the recent input state pictures and the imminent reward information, so that the agent recognizes state pictures that may bring reward signals in the three-dimensional scene and tends to take actions that bring reward value feedback.
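The skewed sampling described above can be implemented, for example, by drawing half of each training batch from zero-reward transitions and half from non-zero-reward transitions. The helper below is a hypothetical sketch under that assumption, not code from the patent; it assumes each pool entry is a (frames, reward) pair.

```python
import random

def sample_reward_balanced(replay_pool, batch_size):
    """Draw a batch in which zero-reward and non-zero-reward transitions
    have equal sampling probability (sketch of the skewed sampling strategy)."""
    zero = [t for t in replay_pool if t[1] == 0]
    nonzero = [t for t in replay_pool if t[1] != 0]
    half = batch_size // 2
    batch = random.choices(nonzero, k=half) if nonzero else []
    batch += random.choices(zero, k=batch_size - len(batch)) if zero else []
    random.shuffle(batch)
    return batch
```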
b. A state value based feature enhancement method;
the advantage function estimation value stored in the experience replay pool is used for calculation when strategy updating is carried out in the internal reward strategy optimization algorithm, the calculation process of the advantage estimation value needs to use a reward value and a state value, wherein the reward value is derived from an internal reward signal generated by the internal reward generation module and an external reward signal fed back by the environment, the state value is derived from value estimation of the current input state by a value network of the internal reward strategy optimization algorithm, and the value estimation in deep reinforcement learning is biased, so that if the auxiliary strategy enables the deep reinforcement learning algorithm to tend to do more accurate state value estimation output, the calculated advantage estimation value is more accurate, the training process of the intelligent agent is more stable by the auxiliary strategy, and learning of the main action strategy is more efficient. Based on such considerations, the following state value assistance tasks are proposed herein: the internal reward strategy optimization algorithm intelligent agent stores the environment interaction data tuples in the experience playback pool in the training process and trains the value of the historical state in the experience pool for the algorithm
Figure 121265DEST_PATH_IMAGE011
Sampling to obtain input frame picture of continuous time and next timeAs a data sample to train a state value assistance task. The state value network is composed of a shallow convolutional neural network and a Long-Short Term Memory (LSTM), and the model structure of the task is shown in fig. 3.
In the state value network shown in fig. 3, input frame pictures at three consecutive moments are sampled as the input of the network, and the state value of the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer. The label of this regression prediction task is the state value corresponding to the state picture at the next moment sampled from the experience replay pool. Following the mean square error loss commonly used for regression tasks, the loss function L_sv of the state value network is defined as:
L_sv = (V_target - V_out)^2 + λ_v·||θ||^2   (3-3)
where:
||θ||^2 - the parameter regularization term;
V_out - the state value output by the network;
V_target - the target state value;
λ_v - the regularization term penalty factor;
The target state value V_target in the state value auxiliary task is sampled from the experience replay pool together with the consecutive input frame pictures, which is equivalent to directly using the agent's interaction data samples. Since the state value is a value output by the value network of the intrinsic reward policy optimization algorithm and varies in magnitude, samples are drawn directly from the experience replay pool with a Gaussian strategy, without considering the proportions of state values of different magnitudes. By designing the state value auxiliary task, the algorithm replays the state value estimates of recent training batches during the agent's training, so that the value network tends to estimate the state value corresponding to the input frame picture at each moment more accurately; the advantage function value is therefore computed more accurately and the agent's action strategy learning process becomes more stable.
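A minimal sketch of a CNN + LSTM state value head trained with the mean square error loss of equation (3-3) is given below. The hidden sizes, the 84x84 single-channel frames and the names StateValueNet and state_value_loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StateValueNet(nn.Module):
    """Shallow CNN followed by an LSTM that regresses the state value of the
    next frame from three consecutive input frames (sizes are illustrative)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 8, 4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, 2), nn.ReLU(), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 * 9 * 9, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):                      # frames: (B, 3, 84, 84)
        B, T, H, W = frames.shape
        feats = self.conv(frames.reshape(B * T, 1, H, W)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1]).squeeze(-1)    # predicted next-step value

def state_value_loss(v_pred, v_target, params, reg_coef=1e-4):
    """Mean square error plus an L2 parameter regulariser, as in equation (3-3)."""
    reg = sum(p.pow(2).sum() for p in params)
    return torch.mean((v_target - v_pred) ** 2) + reg_coef * reg
```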
c. A feature enhancement method based on action value;
During the training of the agent under the intrinsic reward policy optimization algorithm, the policy network outputs the agent's action probability distribution, and the algorithm samples an action from this distribution, which is taken as the agent's action output for the corresponding input state. The agent's action decisions in the environment cause its environmental state to change, and reward feedback from the environment is obtained as a result. The action value represents the immediate reward feedback that the agent can obtain by taking a certain action in the current state; if an action can bring higher immediate reward feedback, it has a higher action value. Thus, if the auxiliary strategy can bias a deep reinforcement learning agent toward actions with greater action value, the agent will more easily learn, over the whole training process, a main action strategy that obtains reward feedback efficiently. Accordingly, the following action value auxiliary task is proposed: the action values stored during the training of the intrinsic reward policy optimization agent are sampled, and the input frame pictures at consecutive moments together with the temporal-difference action value are used as data samples to train the action value auxiliary task. The action value network consists of a shallow convolutional neural network and a long short-term memory network, and the model structure of this task is shown in fig. 4.
In the action value network, input frame pictures at three consecutive moments are sampled as the input of the network, and the action value of the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer. The label of this regression prediction task is the temporal-difference action value sampled from the experience replay pool, i.e. it is computed from the immediate reward value and the discounted state value in the experience replay pool. Following the mean square error loss commonly used for regression tasks, the loss function L_av of the action value network is defined as:
L_av = (Q_TD - Q_out)^2 + λ_a·||θ||^2   (3-4)
where:
λ_a - the regularization term penalty factor;
||θ||^2 - the parameter regularization term;
Q_out - the action value output by the network;
Q_TD - the temporal-difference action value;
The temporal-difference action value Q_TD in the action value auxiliary task is computed from data samples in the experience replay pool. Similar to the state value auxiliary task, the magnitude of the temporal-difference action value is itself determined by the reward value and the state value, so the proportions of positive, negative and zero values do not need to be considered. By designing the action value auxiliary task, the policy network of the algorithm is driven to output the action value corresponding to each input frame picture more accurately, so that the agent tends to take actions that bring larger reward values.
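One way to form the temporal-difference action value label used by this auxiliary task, together with the loss of equation (3-4), is sketched below; the discount factor gamma, the regularization coefficient and the function names are illustrative assumptions.

```python
import torch

def td_action_value(reward, next_state_value, gamma=0.99):
    """Temporal-difference action value label: immediate reward plus the
    discounted state value of the next state, both read from the replay pool."""
    return reward + gamma * next_state_value

def action_value_loss(q_pred, q_td, params, reg_coef=1e-4):
    """Mean square error between the predicted action value and the TD label,
    plus an L2 parameter regulariser, as in equation (3-4)."""
    reg = sum(p.pow(2).sum() for p in params)
    return torch.mean((q_td - q_pred) ** 2) + reg_coef * reg
```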
2. An intrinsic reward policy optimization algorithm combined with auxiliary tasks;
The learning and updating of the policy network parameters of a deep reinforcement learning algorithm determines the agent's action decisions in the three-dimensional scene, and the agent's main action strategy is determined by this network. The auxiliary tasks are proposed to provide auxiliary decision support information for the agent's actions: through the different auxiliary tasks, the agent takes the auxiliary strategies into account while optimizing the main action strategy, and thus learns a more effective and robust action strategy for the three-dimensional scene. The intrinsic reward policy optimization algorithm proposed by the method can be used to learn the agent's action strategy, so the auxiliary tasks are combined with it in order to provide auxiliary decision information for the intrinsic reward policy optimization agent. The key to combining the two is the experience replay pool of deep reinforcement learning: training the auxiliary tasks requires data samples containing the corresponding state, reward and action information, and exactly these samples are stored during the training of the intrinsic reward policy optimization agent. Therefore, the auxiliary tasks and the intrinsic reward policy optimization algorithm can be combined through the experience replay pool, namely by adding, to the policy update procedure of the intrinsic reward policy optimization algorithm, a step that updates the network parameters of the auxiliary tasks according to experience replay samples. In this way, the intrinsic reward policy optimization algorithm combined with auxiliary tasks (AIBPO) is defined as:
L_AIBPO = L_IBPO + λ_r·L_rp + λ_v·L_sv + λ_a·L_av   (3-5)
where:
λ_r - the weight parameter of the reward prediction task;
λ_v - the weight parameter of the state value task;
λ_a - the weight parameter of the action value task;
L_rp - the loss function of the reward prediction task;
L_sv - the loss function of the state value task;
L_av - the loss function of the action value task;
L_IBPO - the loss function of the intrinsic reward policy optimization algorithm;
the formula (3-5) provides an overall framework of the AIBPO algorithm, different auxiliary tasks are introduced to enable the IBPO algorithm to learn related information provided by different auxiliary tasks in the training process, the related information is related to strategy updating of the deep reinforcement learning algorithm or information perception in a scene, and the expression effect of the strategy optimization algorithm in a three-dimensional scene is improved from different angles. In addition, the influence of the auxiliary task on the main task is determined by setting a weight parameter for the auxiliary task loss function. The overall model structure of the AIBPO algorithm is shown in fig. 5.
When training the agent, the deep reinforcement learning algorithm stores the corresponding reinforcement learning state information in the experience replay pool in the form of tuples. The information samples obtained from the interaction between the intrinsic-reward-policy-optimization-based agent and the environment during training are stored in the corresponding experience replay pool, and the data in this pool are used as training samples for the auxiliary tasks under different sampling strategies; this makes effective use of the characteristics of the deep reinforcement learning algorithm and provides sufficient training samples for the auxiliary tasks.
3. Constructing a network model;
the method is improved on the basis of IBPO (intuitive Based policy optimization), a reward feature enhancement algorithm Based on auxiliary task learning is added on the basis of IBPO, and the main body structure is shown in figure 6. The neural network built by using the deep learning framework pytorch is integrally divided into 3 parts.
a. A policy optimization part, which comprises an Actor network and a Critic network that share part of their hidden layer neurons. In order to give the extrinsic reward and the intrinsic reward independent supervision information, the Critic network in the policy optimization network structure has two independent reward output heads; the network structure is shown in Table 3-1.
TABLE 3-1 policy optimization Algorithm network architecture
b. An intrinsic reward part, in which the target mapping network has the same network structure as the prediction network; the network structure is shown in Table 3-2.
TABLE 3-2 intrinsic reward model network architecture
c. An auxiliary task part, which improves the performance of the algorithm in the three-dimensional video game from the perspective of enhancing the agent's environment perception and state estimation capabilities. The data samples of the three auxiliary tasks come from the experience replay data stored when the agent trained by the deep reinforcement learning algorithm interacts with the environment. The networks of the reward prediction, state value and action value auxiliary tasks consist of a shallow convolutional neural network and a long short-term memory network. The network structure of the auxiliary learning tasks is shown in Table 3-3.
TABLE 3-3 auxiliary task learning network architecture
Starting the multi-process video game environment includes: reinforcement learning needs a large amount of data to optimize the network, and a single process cannot meet the demand for training data, so as many game processes as possible are run with PyTorch's multiprocessing module when computing resources are sufficient. Using the multiprocessing module, CUDA tensor sharing, deadlock avoidance, buffer reuse when sending through queues, and asynchronous multi-process training can all be achieved.
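A minimal sketch of launching several game processes with torch.multiprocessing and collecting their transitions through a shared queue is shown below. DummyEnv, the worker layout, the process count and the step counts are illustrative assumptions rather than the patent's exact implementation.

```python
import random
import torch.multiprocessing as mp

class DummyEnv:
    """Stand-in for a Vizdoom environment wrapper (illustrative only)."""
    def reset(self):
        return [0.0] * 4
    def sample_action(self):
        return random.randrange(3)
    def step(self, action):
        return [0.0] * 4, random.choice([0.0, 1.0]), random.random() < 0.01

def env_worker(worker_id, experience_queue, n_steps):
    """One game process: interacts with its own environment copy and pushes
    (s, a, r, s') tuples into the shared queue."""
    env = DummyEnv()
    state = env.reset()
    for _ in range(n_steps):
        action = env.sample_action()
        next_state, reward, done = env.step(action)
        experience_queue.put((state, action, reward, next_state))
        state = env.reset() if done else next_state

if __name__ == "__main__":
    mp.set_start_method("spawn")                 # safe start method with CUDA
    queue = mp.Queue()
    workers = [mp.Process(target=env_worker, args=(i, queue, 1000), daemon=True)
               for i in range(4)]
    for w in workers:
        w.start()
    experience_pool = [queue.get() for _ in range(2000)]   # drain transitions
    print(len(experience_pool), "transitions collected")
```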
Obtaining game experience and updating the experience pool specifically includes: the goal of winning the game can broadly be regarded as seeking a reasonable behaviour a in the environment state s, and what the method does is to seek corrections using the environment feedback r. Here, the state s is the state of the individual and the relevant area when the action occurs; the behaviour a refers to the corresponding action made by the agent in state s; and the feedback r refers to the evaluation reward given by the environment for that action. In the reinforcement learning problem, the agent can change the environment state s through action a; the reinforcement learning method changes the behaviour a by using the evaluation feedback r; and the behaviour a and the state s together determine the corresponding feedback value r. The flow is shown in fig. 7. All that is required in this step is to collect the (s, a, r, s') quadruples and store them in the experience pool, providing a data basis for model training.
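A minimal experience pool for the (s, a, r, s') quadruples mentioned above can look like the following; the capacity and the class name ExperiencePool are illustrative choices.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity pool of (s, a, r, s') quadruples with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```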
Inputting the experience into the network and updating the network parameters specifically includes: the advantage values, the intrinsic reward values and the auxiliary task values are calculated from the data acquired in the previous step. The network is then updated with the loss L_AIBPO of equation (3-5): gradients are computed and the network parameters are updated by gradient steps, so that the network is continuously optimized and the agent obtains higher returns.
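A hedged sketch of this parameter update, using the combined loss shown earlier with a standard optimizer (in practice the return-maximizing gradient ascent is implemented as gradient descent on the loss), is given below; the gradient clipping threshold is an illustrative safeguard, not part of the patent.

```python
import torch

def update_step(model, optimizer, total_loss):
    """One gradient update of the network parameters from the combined loss."""
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # optional safeguard
    optimizer.step()
```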
Making decisions with the model in the video game specifically includes: as shown in fig. 8, after the game starts, the agent obtains each game frame, normalizes it, and feeds 4 consecutive frames into the trained model as input. The model returns the action that can obtain a high return, the agent executes that action in the game, a new game picture is then obtained, and the above process is repeated until the game ends.
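The decision loop of fig. 8 can be sketched as follows. The preprocessing (division by 255), the frame stack length of 4, the placeholder environment interface (env.reset / env.step) and the trained policy network standing in for the saved model are all assumptions for illustration.

```python
from collections import deque
import numpy as np
import torch

def normalize(frame):
    """Scale a raw game frame of uint8 pixels to the range [0, 1]."""
    return np.asarray(frame, dtype=np.float32) / 255.0

@torch.no_grad()
def play_episode(env, policy_net, device="cpu"):
    """Run one game with the trained model: stack 4 consecutive normalized
    frames, pick the highest-scoring action, and repeat until the game ends."""
    frames = deque(maxlen=4)
    state, done = env.reset(), False
    while not done:
        frames.append(normalize(state))
        while len(frames) < 4:                 # pad at the start of the game
            frames.append(frames[-1])
        obs = torch.from_numpy(np.stack(frames)).unsqueeze(0).to(device)
        action = policy_net(obs).argmax(dim=-1).item()
        state, reward, done = env.step(action)
```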
As shown in figs. 9-10, the invention can be applied to various video games; Vizdoom is used as an example to show the advantages of the technique. The Vizdoom platform is a first-person three-dimensional video game, and the behaviour of the agent in its scenes is similar to that of a subject in the real world: it receives visual signals and then makes action decisions. As a test platform for current mainstream deep reinforcement learning algorithms, the Vizdoom platform provides interfaces for receiving action inputs and feeding back reward signals, and simulates the environment in the reinforcement learning model. The Vizdoom platform provides comprehensive testing capability for training agents to explore three-dimensional environments; path-finding, survival and exploration scenes based on the platform are used.
The application and development environment of the method is shown in table 1:
TABLE 1 Experimental development Environment
The present invention is compared to several prior methods:
(1) IBPO: an Intrinsic Based Policy Optimization algorithm.
(2) DRQN: Deep Recurrent Q-Networks, which adds a recurrent layer after the convolutional layers to handle partially observable states.
(3) DFP: Direct Future Prediction, which won the Full Deathmatch track of the Visual Doom AI Competition 2016.
(4) A3C: Asynchronous Advantage Actor-Critic, which runs Actor-Critic training in multiple threads in parallel, making effective use of computer resources and improving training efficiency.
(5) Rainbow: an algorithm that integrates several common DQN improvements to raise the performance of the agent.
The performance of the AIBPO algorithm is compared as follows:
In the Vizdoom survival scenario, the survival time step and the number of picked-up health packs are used as evaluation indexes, equivalent to the game score in deep reinforcement learning. The survival time step is defined as the average, over 100 interactive verifications in the scene after the algorithm has converged, of the maximum number of action steps the agent survives. The number of picked-up health packs is defined as the average, over 100 interactive verifications in the scene after the algorithm has converged, of the number of health packs picked up by the agent. The experimental results of the survival scenario are shown in Table 2.
Table 2 Survival scenario experimental results (comparison I)
As can be seen from Table 2, the IBPO algorithm lags behind the DFP algorithm on both the survival time step and the health pack indexes, and is only slightly stronger than the DRQN algorithm on the health pack index. This is because the IBPO algorithm mainly addresses the sparse reward problem in three-dimensional scenes, whereas reward information is plentiful in most cases in the survival scenario, so the IBPO algorithm is not prominent there. The AIBPO algorithm introduces the three types of auxiliary tasks; as the analysis shows, the reward-related auxiliary tasks mainly enhance the agent's reward perception and state estimation capabilities during training. Because the maze characteristics of the survival scenario require the agent to plan over the long term based on historical information, and the rich reward information in the survival scenario can be exploited by the auxiliary tasks, the AIBPO algorithm outperforms the IBPO and DRQN algorithms in the survival scenario experiment, approaches the DFP algorithm on the survival time step index, and surpasses the DFP algorithm on the health pack index. Since deep reinforcement learning algorithms perform differently on different types of task scenarios, and the IBPO algorithm proposed herein mainly addresses exploration in sparse-reward three-dimensional scenes, the related path-finding scenario experiments show that IBPO improves on the DFP algorithm in that scene. The AIBPO algorithm proposed herein mainly compensates, through the auxiliary task mechanism, for the IBPO algorithm's weakness in perceiving environmental reward information; its main body is still the IBPO algorithm, and the experimental results of the survival scenario show that the auxiliary task learning mechanism, as an auxiliary means, brings a considerable improvement to the IBPO algorithm. Considering that deep reinforcement learning algorithms perform differently across task scenarios, the gap between the AIBPO and DFP algorithms on the survival time step index in the survival scenario is small, and it is difficult for existing algorithms to comprehensively surpass the DFP algorithm.
Although there is still some performance gap between the AIBPO algorithm and the DFP algorithm, the AIBPO algorithm is a large improvement over traditional deep reinforcement learning algorithms; the experimental comparison with the A3C algorithm and the Rainbow algorithm is shown in Table 3.
Table 3 Survival scenario experimental results (comparison II)
As can be seen from Table 3, the A3C algorithm and the Rainbow algorithm are existing algorithms with good performance among Actor-Critic-based and value-iteration-based deep reinforcement learning algorithms respectively, yet the agents they train perform worse in the survival scenario than the AIBPO algorithm. The reason is that the A3C and Rainbow algorithms lack special handling of the three-dimensional scene reinforcement learning task, so in a three-dimensional scene such as the survival scenario, which combines maze exploration with reward collection, they find it hard to reach the performance they achieve in Atari video games and continuous control tasks.
Taking the above analysis together with the performance of the AIBPO agent combined with the auxiliary task learning mechanism in the survival scenario, the performance is obviously improved: the survival time steps and the ability to pick up health packs clearly exceed those of traditional deep reinforcement learning algorithms and reach a level close to that of the competition agents.
The auxiliary task comparative experiment was as follows:
the AIBPO algorithm comprises reward prediction tasks
Figure 491679DEST_PATH_IMAGE027
State value task
Figure 908885DEST_PATH_IMAGE027
And action value tasks
Figure 673579DEST_PATH_IMAGE040
In order to reflect the improvement effect of each auxiliary task on the original strategy optimization algorithm, the method is characterized in that different types of auxiliary tasks are independently added on the basis of the IBPO algorithm to compare and evaluate the effects of the three types of tasks. The experiments for evaluating each auxiliary task are shown in Table 4
Table 4 Comparison of auxiliary task experimental results
As can be seen from Table 4, the agents trained with the baseline algorithm combined with each of the different auxiliary tasks all improve in the survival scenario, and each auxiliary task improves the baseline algorithm to a different degree. The reward prediction auxiliary task brings the most obvious improvement on the health pack index, while the action value auxiliary task achieves the best result of the three on the survival time step index. The experimental results show that the IBPO algorithm combined with each auxiliary task greatly improves the agent's performance in the three-dimensional scene over the original, so that the agent survives much longer in the survival scenario. Overall, the AIBPO algorithm combining the three auxiliary tasks improves the related indexes even more obviously, which shows that the cooperation of the three auxiliary tasks can further improve the performance of the trained agent in the three-dimensional video game.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of them shall be considered as falling within the protection scope of the invention.

Claims (8)

1. A video game decision-making method based on auxiliary task learning, characterized by comprising the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
2. The video game decision-making method based on auxiliary task learning as claimed in claim 1, wherein step S1 comprises the following substeps:
S101, a reward feature enhancement method based on auxiliary task learning;
S102, an intrinsic reward policy optimization method combined with auxiliary task learning;
S103, constructing a neural network model.
3. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein the reward feature enhancement method based on auxiliary task learning comprises the following framework definition:
arg max_π ( E_π[R] + Σ_{c∈C} E_{π_c}[R_c] )   (3-1)
in formula (3-1):
π - the action policy of the agent;
C - the set of auxiliary tasks;
R - the discounted reward value (R_c denoting the discounted reward of auxiliary task c);
π_c - the policy corresponding to auxiliary task c;
formula (3-1) defines the basic framework of the auxiliary task learning mechanism, in which auxiliary task learning provides corresponding auxiliary strategies for the agent in the three-dimensional scene.
4. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein step S101 comprises the following substeps:
S1011, a feature enhancement method based on reward prediction;
S1012, a feature enhancement method based on state value;
and S1013, a feature enhancement method based on action value.
5. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1011 comprises: using the state information and reward information in the experience replay pool, sampling the input frame picture information of consecutive time steps and the reward information of the next state as data samples to train a reward prediction network, the reward prediction network being represented by a shallow convolutional neural network; in the reward prediction network, input frame pictures at three consecutive moments are sampled as the input of the network, and a classification category is output after processing by the convolutional layers and the fully connected layer, the classification categories being defined as categories of the reward value obtained by the agent and comprising positive reward, negative reward and zero reward; the label of the classification task is the one-hot encoding corresponding to the reward information at the next moment sampled from the experience replay pool; and, following the multi-class cross entropy loss function commonly used for classification tasks, the loss function L_rp of the reward prediction network is defined as:
L_rp = -Σ_i r'_i · log(ŷ_i)   (3-2)
in formula (3-2):
ŷ - the classification category output by the network;
r' - the one-hot encoding of the reward value at the next moment.
6. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1012 comprises: the intrinsic reward policy optimization agent stores environment interaction data tuples in an experience replay pool during training; the historical state values in the pool are sampled, and the input frame pictures at consecutive moments together with the state value at the next moment are used as data samples to train a state value auxiliary network, the state value auxiliary network mainly consisting of a shallow convolutional neural network and a long short-term memory network; in the state value auxiliary network, input frame pictures at three consecutive moments are sampled as the input of the network, and the state value corresponding to the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer; the label of the regression prediction task is the state value corresponding to the state picture at the next moment sampled from the experience replay pool; and, following the mean square error loss function commonly used for regression tasks, the loss function L_sv of the state value network is defined as:
L_sv = (V_target - V_out)^2 + λ_v·||θ||^2   (3-3)
in formula (3-3):
||θ||^2 - the parameter regularization term;
V_out - the state value output by the network;
V_target - the target state value;
λ_v - the regularization term penalty factor.
7. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1013 comprises: the action values stored during the training of the intrinsic reward policy optimization agent are sampled, and the input frame pictures at consecutive moments together with the temporal-difference action value are used as data samples to train an action value auxiliary network, the action value auxiliary network mainly consisting of a shallow convolutional neural network and a long short-term memory network; in the action value auxiliary network, input frame pictures at three consecutive moments are sampled as the input of the network, and the action value corresponding to the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer; the label of the regression prediction task is the temporal-difference action value sampled from the experience replay pool, which is computed from the immediate reward value and the discounted state value in the experience replay pool; and, following the mean square error loss function commonly used for regression tasks, the loss function L_av of the action value network is defined as:
L_av = (Q_TD - Q_out)^2 + λ_a·||θ||^2   (3-4)
in formula (3-4):
λ_a - the regularization term penalty factor;
||θ||^2 - the parameter regularization term;
Q_out - the action value output by the network;
Q_TD - the temporal-difference action value.
8. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein step S102 comprises: adding, to the policy update procedure of the intrinsic reward policy optimization method, a step of updating the network parameters of the auxiliary tasks according to experience replay samples, and defining the intrinsic reward policy optimization method combined with auxiliary task learning as:
L_AIBPO = L_IBPO + λ_r·L_rp + λ_v·L_sv + λ_a·L_av   (3-5)
in formula (3-5):
λ_r - the weight parameter of the reward prediction task;
λ_v - the weight parameter of the state value task;
λ_a - the weight parameter of the action value task;
L_rp - the loss function of the reward prediction task;
L_sv - the loss function of the state value task;
L_av - the loss function of the action value task;
L_IBPO - the loss function of the intrinsic reward policy optimization algorithm.
CN202010369831.1A 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning Active CN111260039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010369831.1A CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010369831.1A CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Publications (2)

Publication Number Publication Date
CN111260039A true CN111260039A (en) 2020-06-09
CN111260039B (en) 2020-08-07

Family

ID=70950007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010369831.1A Active CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Country Status (1)

Country Link
CN (1) CN111260039B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN114413910A (en) * 2022-03-31 2022-04-29 中国科学院自动化研究所 Visual target navigation method and device
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116747521A (en) * 2023-08-17 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN106422332B (en) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 Artificial intelligence operating method and device applied to game
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN109621431A (en) * 2018-11-30 2019-04-16 网易(杭州)网络有限公司 A kind for the treatment of method and apparatus of game action
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network
CN110152290A (en) * 2018-11-26 2019-08-23 深圳市腾讯信息技术有限公司 Game running method and device, storage medium and electronic device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332B (en) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 Artificial intelligence operating method and device applied to game
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN110152290A (en) * 2018-11-26 2019-08-23 深圳市腾讯信息技术有限公司 Game running method and device, storage medium and electronic device
CN109621431A (en) * 2018-11-30 2019-04-16 网易(杭州)网络有限公司 A kind for the treatment of method and apparatus of game action
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRADLY C. STADIE et al.: "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models", arXiv *
HANS LEHNERT et al.: "Bio-Inspired Deep Reinforcement Learning for Autonomous Navigation of Artificial Agents", IEEE Latin America Transactions *
MAX JADERBERG et al.: "Reinforcement Learning with Unsupervised Auxiliary Tasks", arXiv *
YANG Weiyi et al.: "A Survey of the Sparse Reward Problem in Deep Reinforcement Learning", Computer Science *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112329948B (en) * 2020-11-04 2024-05-10 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN114413910A (en) * 2022-03-31 2022-04-29 中国科学院自动化研究所 Visual target navigation method and device
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN115300910B (en) * 2022-07-15 2023-07-21 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116747521A (en) * 2023-08-17 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office
CN116747521B (en) * 2023-08-17 2023-11-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Also Published As

Publication number Publication date
CN111260039B (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111260039B (en) Video game decision-making method based on auxiliary task learning
CN110399920B (en) Non-complete information game method, device and system based on deep reinforcement learning and storage medium
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
Hernandez-Leal et al. Agent modeling as auxiliary task for deep reinforcement learning
Hasanbeig et al. Deepsynth: Automata synthesis for automatic task segmentation in deep reinforcement learning
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Rosenbloom et al. Towards emotion in sigma: from appraisal to attention
CN111061959A (en) Developer characteristic-based crowd-sourcing software task recommendation method
CN111105442B (en) Switching type target tracking method
CN116702872A (en) Reinforced learning method and device based on offline pre-training state transition transducer model
Chandra et al. Machine learning: a practitioner's approach
Huang et al. Unified curiosity-driven learning with smoothed intrinsic reward estimation
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
Zhou et al. A Real-time algorithm for USV navigation based on deep reinforcement learning
Pandey et al. A Review of Current Perspective and Propensity in Reinforcement Learning (RL) in an Orderly Manner
Gan et al. Noisy agents: Self-supervised exploration by predicting auditory events
Burch A survey of machine learning
Zhao et al. State representation learning with adjacent state consistency loss for deep reinforcement learning
Cao et al. Intrinsic motivation for deep deterministic policy gradient in multi-agent environments
Wenning What Influence Does the AI Strategy Have on Possible Outcomes of Chinese Foreign Policy and Economic Development?
Liu et al. The Guiding Role of Reward Based on Phased Goal in Reinforcement Learning
Chiang et al. Efficient exploration in side-scrolling video games with trajectory replay
Van Otterlo The Logic of Adaptive Behavior-Knowledge Representation and Algorithms for the Markov Decision Process Framework in First-Order Domains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant