CN111260039A - Video game decision-making method based on auxiliary task learning - Google Patents

Video game decision-making method based on auxiliary task learning

Info

Publication number
CN111260039A
Authority
CN
China
Prior art keywords
reward
task
value
network
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010369831.1A
Other languages
Chinese (zh)
Other versions
CN111260039B (en)
Inventor
王轩
张加佳
漆舒汉
曹睿
杜明欣
刘洋
蒋琳
廖清
夏文
李化乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010369831.1A priority Critical patent/CN111260039B/en
Publication of CN111260039A publication Critical patent/CN111260039A/en
Application granted granted Critical
Publication of CN111260039B publication Critical patent/CN111260039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45Controlling the progress of the video game
    • A63F13/46Computing the game score
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video game decision-making method based on auxiliary task learning, which comprises the following steps: S1, constructing a neural network model; S2, starting a multi-process video game environment; S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6; S4, obtaining game experience and updating an experience pool; S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3; S6, saving the neural network model; S7, making decisions with the neural network model in the video game; and S8, ending. The invention has the beneficial effect that state values in a three-dimensional scene, and the agent actions that cause state changes, can be estimated more accurately.

Description

Video game decision-making method based on auxiliary task learning
Technical Field
The invention relates to a video game decision-making method, in particular to a video game decision-making method based on auxiliary task learning.
Background
Video games first appeared in the early 1970s. Since their birth, enabling agents in video games to make decisions automatically through artificial intelligence has been a research hotspot in both industry and academia, with great commercial value. In recent years, the rapid development of deep reinforcement learning methods has provided an effective way to realize this technology. Generally speaking, the quality of a game decision-making technique is judged by the score obtained in the game or by whether the game can be won, and this is also the case for video games.
Artificial intelligence technology is developing rapidly, and machine game playing, as one of its popular research fields, has received wide attention from researchers. In recent years, machine game methods typified by deep reinforcement learning algorithms have developed quickly. On the one hand, the success of Go agents such as AlphaGo marks a major breakthrough of deep reinforcement learning algorithms in the field of complete information machine games. On the other hand, incomplete information machine games have become a new research focus in artificial intelligence because of their high complexity, incomplete information perception and other characteristics.
Although a deep reinforcement learning method based on an intrinsic reward policy optimization algorithm can handle the high-dimensional state space and incomplete information perception of video games well, the intrinsic reward generation module works by quantitatively generating intrinsic reward values according to the uncertainty of unseen states. As the agent gradually explores the environment, its range of activity in the scene grows wider and wider, so the probability of encountering unseen states becomes smaller and smaller. Thus, as the exploration area expands, the intrinsic reward values gradually decrease and cease to be a primary source of reward signals in the experience replay pool. To obtain reward values useful for learning long-term actions from complex three-dimensional video game inputs, relying only on intrinsic rewards to enrich the sources of reward information is therefore not a complete solution. On the other hand, action decision-making by an agent in a three-dimensional video game is generally a step-by-step and rather long sequential process; the agent's action decisions are highly correlated with the current state picture, the actions made by the agent cause the state picture to change, and the state value is estimated by a neural network in deep reinforcement learning.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a video game decision-making method based on auxiliary task learning.
The invention provides a video game decision-making method based on auxiliary task learning, which comprises the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
The invention has the beneficial effect that, with this scheme, the state value in a three-dimensional scene and the agent actions that cause state changes can be estimated more accurately.
Drawings
FIG. 1 is a flow chart of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 2 is a model structure diagram of the reward prediction network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 3 is a model structure diagram of the state value auxiliary network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 4 is a model structure diagram of the action value auxiliary network trained with data samples in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 5 is a schematic diagram of the AIBPO model of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 6 is a diagram of the AIBPO network architecture of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 7 is a diagram of the reinforcement learning model of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 8 is a diagram of the action decision process of the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 9 is a diagram of a Vizdoom platform path-finding scene in the video game decision-making method based on auxiliary task learning according to the present invention.
FIG. 10 is a diagram of a Vizdoom platform survival scene in the video game decision-making method based on auxiliary task learning according to the present invention.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
The method applies deep reinforcement learning combined with an advanced intrinsic reward mechanism to form a decision model and technique with a certain level of intelligence, so that the game agent obtains high scores in the video game; this is the core content of the invention.
As shown in fig. 1, a video game decision-making method based on auxiliary task learning includes the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
From the perspective of enhancing the agent's perception of environmental reward information and the estimation accuracy of reinforcement learning state information, the method proposes and designs three types of auxiliary learning tasks according to the auxiliary task learning mechanism in multi-task learning. Using experience replay, the agent's interaction data is sampled to train the auxiliary tasks, so that deep reinforcement learning and auxiliary task learning are effectively combined. On this basis, the auxiliary task learning mechanism is combined with the intrinsic reward policy optimization algorithm, further improving the performance in three-dimensional scenes of an agent trained with the original reinforcement learning algorithm.
The method mainly provides a feasible scheme for enhancing the agent's perception of environmental reward information and improving the estimation accuracy of reinforcement learning state information.
The auxiliary task learning mechanism is a class of approaches in the field of multi-task learning. Compared with a learning task that optimizes only a single objective, multi-task learning can exploit richer auxiliary information from multiple angles, and an auxiliary task learning mechanism improves the original main task model by means of related tasks. This brings advantages such as faster learning, a more robust algorithm and a better training effect, which match the requirements of training action strategies for agents in three-dimensional scenes. The performance of the agent in a three-dimensional video game scene is improved by designing auxiliary tasks related to its training process. For a deep reinforcement learning algorithm, the key part of agent training is learning and updating the parameters of the policy network that represents the agent's action strategy, with the goal of maximizing the obtained reward value. The goal of auxiliary task learning is therefore to let the agent learn a more efficient and robust action strategy by using the implicit decision information provided by the auxiliary tasks on top of the main action strategy; different auxiliary tasks correspond to different auxiliary strategies that provide decision support information for the agent. A reward feature enhancement framework based on the auxiliary task learning mechanism is proposed and defined as follows:
arg max_π ( E_π[R] + Σ_{c∈C} E_{π_c}[R_c] )   (3-1)
where:
π - the action policy of the agent;
C - the set of auxiliary tasks;
R - the discounted reward value (R_c denoting the discounted reward of auxiliary task c);
π_c - the policy corresponding to auxiliary task c;
Formula (3-1) defines the basic framework of the auxiliary task learning mechanism: the auxiliary tasks provide corresponding auxiliary strategies for the agent in the three-dimensional scene. The basic assumption for selecting auxiliary tasks is that they should be closely related to the main task, or be tasks that can benefit the learning process of the main task, and their design goal is to bias the agent toward learning a more effective action strategy. Based on this assumption, the reward prediction auxiliary task, the state value auxiliary task and the action value auxiliary task are proposed as related tasks of the intrinsic reward policy optimization algorithm, improving the training effect of the deep reinforcement learning algorithm in three-dimensional video games by enhancing the agent's perception of environmental information and the estimation accuracy of state information.
The specific process of constructing the neural network is as follows:
1. A reward feature enhancement algorithm based on auxiliary task learning;
a. A feature enhancement method based on reward prediction;
the reward value is a driving factor for updating the strategy of the intelligent agent by a deep reinforcement learning algorithm, the intelligent agent can directly obtain reward feedback of the environment at the current state in a three-dimensional scene, but the final decision goal of the intelligent agent is to maximize the obtained accumulated reward value, which usually needs to be obtained through a series of continuous actions. Aiming at the characteristic of the deep reinforcement learning algorithm training process in the three-dimensional scene, a reward prediction auxiliary task is designed, and the auxiliary task is based on the assumption that: in order for the algorithm to learn the dynamic changes of the environmental information as a whole, the agent needs to identify the features of the input states that can bring higher reward feedback values at a future time, at which point the agent can tend to make action decisions that bring about long-term maximized rewards. Thus, if the secondary policy enables the agent to better predict the environmental reward value to be harvested at the next time, the secondary policy may help the agent learn a more efficient representation of the state value, and the information it extracts from the state may also facilitate the action policy learning of the primary task.
The method proposes the following reward prediction auxiliary task: the state information and reward information in the experience replay pool D are sampled, and the input frame pictures of consecutive time steps together with the reward information of the next state are used as data samples to train a reward prediction network. The reward prediction network is represented by a simple shallow convolutional neural network, and the model structure of this task is shown in fig. 2.
In the reward prediction auxiliary task, input frame pictures at three consecutive moments are sampled as the input of the network, and a classification category is output after processing by the convolutional layers and the fully connected layer. The classification categories are defined as the categories of the reward value obtained by the agent: positive reward, negative reward and zero reward. The label of the classification task is the one-hot encoding corresponding to the reward information at the next moment sampled from the experience replay pool. Following the multi-class cross entropy loss commonly used for classification tasks, the loss function L_rp of the reward prediction network is defined as:
L_rp = -Σ_i r'_i · log(ŷ_i)   (3-2)
where:
ŷ - the classification category output by the network;
r' - the one-hot encoding of the reward value at the next moment;
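As a concrete illustration, the following is a minimal PyTorch sketch of a reward prediction head of the kind described above, paired with the cross-entropy loss of equation (3-2). The layer sizes, the 84x84 input resolution and the names RewardPredictionNet and reward_prediction_loss are illustrative assumptions, not the exact architecture of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictionNet(nn.Module):
    """Shallow CNN that classifies the next-step reward of 3 stacked frames
    as zero, positive or negative (layer sizes are illustrative)."""
    def __init__(self, in_frames=3, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 9 * 9, n_classes)  # assumes 84x84 input frames

    def forward(self, frames):                      # frames: (B, 3, 84, 84)
        h = self.conv(frames)
        return self.fc(h.flatten(start_dim=1))      # unnormalised class scores

def reward_prediction_loss(logits, reward_class):
    """Multi-class cross entropy of equation (3-2); reward_class holds the
    index of the one-hot label (illustrative mapping: 0 zero, 1 positive, 2 negative)."""
    return F.cross_entropy(logits, reward_class)
```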
Because the reward values in the experience replay pool of the deep reinforcement learning algorithm include positive, negative and zero rewards in different proportions, and in a three-dimensional video game scene most of the experience replay samples generated by the agent's interaction carry zero reward, directly sampling from the replay pool to train the reward prediction network would make a large fraction of the training samples zero-reward ones. Therefore, a skewed sampling strategy is used when sampling data samples from the experience replay pool: non-zero-reward and zero-reward transitions are given the same sampling probability, so that the proportions of the different types of reward values in the training samples do not differ too markedly. By designing the reward prediction auxiliary task, the algorithm pays attention during training to the relation between the recent input state pictures and the imminent reward information, so that the agent recognizes state pictures that may bring reward signals in the three-dimensional scene and tends to take actions that bring reward value feedback.
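The skewed sampling described above can be implemented, for example, by drawing half of each training batch from zero-reward transitions and half from non-zero-reward transitions. The helper below is a hypothetical sketch under that assumption, not code from the patent; it assumes each pool entry is a (frames, reward) pair.

```python
import random

def sample_reward_balanced(replay_pool, batch_size):
    """Draw a batch in which zero-reward and non-zero-reward transitions
    have equal sampling probability (sketch of the skewed sampling strategy)."""
    zero = [t for t in replay_pool if t[1] == 0]
    nonzero = [t for t in replay_pool if t[1] != 0]
    half = batch_size // 2
    batch = random.choices(nonzero, k=half) if nonzero else []
    batch += random.choices(zero, k=batch_size - len(batch)) if zero else []
    random.shuffle(batch)
    return batch
```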
b. A state value based feature enhancement method;
the advantage function estimation value stored in the experience replay pool is used for calculation when strategy updating is carried out in the internal reward strategy optimization algorithm, the calculation process of the advantage estimation value needs to use a reward value and a state value, wherein the reward value is derived from an internal reward signal generated by the internal reward generation module and an external reward signal fed back by the environment, the state value is derived from value estimation of the current input state by a value network of the internal reward strategy optimization algorithm, and the value estimation in deep reinforcement learning is biased, so that if the auxiliary strategy enables the deep reinforcement learning algorithm to tend to do more accurate state value estimation output, the calculated advantage estimation value is more accurate, the training process of the intelligent agent is more stable by the auxiliary strategy, and learning of the main action strategy is more efficient. Based on such considerations, the following state value assistance tasks are proposed herein: the internal reward strategy optimization algorithm intelligent agent stores the environment interaction data tuples in the experience playback pool in the training process and trains the value of the historical state in the experience pool for the algorithm
Figure 121265DEST_PATH_IMAGE011
Sampling to obtain input frame picture of continuous time and next timeAs a data sample to train a state value assistance task. The state value network is composed of a shallow convolutional neural network and a Long-Short Term Memory (LSTM), and the model structure of the task is shown in fig. 3.
In the state value network shown in fig. 3, input frame pictures at three consecutive moments are sampled as the input of the network, and the state value of the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer. The label of this regression prediction task is the state value corresponding to the state picture at the next moment sampled from the experience replay pool. Following the mean square error loss commonly used for regression tasks, the loss function L_sv of the state value network is defined as:
L_sv = (V_target - V_out)^2 + λ_v·||θ||^2   (3-3)
where:
||θ||^2 - the parameter regularization term;
V_out - the state value output by the network;
V_target - the target state value;
λ_v - the regularization term penalty factor;
The target state value V_target in the state value auxiliary task is sampled from the experience replay pool together with the consecutive input frame pictures, which is equivalent to directly using the agent's interaction data samples. Since the state value is a value output by the value network of the intrinsic reward policy optimization algorithm and varies in magnitude, samples are drawn directly from the experience replay pool with a Gaussian strategy, without considering the proportions of state values of different magnitudes. By designing the state value auxiliary task, the algorithm replays the state value estimates of recent training batches during the agent's training, so that the value network tends to estimate the state value corresponding to the input frame picture at each moment more accurately; the advantage function value is therefore computed more accurately and the agent's action strategy learning process becomes more stable.
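A minimal sketch of a CNN + LSTM state value head trained with the mean square error loss of equation (3-3) is given below. The hidden sizes, the 84x84 single-channel frames and the names StateValueNet and state_value_loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StateValueNet(nn.Module):
    """Shallow CNN followed by an LSTM that regresses the state value of the
    next frame from three consecutive input frames (sizes are illustrative)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 8, 4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, 2), nn.ReLU(), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 * 9 * 9, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):                      # frames: (B, 3, 84, 84)
        B, T, H, W = frames.shape
        feats = self.conv(frames.reshape(B * T, 1, H, W)).reshape(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1]).squeeze(-1)    # predicted next-step value

def state_value_loss(v_pred, v_target, params, reg_coef=1e-4):
    """Mean square error plus an L2 parameter regulariser, as in equation (3-3)."""
    reg = sum(p.pow(2).sum() for p in params)
    return torch.mean((v_target - v_pred) ** 2) + reg_coef * reg
```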
c. A feature enhancement method based on action value;
During the training of the agent under the intrinsic reward policy optimization algorithm, the policy network outputs the agent's action probability distribution, and the algorithm samples an action from this distribution, which is taken as the agent's action output for the corresponding input state. The agent's action decisions in the environment cause its environmental state to change, and reward feedback from the environment is obtained as a result. The action value represents the immediate reward feedback that the agent can obtain by taking a certain action in the current state; if an action can bring higher immediate reward feedback, it has a higher action value. Thus, if the auxiliary strategy can bias a deep reinforcement learning agent toward actions with greater action value, the agent will more easily learn, over the whole training process, a main action strategy that obtains reward feedback efficiently. Accordingly, the following action value auxiliary task is proposed: the action values stored during the training of the intrinsic reward policy optimization agent are sampled, and the input frame pictures at consecutive moments together with the temporal-difference action value are used as data samples to train the action value auxiliary task. The action value network consists of a shallow convolutional neural network and a long short-term memory network, and the model structure of this task is shown in fig. 4.
In the action value network, input frame pictures at three consecutive moments are sampled as the input of the network, and the action value of the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer. The label of this regression prediction task is the temporal-difference action value sampled from the experience replay pool, i.e. it is computed from the immediate reward value and the discounted state value in the experience replay pool. Following the mean square error loss commonly used for regression tasks, the loss function L_av of the action value network is defined as:
L_av = (Q_TD - Q_out)^2 + λ_a·||θ||^2   (3-4)
where:
λ_a - the regularization term penalty factor;
||θ||^2 - the parameter regularization term;
Q_out - the action value output by the network;
Q_TD - the temporal-difference action value;
The temporal-difference action value Q_TD in the action value auxiliary task is computed from data samples in the experience replay pool. Similar to the state value auxiliary task, the magnitude of the temporal-difference action value is itself determined by the reward value and the state value, so the proportions of positive, negative and zero values do not need to be considered. By designing the action value auxiliary task, the policy network of the algorithm is driven to output the action value corresponding to each input frame picture more accurately, so that the agent tends to take actions that bring larger reward values.
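One way to form the temporal-difference action value label used by this auxiliary task, together with the loss of equation (3-4), is sketched below; the discount factor gamma, the regularization coefficient and the function names are illustrative assumptions.

```python
import torch

def td_action_value(reward, next_state_value, gamma=0.99):
    """Temporal-difference action value label: immediate reward plus the
    discounted state value of the next state, both read from the replay pool."""
    return reward + gamma * next_state_value

def action_value_loss(q_pred, q_td, params, reg_coef=1e-4):
    """Mean square error between the predicted action value and the TD label,
    plus an L2 parameter regulariser, as in equation (3-4)."""
    reg = sum(p.pow(2).sum() for p in params)
    return torch.mean((q_td - q_pred) ** 2) + reg_coef * reg
```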
2. An intrinsic reward policy optimization algorithm combined with auxiliary tasks;
The learning and updating of the policy network parameters of a deep reinforcement learning algorithm determines the agent's action decisions in the three-dimensional scene, and the agent's main action strategy is determined by this network. The auxiliary tasks are proposed to provide auxiliary decision support information for the agent's actions: through the different auxiliary tasks, the agent takes the auxiliary strategies into account while optimizing the main action strategy, and thus learns a more effective and robust action strategy for the three-dimensional scene. The intrinsic reward policy optimization algorithm proposed by the method can be used to learn the agent's action strategy, so the auxiliary tasks are combined with it in order to provide auxiliary decision information for the intrinsic reward policy optimization agent. The key to combining the two is the experience replay pool of deep reinforcement learning: training the auxiliary tasks requires data samples containing the corresponding state, reward and action information, and exactly these samples are stored during the training of the intrinsic reward policy optimization agent. Therefore, the auxiliary tasks and the intrinsic reward policy optimization algorithm can be combined through the experience replay pool, namely by adding, to the policy update procedure of the intrinsic reward policy optimization algorithm, a step that updates the network parameters of the auxiliary tasks according to experience replay samples. In this way, the intrinsic reward policy optimization algorithm combined with auxiliary tasks (AIBPO) is defined as:
L_AIBPO = L_IBPO + λ_r·L_rp + λ_v·L_sv + λ_a·L_av   (3-5)
where:
λ_r - the weight parameter of the reward prediction task;
λ_v - the weight parameter of the state value task;
λ_a - the weight parameter of the action value task;
L_rp - the loss function of the reward prediction task;
L_sv - the loss function of the state value task;
L_av - the loss function of the action value task;
L_IBPO - the loss function of the intrinsic reward policy optimization algorithm;
the formula (3-5) provides an overall framework of the AIBPO algorithm, different auxiliary tasks are introduced to enable the IBPO algorithm to learn related information provided by different auxiliary tasks in the training process, the related information is related to strategy updating of the deep reinforcement learning algorithm or information perception in a scene, and the expression effect of the strategy optimization algorithm in a three-dimensional scene is improved from different angles. In addition, the influence of the auxiliary task on the main task is determined by setting a weight parameter for the auxiliary task loss function. The overall model structure of the AIBPO algorithm is shown in fig. 5.
When training the agent, the deep reinforcement learning algorithm stores the corresponding reinforcement learning state information in the experience replay pool in the form of tuples. The information samples obtained from the interaction between the intrinsic-reward-policy-optimization-based agent and the environment during training are stored in the corresponding experience replay pool, and the data in this pool are used as training samples for the auxiliary tasks under different sampling strategies; this makes effective use of the characteristics of the deep reinforcement learning algorithm and provides sufficient training samples for the auxiliary tasks.
3. Constructing a network model;
the method is improved on the basis of IBPO (intuitive Based policy optimization), a reward feature enhancement algorithm Based on auxiliary task learning is added on the basis of IBPO, and the main body structure is shown in figure 6. The neural network built by using the deep learning framework pytorch is integrally divided into 3 parts.
a. A policy optimization part, which comprises an Actor network and a Critic network that share part of their hidden layer neurons. In order to give the extrinsic reward and the intrinsic reward independent supervision information, the Critic network in the policy optimization network structure has two independent reward output heads; the network structure is shown in Table 3-1.
TABLE 3-1 policy optimization Algorithm network architecture
b. An intrinsic reward part, in which the target mapping network has the same network structure as the prediction network; the network structure is shown in Table 3-2.
TABLE 3-2 intrinsic reward model network architecture
c. An auxiliary task part, which improves the performance of the algorithm in the three-dimensional video game from the perspective of enhancing the agent's environment perception and state estimation capabilities. The data samples of the three auxiliary tasks come from the experience replay data stored when the agent trained by the deep reinforcement learning algorithm interacts with the environment. The networks of the reward prediction, state value and action value auxiliary tasks consist of a shallow convolutional neural network and a long short-term memory network. The network structure of the auxiliary learning tasks is shown in Table 3-3.
TABLE 3-3 auxiliary task learning network architecture
Starting the multi-process video game environment includes: reinforcement learning needs a large amount of data to optimize the network, and a single process cannot meet the demand for training data, so as many game processes as possible are run with PyTorch's multiprocessing module when computing resources are sufficient. Using the multiprocessing module, CUDA tensor sharing, deadlock avoidance, buffer reuse when sending through queues, and asynchronous multi-process training can all be achieved.
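A minimal sketch of launching several game processes with torch.multiprocessing and collecting their transitions through a shared queue is shown below. DummyEnv, the worker layout, the process count and the step counts are illustrative assumptions rather than the patent's exact implementation.

```python
import random
import torch.multiprocessing as mp

class DummyEnv:
    """Stand-in for a Vizdoom environment wrapper (illustrative only)."""
    def reset(self):
        return [0.0] * 4
    def sample_action(self):
        return random.randrange(3)
    def step(self, action):
        return [0.0] * 4, random.choice([0.0, 1.0]), random.random() < 0.01

def env_worker(worker_id, experience_queue, n_steps):
    """One game process: interacts with its own environment copy and pushes
    (s, a, r, s') tuples into the shared queue."""
    env = DummyEnv()
    state = env.reset()
    for _ in range(n_steps):
        action = env.sample_action()
        next_state, reward, done = env.step(action)
        experience_queue.put((state, action, reward, next_state))
        state = env.reset() if done else next_state

if __name__ == "__main__":
    mp.set_start_method("spawn")                 # safe start method with CUDA
    queue = mp.Queue()
    workers = [mp.Process(target=env_worker, args=(i, queue, 1000), daemon=True)
               for i in range(4)]
    for w in workers:
        w.start()
    experience_pool = [queue.get() for _ in range(2000)]   # drain transitions
    print(len(experience_pool), "transitions collected")
```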
Obtaining game experience and updating the experience pool specifically includes: the goal of winning the game can broadly be regarded as seeking a reasonable behaviour a in the environment state s, and what the method does is to seek corrections using the environment feedback r. Here, the state s is the state of the individual and the relevant area when the action occurs; the behaviour a refers to the corresponding action made by the agent in state s; and the feedback r refers to the evaluation reward given by the environment for that action. In the reinforcement learning problem, the agent can change the environment state s through action a; the reinforcement learning method changes the behaviour a by using the evaluation feedback r; and the behaviour a and the state s together determine the corresponding feedback value r. The flow is shown in fig. 7. All that is required in this step is to collect the (s, a, r, s') quadruples and store them in the experience pool, providing a data basis for model training.
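A minimal experience pool for the (s, a, r, s') quadruples mentioned above can look like the following; the capacity and the class name ExperiencePool are illustrative choices.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity pool of (s, a, r, s') quadruples with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```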
Inputting the experience into the network and updating the network parameters specifically includes: the advantage values, the intrinsic reward values and the auxiliary task values are calculated from the data acquired in the previous step. The network is then updated with the loss L_AIBPO of equation (3-5): gradients are computed and the network parameters are updated by gradient steps, so that the network is continuously optimized and the agent obtains higher returns.
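A hedged sketch of this parameter update, using the combined loss shown earlier with a standard optimizer (in practice the return-maximizing gradient ascent is implemented as gradient descent on the loss), is given below; the gradient clipping threshold is an illustrative safeguard, not part of the patent.

```python
import torch

def update_step(model, optimizer, total_loss):
    """One gradient update of the network parameters from the combined loss."""
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # optional safeguard
    optimizer.step()
```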
Making decisions with the model in the video game specifically includes: as shown in fig. 8, after the game starts, the agent obtains each game frame, normalizes it, and feeds 4 consecutive frames into the trained model as input. The model returns the action that can obtain a high return, the agent executes that action in the game, a new game picture is then obtained, and the above process is repeated until the game ends.
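The decision loop of fig. 8 can be sketched as follows. The preprocessing (division by 255), the frame stack length of 4, the placeholder environment interface (env.reset / env.step) and the trained policy network standing in for the saved model are all assumptions for illustration.

```python
from collections import deque
import numpy as np
import torch

def normalize(frame):
    """Scale a raw game frame of uint8 pixels to the range [0, 1]."""
    return np.asarray(frame, dtype=np.float32) / 255.0

@torch.no_grad()
def play_episode(env, policy_net, device="cpu"):
    """Run one game with the trained model: stack 4 consecutive normalized
    frames, pick the highest-scoring action, and repeat until the game ends."""
    frames = deque(maxlen=4)
    state, done = env.reset(), False
    while not done:
        frames.append(normalize(state))
        while len(frames) < 4:                 # pad at the start of the game
            frames.append(frames[-1])
        obs = torch.from_numpy(np.stack(frames)).unsqueeze(0).to(device)
        action = policy_net(obs).argmax(dim=-1).item()
        state, reward, done = env.step(action)
```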
As shown in figs. 9-10, the invention can be applied to various video games; Vizdoom is used as an example to show the advantages of the technique. The Vizdoom platform is a first-person three-dimensional video game, and the behaviour of the agent in its scenes is similar to that of a subject in the real world: it receives visual signals and then makes action decisions. As a test platform for current mainstream deep reinforcement learning algorithms, the Vizdoom platform provides interfaces for receiving action inputs and feeding back reward signals, and simulates the environment in the reinforcement learning model. The Vizdoom platform provides comprehensive testing capability for training agents to explore three-dimensional environments; path-finding, survival and exploration scenes based on the platform are used.
The application and development environment of the method is shown in table 1:
TABLE 1 Experimental development Environment
The present invention is compared to several prior methods:
(1) IBPO: an Intrinsic Based Policy Optimization algorithm.
(2) DRQN: Deep Recurrent Q-Networks, which adds a recurrent layer after the convolutional layers to handle partially observable states.
(3) DFP: Direct Future Prediction, which won the Full Deathmatch track of the Visual Doom AI Competition 2016.
(4) A3C: Asynchronous Advantage Actor-Critic, which runs Actor-Critic training in multiple threads in parallel, making effective use of computer resources and improving training efficiency.
(5) Rainbow: an algorithm that integrates several common DQN improvements to raise the performance of the agent.
The performance of the AIBPO algorithm is compared as follows:
In the Vizdoom survival scenario, the survival time step and the number of picked-up health packs are used as evaluation indexes, equivalent to the game score in deep reinforcement learning. The survival time step is defined as the average, over 100 interactive verifications in the scene after the algorithm has converged, of the maximum number of action steps the agent survives. The number of picked-up health packs is defined as the average, over 100 interactive verifications in the scene after the algorithm has converged, of the number of health packs picked up by the agent. The experimental results of the survival scenario are shown in Table 2.
Table 2 Survival scenario experimental results (comparison I)
As can be seen from Table 2, the IBPO algorithm lags behind the DFP algorithm on both the survival time step and the health pack indexes, and is only slightly stronger than the DRQN algorithm on the health pack index. This is because the IBPO algorithm mainly addresses the sparse reward problem in three-dimensional scenes, whereas reward information is plentiful in most cases in the survival scenario, so the IBPO algorithm is not prominent there. The AIBPO algorithm introduces the three types of auxiliary tasks; as the analysis shows, the reward-related auxiliary tasks mainly enhance the agent's reward perception and state estimation capabilities during training. Because the maze characteristics of the survival scenario require the agent to plan over the long term based on historical information, and the rich reward information in the survival scenario can be exploited by the auxiliary tasks, the AIBPO algorithm outperforms the IBPO and DRQN algorithms in the survival scenario experiment, approaches the DFP algorithm on the survival time step index, and surpasses the DFP algorithm on the health pack index. Since deep reinforcement learning algorithms perform differently on different types of task scenarios, and the IBPO algorithm proposed herein mainly addresses exploration in sparse-reward three-dimensional scenes, the related path-finding scenario experiments show that IBPO improves on the DFP algorithm in that scene. The AIBPO algorithm proposed herein mainly compensates, through the auxiliary task mechanism, for the IBPO algorithm's weakness in perceiving environmental reward information; its main body is still the IBPO algorithm, and the experimental results of the survival scenario show that the auxiliary task learning mechanism, as an auxiliary means, brings a considerable improvement to the IBPO algorithm. Considering that deep reinforcement learning algorithms perform differently across task scenarios, the gap between the AIBPO and DFP algorithms on the survival time step index in the survival scenario is small, and it is difficult for existing algorithms to comprehensively surpass the DFP algorithm.
Although there is still some performance gap between the AIBPO algorithm and the DFP algorithm, the AIBPO algorithm is a large improvement over traditional deep reinforcement learning algorithms; the experimental comparison with the A3C algorithm and the Rainbow algorithm is shown in Table 3.
Table 3 Survival scenario experimental results (comparison II)
As can be seen from Table 3, the A3C algorithm and the Rainbow algorithm are existing algorithms with good performance among Actor-Critic-based and value-iteration-based deep reinforcement learning algorithms respectively, yet the agents they train perform worse in the survival scenario than the AIBPO algorithm. The reason is that the A3C and Rainbow algorithms lack special handling of the three-dimensional scene reinforcement learning task, so in a three-dimensional scene such as the survival scenario, which combines maze exploration with reward collection, they find it hard to reach the performance they achieve in Atari video games and continuous control tasks.
Taking the above analysis together with the performance of the AIBPO agent combined with the auxiliary task learning mechanism in the survival scenario, the performance is obviously improved: the survival time steps and the ability to pick up health packs clearly exceed those of traditional deep reinforcement learning algorithms and reach a level close to that of the competition agents.
The auxiliary task comparative experiment was as follows:
the AIBPO algorithm comprises reward prediction tasks
Figure 491679DEST_PATH_IMAGE027
State value task
Figure 908885DEST_PATH_IMAGE027
And action value tasks
Figure 673579DEST_PATH_IMAGE040
In order to reflect the improvement effect of each auxiliary task on the original strategy optimization algorithm, the method is characterized in that different types of auxiliary tasks are independently added on the basis of the IBPO algorithm to compare and evaluate the effects of the three types of tasks. The experiments for evaluating each auxiliary task are shown in Table 4
Table 4 Comparison of auxiliary task experimental results
As can be seen from Table 4, the agents trained with the baseline algorithm combined with each of the different auxiliary tasks all improve in the survival scenario, and each auxiliary task improves the baseline algorithm to a different degree. The reward prediction auxiliary task brings the most obvious improvement on the health pack index, while the action value auxiliary task achieves the best result of the three on the survival time step index. The experimental results show that the IBPO algorithm combined with each auxiliary task greatly improves the agent's performance in the three-dimensional scene over the original, so that the agent survives much longer in the survival scenario. Overall, the AIBPO algorithm combining the three auxiliary tasks improves the related indexes even more obviously, which shows that the cooperation of the three auxiliary tasks can further improve the performance of the trained agent in the three-dimensional video game.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of them shall be considered as falling within the protection scope of the invention.

Claims (8)

1. A video game decision-making method based on auxiliary task learning, characterized by comprising the following steps:
S1, constructing a neural network model;
S2, starting a multi-process video game environment;
S3, judging whether the specified number of rounds has been run; if not, proceeding to step S4, and if so, proceeding to step S6;
S4, obtaining game experience and updating an experience pool;
S5, inputting the experience into the neural network model, updating the parameters of the neural network model, and returning to step S3;
S6, saving the neural network model;
S7, making decisions with the neural network model in the video game;
and S8, ending.
2. The video game decision-making method based on auxiliary task learning as claimed in claim 1, wherein step S1 comprises the following substeps:
S101, a reward feature enhancement method based on auxiliary task learning;
S102, an intrinsic reward policy optimization method combined with auxiliary task learning;
S103, constructing a neural network model.
3. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein the reward feature enhancement method based on auxiliary task learning comprises the following framework definition:
arg max_π ( E_π[R] + Σ_{c∈C} E_{π_c}[R_c] )   (3-1)
in formula (3-1):
π - the action policy of the agent;
C - the set of auxiliary tasks;
R - the discounted reward value (R_c denoting the discounted reward of auxiliary task c);
π_c - the policy corresponding to auxiliary task c;
formula (3-1) defines the basic framework of the auxiliary task learning mechanism, in which auxiliary task learning provides corresponding auxiliary strategies for the agent in the three-dimensional scene.
4. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein step S101 comprises the following substeps:
S1011, a feature enhancement method based on reward prediction;
S1012, a feature enhancement method based on state value;
and S1013, a feature enhancement method based on action value.
5. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1011 comprises: using the state information and reward information in the experience replay pool, sampling the input frame picture information of consecutive time steps and the reward information of the next state as data samples to train a reward prediction network, the reward prediction network being represented by a shallow convolutional neural network; in the reward prediction network, input frame pictures at three consecutive moments are sampled as the input of the network, and a classification category is output after processing by the convolutional layers and the fully connected layer, the classification categories being defined as categories of the reward value obtained by the agent and comprising positive reward, negative reward and zero reward; the label of the classification task is the one-hot encoding corresponding to the reward information at the next moment sampled from the experience replay pool; and, following the multi-class cross entropy loss function commonly used for classification tasks, the loss function L_rp of the reward prediction network is defined as:
L_rp = -Σ_i r'_i · log(ŷ_i)   (3-2)
in formula (3-2):
ŷ - the classification category output by the network;
r' - the one-hot encoding of the reward value at the next moment.
6. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1012 comprises: the intrinsic reward policy optimization agent stores environment interaction data tuples in an experience replay pool during training; the historical state values in the pool are sampled, and the input frame pictures at consecutive moments together with the state value at the next moment are used as data samples to train a state value auxiliary network, the state value auxiliary network mainly consisting of a shallow convolutional neural network and a long short-term memory network; in the state value auxiliary network, input frame pictures at three consecutive moments are sampled as the input of the network, and the state value corresponding to the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer; the label of the regression prediction task is the state value corresponding to the state picture at the next moment sampled from the experience replay pool; and, following the mean square error loss function commonly used for regression tasks, the loss function L_sv of the state value network is defined as:
L_sv = (V_target - V_out)^2 + λ_v·||θ||^2   (3-3)
in formula (3-3):
||θ||^2 - the parameter regularization term;
V_out - the state value output by the network;
V_target - the target state value;
λ_v - the regularization term penalty factor.
7. The video game decision-making method based on auxiliary task learning as claimed in claim 4, wherein step S1013 comprises: the action values stored during the training of the intrinsic reward policy optimization agent are sampled, and the input frame pictures at consecutive moments together with the temporal-difference action value are used as data samples to train an action value auxiliary network, the action value auxiliary network mainly consisting of a shallow convolutional neural network and a long short-term memory network; in the action value auxiliary network, input frame pictures at three consecutive moments are sampled as the input of the network, and the action value corresponding to the frame picture at the next moment is predicted after processing by the convolutional layers, the fully connected layer and the LSTM layer; the label of the regression prediction task is the temporal-difference action value sampled from the experience replay pool, which is computed from the immediate reward value and the discounted state value in the experience replay pool; and, following the mean square error loss function commonly used for regression tasks, the loss function L_av of the action value network is defined as:
L_av = (Q_TD - Q_out)^2 + λ_a·||θ||^2   (3-4)
in formula (3-4):
λ_a - the regularization term penalty factor;
||θ||^2 - the parameter regularization term;
Q_out - the action value output by the network;
Q_TD - the temporal-difference action value.
8. The video game decision-making method based on auxiliary task learning as claimed in claim 2, wherein step S102 comprises: adding, to the policy update procedure of the intrinsic reward policy optimization method, a step of updating the network parameters of the auxiliary tasks according to experience replay samples, and defining the intrinsic reward policy optimization method combined with auxiliary task learning as:
L_AIBPO = L_IBPO + λ_r·L_rp + λ_v·L_sv + λ_a·L_av   (3-5)
in formula (3-5):
λ_r - the weight parameter of the reward prediction task;
λ_v - the weight parameter of the state value task;
λ_a - the weight parameter of the action value task;
L_rp - the loss function of the reward prediction task;
L_sv - the loss function of the state value task;
L_av - the loss function of the action value task;
L_IBPO - the loss function of the intrinsic reward policy optimization algorithm.
CN202010369831.1A 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning Active CN111260039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010369831.1A CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010369831.1A CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Publications (2)

Publication Number Publication Date
CN111260039A true CN111260039A (en) 2020-06-09
CN111260039B (en) 2020-08-07

Family

ID=70950007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010369831.1A Active CN111260039B (en) 2020-05-06 2020-05-06 Video game decision-making method based on auxiliary task learning

Country Status (1)

Country Link
CN (1) CN111260039B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN114413910A (en) * 2022-03-31 2022-04-29 中国科学院自动化研究所 Visual target navigation method and device
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116747521A (en) * 2023-08-17 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN106422332B (en) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 Artificial intelligence operating method and device applied to game
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN109621431A (en) * 2018-11-30 2019-04-16 网易(杭州)网络有限公司 A kind for the treatment of method and apparatus of game action
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network
CN110152290A (en) * 2018-11-26 2019-08-23 深圳市腾讯信息技术有限公司 Game running method and device, storage medium and electronic device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332B (en) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 Artificial intelligence operating method and device applied to game
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN110152290A (en) * 2018-11-26 2019-08-23 深圳市腾讯信息技术有限公司 Game running method and device, storage medium and electronic device
CN109621431A (en) * 2018-11-30 2019-04-16 网易(杭州)网络有限公司 A kind for the treatment of method and apparatus of game action
CN109540150A (en) * 2018-12-26 2019-03-29 北京化工大学 One kind being applied to multi-robots Path Planning Method under harmful influence environment
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRADLY C. STADIE et al.: "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models", arXiv *
HANS LEHNERT et al.: "Bio-Inspired Deep Reinforcement Learning for Autonomous Navigation of Artificial Agents", IEEE Latin America Transactions *
MAX JADERBERG et al.: "Reinforcement Learning with Unsupervised Auxiliary Tasks", arXiv *
YANG Weiyi et al.: "A Survey of the Sparse Reward Problem in Deep Reinforcement Learning", Computer Science *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112329948B (en) * 2020-11-04 2024-05-10 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN114413910A (en) * 2022-03-31 2022-04-29 中国科学院自动化研究所 Visual target navigation method and device
CN115300910A (en) * 2022-07-15 2022-11-08 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN115300910B (en) * 2022-07-15 2023-07-21 浙江大学 Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN116747521A (en) * 2023-08-17 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office
CN116747521B (en) * 2023-08-17 2023-11-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Also Published As

Publication number Publication date
CN111260039B (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111260039B (en) Video game decision-making method based on auxiliary task learning
CN110399920B (en) Non-complete information game method, device and system based on deep reinforcement learning and storage medium
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
Hernandez-Leal et al. Agent modeling as auxiliary task for deep reinforcement learning
Hasanbeig et al. Deepsynth: Automata synthesis for automatic task segmentation in deep reinforcement learning
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Rosenbloom et al. Towards emotion in sigma: from appraisal to attention
CN111061959A (en) Developer characteristic-based crowd-sourcing software task recommendation method
CN111105442B (en) Switching type target tracking method
CN116702872A (en) Reinforced learning method and device based on offline pre-training state transition transducer model
Chandra et al. Machine learning: a practitioner's approach
Huang et al. Unified curiosity-driven learning with smoothed intrinsic reward estimation
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
Zhou et al. A Real-time algorithm for USV navigation based on deep reinforcement learning
Pandey et al. A Review of Current Perspective and Propensity in Reinforcement Learning (RL) in an Orderly Manner
Gan et al. Noisy agents: Self-supervised exploration by predicting auditory events
Burch A survey of machine learning
Zhao et al. State representation learning with adjacent state consistency loss for deep reinforcement learning
Cao et al. Intrinsic motivation for deep deterministic policy gradient in multi-agent environments
Wenning What Influence Does the AI Strategy Have on Possible Outcomes of Chinese Foreign Policy and Economic Development?
Liu et al. The Guiding Role of Reward Based on Phased Goal in Reinforcement Learning
Chiang et al. Efficient exploration in side-scrolling video games with trajectory replay
Van Otterlo The Logic of Adaptive Behavior-Knowledge Representation and Algorithms for the Markov Decision Process Framework in First-Order Domains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant