CN117150927A - Deep reinforcement learning exploration method and system based on extreme novelty search

Deep reinforcement learning exploration method and system based on extreme novelty search

Info

Publication number
CN117150927A
CN117150927A (application CN202311264592.3A)
Authority
CN
China
Prior art keywords
data
exploration
reinforcement learning
updating
extreme
Prior art date
Legal status
Granted
Application number
CN202311264592.3A
Other languages
Chinese (zh)
Other versions
CN117150927B (en)
Inventor
路圣汉
Current Assignee
Beijing Hanbo Technology Co ltd
Original Assignee
Beijing Hanbo Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Hanbo Technology Co ltd filed Critical Beijing Hanbo Technology Co ltd
Priority to CN202311264592.3A priority Critical patent/CN117150927B/en
Publication of CN117150927A publication Critical patent/CN117150927A/en
Application granted granted Critical
Publication of CN117150927B publication Critical patent/CN117150927B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; database structures therefor; file system structures therefor, of structured data, e.g. relational data
    • G06F16/23 - Updating
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning

Abstract

The invention relates to the technical field of deep reinforcement learning and provides a deep reinforcement learning exploration method and system based on extreme novelty search. The method comprises the following steps: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data; calculating an exploration reward based on the conditional state probability and the action probability from the collected data; correcting the exploration reward according to the extreme novel state preference and updating the collected data; updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient; and clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process. The invention solves the problem of low exploration efficiency of existing reinforcement learning agents in the Markov decision process.

Description

Deep reinforcement learning exploration method and system based on extreme novelty search
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to a deep reinforcement learning exploration method and system based on extreme novelty search.
Background
With the rapid development of deep reinforcement learning, the technology has achieved success in many fields; from Go to video games, deep reinforcement learning has demonstrated tremendous potential. How to explore effectively in environments with sparse reward signals is one of the key problems faced by model-free reinforcement learning. Although researchers have proposed many approaches to this problem, most of them are designed for reinforcement learning tasks under the singleton (single-instance) setting.
Under the singleton setting, the specific task the agent faces is the same in every episode and can be modeled as a Markov Decision Process (MDP). Recent studies have shown that agents trained under the singleton setting tend to overfit the task and may fail to generalize even to slightly different tasks. To address this generalization problem, researchers have proposed the procedurally-generated setting, under which the agent faces a large collection of similar tasks. At the beginning of each episode, the agent interacts with one task drawn from the collection. For example, in a navigation task the agent is required to find a target in a maze, but the maze layout is regenerated before each episode starts. Reinforcement learning tasks under the procedurally-generated setting are modeled as a Contextual Markov Decision Process (C-MDP).
Exploration in a C-MDP is very challenging. Extensive research has shown that the count-based exploration methods widely used in the singleton setting, as well as their variants, are not suitable for exploration in a C-MDP, because the agent is unlikely to encounter the same state in different episodes.
Disclosure of Invention
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search to solve the problem of low exploration efficiency of existing reinforcement learning agents in the Markov decision process.
The invention provides a deep reinforcement learning exploration method based on extreme novelty search, which comprises the following steps:
initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme novel state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient;
and clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
According to the deep reinforcement learning exploration method based on extreme novelty search, the model parameters and the simulation environment are initialized, the preset agent model interacts with the simulation environment, and data are collected, which comprises:
randomly sampling a simulation environment from the environment set through a simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs a related action;
the simulator takes the initial state and the output action as input and outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
the policy network of the agent model takes the next-moment state as input and outputs a new action;
and repeating the above steps until the termination condition is reached, the simulator then randomly sampling a simulation environment from the environment set for the agent model's next interaction.
According to the deep reinforcement learning exploration method based on extreme novelty search, the exploration reward based on the conditional state probability and the action probability is calculated from the collected data, which comprises:
pulling data from the data list of collected data;
calculating the exploration reward based on the conditional state probability and the action probability from the pulled data.
According to the deep reinforcement learning exploration method based on extreme novelty search, the exploration reward is corrected according to the extreme novel state preference and the collected data are updated, which comprises:
correcting the exploration reward based on the extreme novel state preference;
updating the reward obtained by the agent model in the corresponding collected data, until all data in the data collection list have been updated.
According to the deep reinforcement learning exploration method based on extreme novelty search, the parameters of the agent model are updated based on the updated collected data and the policy network is updated with a policy gradient, which comprises:
pulling data from the data list of collected data;
updating the parameters of the policy network and the value network of the agent model with all data in the updated data list;
and updating the parameters of the agent model from the updated data using an actor-critic reinforcement learning algorithm.
According to the deep reinforcement learning exploration method based on extreme novelty search, the data collection list is cleared, the model parameters are saved, and the agent model is continuously trained until the iterations are complete so as to perform deep reinforcement learning exploration of a Markov decision process, which comprises:
clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
continuing to train the agent model until the iterations are complete.
The invention also provides a deep reinforcement learning exploration system based on extreme novelty search, which comprises:
the data collection module, used for initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
the exploration reward calculation module, used for calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
the reward updating module, used for correcting the exploration reward according to the extreme novel state preference and updating the collected data;
the model parameter updating module, used for updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient;
and the iteration module, used for clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of the Markov decision process.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on the extreme novelty search according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep reinforcement learning exploration method based on extreme novelty search as described in any of the above.
The present invention also provides a computer program product comprising a computer program which when executed by a processor implements a deep reinforcement learning exploration method based on an extreme novelty search as described in any of the above.
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search in which the action entropy is taken into account in the exploration objective, rather than being used only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 2 is a second flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 3 is a third flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 4 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 5 is a schematic flow chart diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 6 is a flowchart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 7 is a schematic diagram of the modular connection of a deep reinforcement learning exploration system based on extreme novelty searching provided by the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
110: a data collection module; 120: a search reward calculation module; 130: a reward updating module; 140: a model parameter updating module; 150: an iteration module;
810: a processor; 820: a communication interface; 830: a memory; 840: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes a deep reinforcement learning exploration method based on extreme novelty search with reference to fig. 1-6, which comprises the following steps:
s100, initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
s200, calculating exploration rewards based on conditional state probability and action probability based on the collected related data;
s300, correcting the exploration rewards according to extreme novelty state preference, and updating collected data;
s400, updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
s500, emptying the data collection list, saving model parameters, continuously training an artificial intelligent agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
In the invention, corresponding attention maps can be generated from the data and discretized; a virtual reward is generated by a virtual reward generator based on the data and its corresponding attention map; the data are updated with the virtual reward; the parameters of the policy network and the value network of the agent are updated from the updated data; the parameters of the virtual reward generator are updated from the updated data; and the parameters of the encoder module and the attention module are updated from the updated data, so that the virtual reward generator can distinguish similar images and the exploration efficiency of the agent is effectively improved.
The invention is mainly applied to simulation scenarios, including video game scenarios, robot simulators, and the like. Taking the MiniGrid games widely used in academia as an example, the invention can help an AI agent explore the environment efficiently in such scenarios and thereby improve the achieved performance. Furthermore, the proposed method belongs to the field of reinforcement learning and relies only on its basic assumption, namely the Markov Decision Process (MDP). Therefore, the proposed method can be used in any scenario that can be modeled as a Markov decision process.
Initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data comprises the following steps:
S101, randomly sampling a simulation environment from the environment set through the simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs a related action;
S102, the simulator takes the initial state and the output action as input and outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
S103, the policy network of the agent model takes the next-moment state as input and outputs a new action;
S104, the above steps are repeated until the termination condition is reached, and the simulator randomly samples a simulation environment from the environment set for the agent model's next interaction.
Specifically, the agent interacts with the simulation environment and collects data. The interaction process is as follows. Step 1: the simulator randomly samples a simulation environment from the environment set and gives an initial state s_0; the policy network of the agent model takes s_0 as input and outputs an action a_0. Step 2: the simulator takes s_0 and a_0 as input and outputs the next-moment state s_1, the reward r_1 obtained by the agent model, and the termination identifier d_1. Step 3: the policy network of the agent model takes s_1 as input and outputs a_1. Steps 2 and 3 are repeated until a set amount of data has been collected. Each datum has the form of a quintuple (s_t, s_{t+1}, a_t, r_t, d_t). When the environment reaches the termination condition, the simulator randomly samples a simulation environment from the environment set for the agent's next interaction.
When initializing the model parameters and the simulation environment, the parameters of the policy network and the value network of the agent are initialized; the P^{π_mix}(s_{t+1} | s_t, a_t) predictor, the P^{π_mix}(a_t | s_t) predictor, the N(s_{t+1}) predictor, and the N predictor are initialized; the data collection list is initialized; and the simulation environment is initialized. Note that the initial state data (image data) are returned after each environment is initialized.
In one specific example, an agent model interacts with an environment, collecting data;
s2-1, 128 simulation environments are used in parallel.
S2-2, for one of the parallel environments, sending the state data of the current environment into the current agent model strategy network to obtain action output corresponding to the current state.
S2-3, repeating the S2-2 process for all environments.
S2-4, each environment receives the actions of the intelligent agent to perform one-step forward simulation, and returns the state data, the rewarding information and the termination identifier of the next step.
S2-5, repeating steps S2-2 to S2-4 128 times, so that 128 trajectories of length 128 are obtained as training data, each datum being a quintuple (s_t, s_{t+1}, a_t, r_t, d_t); note that when an environment's simulation ends, the environment is reset so that the simulation can continue.
S2-6, storing the data into a data collection list.
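The parallel data-collection procedure S2-1 to S2-6 can be summarized in the following minimal sketch. It is an illustration only: the Gym-style environment API (reset/step), the PolicyNetwork interface with an act method, and the batch sizes are assumptions, not part of the original disclosure.

```python
# Minimal sketch of the parallel rollout collection (S2-1 to S2-6), assuming a
# Gym-style simulator API and a policy object exposing `act(state) -> action`.
# All class and function names here are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    s_t: object        # state at time t
    s_next: object     # state at time t+1
    a_t: int           # action taken
    r_t: float         # reward (later replaced by the corrected exploration reward)
    d_t: bool          # termination identifier

def collect_rollouts(envs, policy, num_steps: int = 128) -> List[Transition]:
    """Run `num_steps` steps in each parallel environment and return the
    collected quintuples (s_t, s_{t+1}, a_t, r_t, d_t)."""
    data: List[Transition] = []
    states = [env.reset() for env in envs]              # initial states from each sampled environment
    for _ in range(num_steps):
        for i, env in enumerate(envs):
            a_t = policy.act(states[i])                  # policy network maps the current state to an action
            s_next, r_t, done, _ = env.step(a_t)         # simulator returns next state, reward, termination flag
            data.append(Transition(states[i], s_next, a_t, r_t, done))
            states[i] = env.reset() if done else s_next  # reset and continue when an episode terminates
    return data
```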
Calculating the exploration reward based on the conditional state probability and the action probability from the collected data comprises:
S201, pulling data from the data list of collected data;
S202, calculating the exploration reward based on the conditional state probability and the action probability from the pulled data.
In the present invention, for each quintuple, the exploration reward b_t based on the conditional state probability and the action probability is computed.
Here π_mix is the historical mixture policy, P^{π_mix}(s_{t+1} | s_t, a_t) is the probability that, given policy π_mix, the agent transitions to s_{t+1} after performing a_t, and P^{π_mix}(a_t | s_t) is the probability that, given policy π_mix, the agent performs a_t. In a concrete computation these probabilities can be replaced by frequencies, i.e. P^{π_mix}(a_t | s_t) ≈ N(a_t)/N, where N is the number of interactions of the current agent with the environment, N(a_t) is the number of times action a_t has been executed, and N(s_{t+1}, a_t) is the number of times a transition to s_{t+1} followed the execution of a_t. Correspondingly, P^{π_mix}(s_{t+1} | s_t, a_t) ≈ N(s_{t+1}, a_t)/N(a_t).
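The frequency-based approximation above can be implemented with simple visitation counters, as in the sketch below. The exact closed form of b_t in the original filing is rendered as a formula image and is not recoverable from this text, so the expression inside exploration_bonus (negative log of the product of the two empirical probabilities) is an assumption, not the patented formula; states and actions are assumed to be hashable (e.g. tuple-encoded).

```python
import math
from collections import defaultdict

class CountModel:
    """Empirical frequency estimates of P(a_t | s_t) and P(s_{t+1} | s_t, a_t)
    under the historical mixture policy pi_mix, following the counts described above."""
    def __init__(self):
        self.N = 0                    # total number of agent-environment interactions
        self.N_a = defaultdict(int)   # N(a_t): times action a_t was executed
        self.N_sa = defaultdict(int)  # N(s_{t+1}, a_t): times a_t was followed by s_{t+1}

    def update(self, a_t, s_next):
        self.N += 1
        self.N_a[a_t] += 1
        self.N_sa[(s_next, a_t)] += 1

    def p_action(self, a_t) -> float:             # P^{pi_mix}(a_t | s_t) ~ N(a_t) / N
        return self.N_a[a_t] / max(self.N, 1)

    def p_transition(self, s_next, a_t) -> float:  # P^{pi_mix}(s_{t+1} | s_t, a_t) ~ N(s_{t+1}, a_t) / N(a_t)
        return self.N_sa[(s_next, a_t)] / max(self.N_a[a_t], 1)

def exploration_bonus(counts: CountModel, a_t, s_next, eps: float = 1e-8) -> float:
    # Assumed functional form: rarer action choices and rarer transitions yield a larger bonus.
    return -math.log(counts.p_transition(s_next, a_t) * counts.p_action(a_t) + eps)
```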
Correcting the exploration rewards according to extreme novel state preferences, updating the collected data, including:
s301, correcting exploration rewards based on extreme novel state preference;
s302, updating rewards obtained by the corresponding agent model in the collected data until all data in the data collection list are updated.
In the present invention, b_t is corrected based on the extreme novel state preference to obtain the corrected exploration reward b̃_t. Here 1[·] denotes the indicator function, N(s_{t+1}) is the number of times state s_{t+1} has been visited, and β ∈ [0,1] is a hyperparameter; the physical meaning of the correction is to encourage the agent to visit states that have never been visited before.
The reward in each quintuple (s_t, s_{t+1}, a_t, r_t, d_t) is then updated with the corrected exploration reward; in the present invention this update can be repeated 128 × 128 times, until all data in the data collection list have been updated.
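A sketch of this correction step is given below. Because the exact formula is not fully recoverable from the text, the sketch assumes one simple realization consistent with its stated meaning: a never-visited state keeps the full bonus, while a previously visited state has it damped by β. Both the damping rule and the default value of β are assumptions.

```python
from collections import defaultdict

state_visits = defaultdict(int)   # N(s_{t+1}): how many times each (hashable) state has been visited

def corrected_bonus(b_t: float, s_next, beta: float = 0.5) -> float:
    """Apply the extreme-novelty state preference to the exploration reward b_t.
    Assumed realization: states with N(s_{t+1}) == 0 keep the full bonus,
    previously visited states have it scaled by beta in [0, 1]."""
    never_visited = state_visits[s_next] == 0     # indicator 1[N(s_{t+1}) = 0]
    state_visits[s_next] += 1
    return b_t if never_visited else beta * b_t

# The corrected value replaces the reward stored in each quintuple
# (s_t, s_{t+1}, a_t, r_t, d_t) before the agent parameters are updated.
```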
Updating parameters of the agent model based on the updated collected data, the policy network updating with a policy gradient, comprising:
s401, pulling data from a data list of the collected related data;
s402, updating parameters of a strategy network and a value network of the intelligent agent model by using all data in the updated data list;
s403, updating parameters of the intelligent agent model through the updated data through an actor-critic reinforcement learning algorithm. UpdatingPredictor, & gt>Predictor, N(s) t+1 ) Predictor, N predictor.
In the invention, all quintuples are collected and the agent parameters are updated from the updated data according to an actor-critic algorithm. The policy network is updated with a policy gradient, which takes the standard actor-critic form
∇_θ J(θ) = (1/M) Σ_t ∇_θ log p_θ(a_t | s_t) · (R_t − V^π(s_t)),
where the policy network is denoted π, θ is the corresponding neural-network parameter, ∇_θ J(θ) is the policy gradient, V^π(s_t) is the value estimate of state s_t, p_θ(a_t | s_t) is the probability of selecting action a_t in state s_t, R_t is the return computed from the (corrected) rewards, and M is the size of the training data. The value-network loss function is the mean squared error between the return and the value estimate, L_V = (1/M) Σ_t (R_t − V^π(s_t))².
Here the Actor and the Critic are two different neural networks: the Actor predicts the probability of each action, while the Critic predicts the value of the current state.
The actor-critic approach combines the Policy Gradient (Actor) and Function Approximation (Critic) methods. The Actor selects an action according to a probability distribution, and the Critic (which may be value-based, e.g. trained with Q-learning) estimates the value of each state. The difference between the value of the next state and the value of the current state is the TD-error, which tells the Actor how strongly to update: if the TD-error is positive, the probability of the chosen action is increased more; if it is negative, the update amplitude of the Actor is reduced. In other words, the Critic scores the behavior produced by the Actor, and the Actor modifies the probability of selecting that behavior according to the Critic's score.
The Actor refers to the policy function π_θ(a|s), i.e. a policy is learned so as to obtain as much return as possible.
The Critic refers to the value function V^π(s); it estimates the value function of the current policy, that is, it evaluates how good the Actor is.
With the help of the value function, the actor-critic algorithm can update its parameters at every single step, without waiting until the end of the episode.
Because the Actor-Critic combines Policy Gradients with a value-based Critic, the Critic learns the relationship between the environment and the rewards and can see the potential reward of the Actor's current state, so it can guide the Actor to update at every step; with pure Policy Gradients, the Actor could only update once per episode. Single-step updates are therefore faster than the conventional Policy Gradient.
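The actor-critic update described above can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: a categorical policy head, a single optimizer holding both networks' parameters, and the stored (corrected) returns R_t used as targets; none of these details are specified in the filing.

```python
import torch
import torch.nn as nn

def actor_critic_update(policy_net: nn.Module, value_net: nn.Module,
                        optimizer: torch.optim.Optimizer,
                        states: torch.Tensor, actions: torch.Tensor,
                        returns: torch.Tensor) -> None:
    """One actor-critic update over a batch of M transitions.
    `returns` holds the (exploration-corrected) returns R_t; the optimizer is
    assumed to contain the parameters of both networks."""
    values = value_net(states).squeeze(-1)                  # V^pi(s_t): critic's value estimates
    advantages = returns - values.detach()                  # TD-error / advantage scaling the actor update

    log_probs = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)        # log p_theta(a_t | s_t)

    policy_loss = -(log_probs * advantages).mean()          # negative of the policy-gradient objective
    value_loss = (returns - values).pow(2).mean()           # squared error of the value network

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```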
Clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process, comprises the following steps:
S501, clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
S502, continuing to train the agent model until the iterations are complete.
In the present invention, the P^{π_mix}(s_{t+1} | s_t, a_t) and P^{π_mix}(a_t | s_t) predictors and the N(s_{t+1}) and N predictors are also updated from the data, and the above steps are repeated until the termination condition is reached.
Specifically, the data in the data collection list are cleared and the agent-model parameter update process is repeated 10^3 times; a version of the parameters of the policy network and the value network of the agent is then saved, together with the parameters of the P^{π_mix}(s_{t+1} | s_t, a_t) predictor, the P^{π_mix}(a_t | s_t) predictor, the N(s_{t+1}) predictor, and the N predictor.
Training of the agent continues until the iterations are complete, with the total amount of collected data exceeding 5 × 10^7 transitions.
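Putting the steps together, the outer training loop alternates rollout collection, reward correction, parameter updates, and periodic saving. The sketch below reuses the illustrative helpers from the earlier snippets; the checkpoint cadence and stopping criterion loosely follow the figures quoted above (a saved version after every 10^3 updates, roughly 5 × 10^7 transitions in total), but every function name remains an assumption. The policy object is assumed to expose both an act method for rollouts and a logits-producing forward pass for the update.

```python
import torch

def to_tensors(data):
    """Convert Transition quintuples to batched tensors (assumes numerically
    encodable states; real image observations would need an encoder).
    As a simplification, the corrected rewards are used directly as targets."""
    states = torch.tensor([tr.s_t for tr in data], dtype=torch.float32)
    actions = torch.tensor([tr.a_t for tr in data], dtype=torch.long)
    returns = torch.tensor([tr.r_t for tr in data], dtype=torch.float32)
    return states, actions, returns

def train(envs, policy, value_net, optimizer, counts,
          total_transitions: int = 50_000_000, save_every: int = 1_000):
    """Outer loop: collect data, compute and correct exploration rewards,
    update the networks, clear the buffer, and periodically save parameters."""
    collected, updates = 0, 0
    while collected < total_transitions:
        data = collect_rollouts(envs, policy)                      # S100: interact and collect
        for tr in data:
            counts.update(tr.a_t, tr.s_next)
            b_t = exploration_bonus(counts, tr.a_t, tr.s_next)     # S200: exploration reward
            tr.r_t = corrected_bonus(b_t, tr.s_next)               # S300: extreme-novelty correction
        actor_critic_update(policy, value_net, optimizer,
                            *to_tensors(data))                     # S400: actor-critic update
        collected += len(data)                                     # S500: clear buffer, save, iterate
        data.clear()
        updates += 1
        if updates % save_every == 0:
            torch.save({"policy": policy.state_dict(),
                        "value": value_net.state_dict()},
                       f"checkpoint_{updates}.pt")                 # illustrative checkpointing, not from the filing
```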
The invention provides a deep reinforcement learning exploration method based on extreme novelty search that takes the action entropy into account in the exploration objective, rather than using it only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Referring to FIG. 7, the invention also discloses a deep reinforcement learning exploration system based on extreme novelty search, the system comprising:
the data collection module 110 is used for initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
a search reward calculation module 120 for calculating a search reward based on the conditional state probability and the action probability based on the collected related data;
a reward update module 130 for modifying the exploration rewards according to extreme novel state preferences, updating the collected data;
the model parameter updating module 140 is configured to update parameters of the agent model based on the updated collected data, and update the policy network with a policy gradient;
the iteration module 150 is used for emptying the data collection list, saving model parameters, continuously training the artificial intelligent agent model until iteration is completed, and performing deep reinforcement learning exploration of the Markov decision process.
The data collection module randomly samples a simulation environment from the environment set through the simulator, gives an initial state, takes the initial state as input, and outputs related actions by the strategy network of the intelligent body model;
the simulator takes the initial state and the output related action as input, outputs the state at the next moment, and the intelligent body model obtains rewards and termination identifiers;
the strategy network of the intelligent agent model takes the next moment state as input and outputs new related actions;
and repeating the actions until reaching the termination condition, and randomly sampling a simulation environment from the environment set by the simulator to realize the next interaction of the intelligent body model.
The exploration rewards calculation module is used for pulling data from a data list of the collected related data;
the exploration rewards based on the conditional state probabilities and the action probabilities are calculated by pulling the data.
The reward updating module corrects the exploration rewards based on the extreme novel state preference;
and updates the rewards obtained by the agent model in the corresponding collected data, until all data in the data collection list have been updated.
The model parameter updating module is used for pulling data from the data list of collected data;
updating the parameters of the policy network and the value network of the agent model with all data in the updated data list;
and updating the parameters of the agent model from the updated data using an actor-critic reinforcement learning algorithm.
The iteration module is used for clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
training the agent model continues until the iteration is complete.
The invention provides a deep reinforcement learning exploration system based on extreme novelty search that takes the action entropy into account in the exploration objective, rather than using it only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Fig. 8 illustrates a physical structure diagram of an electronic device, as shown in fig. 8, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a deep reinforcement learning exploration method based on extreme novelty searches, the method comprising: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can perform a deep reinforcement learning exploration method based on extreme novelty search provided by the above methods, where the method includes: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
In yet another aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a deep reinforcement learning exploration method based on extreme novelty searching provided by the above methods, the method comprising: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The deep reinforcement learning exploration method based on the extreme novelty search is characterized by comprising the following steps of:
initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
2. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said initializing model parameters and simulation environment, interacting a preset agent model with the simulation environment, collecting data, comprises:
randomly sampling a simulation environment from the environment set through a simulator, giving an initial state, taking the initial state as input by a strategy network of the agent model, and outputting a related action;
the simulator takes the initial state and the output related action as input, outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
the strategy network of the agent model takes the next-moment state as input and outputs a new related action;
and repeating the above actions until reaching the termination condition, and randomly sampling a simulation environment from the environment set by the simulator to realize the next interaction of the agent model.
3. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said computing exploration rewards based on conditional state probabilities and action probabilities based on collected relevant data comprises:
pulling data from a data list of the collected related data;
the exploration rewards based on the conditional state probabilities and the action probabilities are calculated by pulling the data.
4. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said revising said exploration rewards according to extreme novelty state preferences, updating collected data, comprises:
revising the exploration rewards based on the extreme novel state preferences;
updating rewards obtained by the corresponding agent model in the collected data until all data in the data collection list are updated.
5. The deep reinforcement learning exploration method based on extreme novelty search of claim 1, wherein said updating parameters of an agent model based on updated collected data, updating a policy network with a policy gradient, comprises:
pulling data from a data list of the collected related data;
updating parameters of a strategy network and a value network of the intelligent agent model by using all data in the updated data list;
and updating parameters of the intelligent agent model with the updated data by an actor-critic reinforcement learning algorithm.
6. The method of claim 1, wherein the emptying the data collection list, saving model parameters, continuously training the artificial intelligence agent model until iteration is completed, and performing a deep reinforcement learning exploration of a markov decision process, comprising:
clearing data in the data collection list, repeating the parameter updating operation of the agent model, and storing parameters of a strategy network and a value network of the agent model;
training the agent model continues until the iteration is complete.
7. A deep reinforcement learning exploration system based on extreme novelty searching, the system comprising:
the data collection module is used for initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
the exploration rewards calculation module is used for calculating exploration rewards based on the conditional state probability and the action probability based on the collected related data;
the reward updating module is used for correcting the exploration rewards according to the extreme novel state preference and updating the collected data;
the model parameter updating module is used for updating the parameters of the intelligent agent model based on the updated collected data, and updating the strategy network by adopting strategy gradients;
and the iteration module is used for emptying the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of the Markov decision process.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the extreme novelty search-based deep reinforcement learning exploration method of any of claims 1 to 6.
CN202311264592.3A 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search Active CN117150927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311264592.3A CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311264592.3A CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Publications (2)

Publication Number Publication Date
CN117150927A true CN117150927A (en) 2023-12-01
CN117150927B CN117150927B (en) 2024-04-02

Family

ID=88910092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311264592.3A Active CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Country Status (1)

Country Link
CN (1) CN117150927B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962390A (en) * 2021-12-21 2022-01-21 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114004370A (en) * 2021-12-28 2022-02-01 中国科学院自动化研究所 Method for constructing regional sensitivity model based on deep reinforcement learning network
CN114528766A (en) * 2022-02-21 2022-05-24 浙江大学 Multi-intelligent hybrid cooperative optimization method based on reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962390A (en) * 2021-12-21 2022-01-21 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114004370A (en) * 2021-12-28 2022-02-01 中国科学院自动化研究所 Method for constructing regional sensitivity model based on deep reinforcement learning network
CN114528766A (en) * 2022-02-21 2022-05-24 浙江大学 Multi-intelligent hybrid cooperative optimization method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PEI XU et al.: "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 26 June 2023 (2023-06-26), pages 11717-11725 *

Also Published As

Publication number Publication date
CN117150927B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112668128B (en) Method and device for selecting terminal equipment nodes in federal learning system
Stolle et al. Learning options in reinforcement learning
CN111191934A (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
CN111768028B (en) GWLF model parameter adjusting method based on deep reinforcement learning
CN104657626A (en) Method for establishing protein-protein interaction network by utilizing text data
Pan et al. Stochastic generative flow networks
CN115526317A (en) Multi-agent knowledge inference method and system based on deep reinforcement learning
CN112613608A (en) Reinforced learning method and related device
CN113947022B (en) Near-end strategy optimization method based on model
Yuan et al. Euclid: Towards efficient unsupervised reinforcement learning with multi-choice dynamics model
Hu et al. GLSO: grammar-guided latent space optimization for sample-efficient robot design automation
Wu et al. Models as agents: optimizing multi-step predictions of interactive local models in model-based multi-agent reinforcement learning
CN117150927B (en) Deep reinforcement learning exploration method and system based on extreme novelty search
Zhang et al. Brain-inspired experience reinforcement model for bin packing in varying environments
CN111190711B (en) BDD combined heuristic A search multi-robot task allocation method
CN113326884A (en) Efficient learning method and device for large-scale abnormal graph node representation
Qian et al. Deep learning for a low-data drug design system
CN116502779A (en) Traveling merchant problem generation type solving method based on local attention mechanism
CN115022231B (en) Optimal path planning method and system based on deep reinforcement learning
Ho et al. Adaptive communication for distributed deep learning on commodity GPU cluster
Fortier et al. Parameter estimation in Bayesian networks using overlapping swarm intelligence
Lu et al. Sampling diversity driven exploration with state difference guidance
Ueda et al. Particle filter on episode for learning decision making rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant