CN117150927A - Deep reinforcement learning exploration method and system based on extreme novelty search - Google Patents
- Publication number: CN117150927A
- Application number: CN202311264592.3A
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F16/23: Information retrieval; database structures therefor; updating
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/092: Computing arrangements based on biological models; neural networks; reinforcement learning
Abstract
The invention relates to the technical field of deep reinforcement learning and provides a deep reinforcement learning exploration method and system based on extreme novelty search, the method comprising the following steps: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data; calculating, from the collected data, an exploration reward based on the conditional state probability and the action probability; correcting the exploration reward according to an extreme-novelty state preference and updating the collected data; updating the parameters of the agent model based on the updated data, with the policy network updated by policy gradient; and clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, thereby performing deep reinforcement learning exploration of a Markov decision process. The invention solves the problem that existing reinforcement learning agents explore inefficiently in the Markov decision process.
Description
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to a deep reinforcement learning exploration method and system based on extreme novelty search.
Background
With the rapid development of deep reinforcement learning technology, it has succeeded in many fields, from Go to video games, demonstrating tremendous potential. How to explore effectively in environments with sparse reward signals is one of the key problems facing model-free reinforcement learning. While researchers have proposed many approaches to this problem, most are designed for reinforcement learning tasks under a singleton setting.
Under a singleton setting, the specific task faced by the agent in every episode is the same, and it can be modeled as a Markov Decision Process (MDP). Recent studies have shown that agents trained under a singleton setting tend to overfit their task and may fail to generalize even to slightly different tasks. To address this generalization problem, researchers have proposed the procedurally-generated setting, under which the agent faces a large set of similar tasks. At the beginning of each episode, the agent interacts with one task from the set. For example, in a navigation task the agent must find a target in a maze, but the maze layout is regenerated before each episode starts. Reinforcement learning tasks under the procedurally-generated setting are modeled as a Contextual Markov Decision Process (C-MDP).
Exploration in a C-MDP is very challenging. Extensive research has shown that the count-based exploration methods widely used in the singleton setting, and their variants, are not suitable for exploration in a C-MDP, because the agent is unlikely to see the same state in different episodes.
Disclosure of Invention
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search, which are used for solving the problem that the exploration efficiency of the existing reinforcement learning agent in a Markov decision process is low.
The invention provides a deep reinforcement learning exploration method based on extreme novelty search, which comprises the following steps:
initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration reward according to an extreme-novelty state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, with the policy network updated by policy gradient;
clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, thereby performing deep reinforcement learning exploration of a Markov decision process.
According to the deep reinforcement learning exploration method based on extreme novelty search, initializing the model parameters and the simulation environment, having the preset agent model interact with the simulation environment, and collecting data comprises the following steps:
randomly sampling a simulation environment from the environment set via the simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs an action;
the simulator takes the initial state and the output action as input and outputs the state at the next moment; the agent model obtains a reward and a termination identifier;
the policy network of the agent model takes the next-moment state as input and outputs a new action;
repeating the above until the termination condition is reached, whereupon the simulator randomly samples a simulation environment from the environment set for the agent model's next interaction.
According to the deep reinforcement learning exploration method based on extreme novelty search, calculating, from the collected data, the exploration reward based on the conditional state probability and the action probability comprises the following steps:
pulling data from a data list of the collected related data;
calculating, from the pulled data, the exploration reward based on the conditional state probability and the action probability.
According to the deep reinforcement learning exploration method based on extreme novelty search, correcting the exploration reward according to the extreme-novelty state preference and updating the collected data comprises the following steps:
correcting the exploration reward based on the extreme-novelty state preference;
updating the reward obtained by the corresponding agent model in the collected data, until all data in the data collection list are updated.
According to the deep reinforcement learning exploration method based on extreme novelty search, updating the parameters of the agent model based on the updated collected data, with the policy network updated by policy gradient, comprises the following steps:
pulling data from a data list of the collected related data;
updating the parameters of the policy network and value network of the agent model using all data in the updated data list;
and updating the parameters of the agent model from the updated data via an actor-critic reinforcement learning algorithm.
According to the deep reinforcement learning exploration method based on extreme novelty search, emptying the data collection list, saving the model parameters, and continuing to train the agent model until iteration is completed, thereby performing deep reinforcement learning exploration of a Markov decision process, comprises the following steps:
clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and value network of the agent model;
continuing to train the agent model until iteration is complete.
The invention also provides a deep reinforcement learning exploration system based on extreme novelty search, which comprises:
the data collection module is used for initializing model parameters and a simulation environment, having the preset agent model interact with the simulation environment, and collecting data;
the exploration rewards calculation module is used for calculating exploration rewards based on the conditional state probability and the action probability based on the collected related data;
the reward updating module is used for correcting the exploration rewards according to the extreme novel state preference and updating the collected data;
the model parameter updating module is used for updating the parameters of the agent model based on the updated collected data, with the policy network updated by policy gradient;
the iteration module is used for emptying the data collection list, saving model parameters, and continuing to train the agent model until iteration is completed, performing deep reinforcement learning exploration of the Markov decision process.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on the extreme novelty search according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep reinforcement learning exploration method based on extreme novelty search as described in any of the above.
The present invention also provides a computer program product comprising a computer program which when executed by a processor implements a deep reinforcement learning exploration method based on an extreme novelty search as described in any of the above.
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search, in which action entropy is incorporated into the exploration objective itself rather than serving only as a regularization term of the optimization objective of the agent's policy network; an extreme-novelty state preference is also introduced to encourage the agent to preferentially explore states it has never visited, which can significantly improve the exploration efficiency of the reinforcement learning agent in contextual Markov decision process environments.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 2 is a second flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 3 is a third flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 4 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 5 is a schematic flow chart diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 6 is a flowchart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 7 is a schematic diagram of the modular connection of a deep reinforcement learning exploration system based on extreme novelty searching provided by the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
110: a data collection module; 120: a search reward calculation module; 130: a reward updating module; 140: a model parameter updating module; 150: an iteration module;
810: a processor; 820: a communication interface; 830: a memory; 840: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes a deep reinforcement learning exploration method based on extreme novelty search with reference to fig. 1-6, which comprises the following steps:
S100, initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
S200, calculating, from the collected data, an exploration reward based on the conditional state probability and the action probability;
S300, correcting the exploration reward according to the extreme-novelty state preference, and updating the collected data;
S400, updating the parameters of the agent model based on the updated collected data, with the policy network updated by policy gradient;
S500, emptying the data collection list, saving model parameters, and continuing to train the agent model until iteration is completed, performing deep reinforcement learning exploration of a Markov decision process.
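The overall S100-S500 loop can be sketched as below. The toy agent, toy environment, and fixed bonus value are illustrative stand-ins, not the patent's actual networks or reward formulas:

```python
import random

class ToyAgent:
    """Stand-in for the patent's agent model (policy + value networks)."""
    def act(self, state):
        return random.choice([0, 1])          # random policy for illustration
    def update(self, batch):
        pass                                  # placeholder for the S400 update

def collect(agent, n_steps=8):
    """S100: interact with a toy chain environment, storing (s, s', a, r, d)."""
    data, s = [], 0
    for _ in range(n_steps):
        a = agent.act(s)
        s_next = (s + a) % 4
        data.append([s, s_next, a, 0.0, False])
        s = s_next
    return data

def train(agent, iters=3):
    for _ in range(iters):
        batch = collect(agent)                # S100: collect data
        for t in batch:
            t[3] += 0.1                       # S200/S300: add the corrected bonus
        agent.update(batch)                   # S400: policy-gradient update
        batch.clear()                         # S500: empty the collection list
    return agent
```

The key structural point is that the bonus is written into the stored tuples before the update, and the data list is emptied after every round, matching steps S200 through S500.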
In the invention, corresponding attention maps are generated from the data, and the attention maps are discretized; a virtual reward is generated with the virtual reward generator based on the data and its corresponding attention map; the data are updated with the virtual rewards; the parameters of the agent's policy network and value network are updated from the updated data; the parameters of the virtual reward generator are updated from the updated data; and the parameters of the encoder module and attention module are updated from the updated data, so that the virtual reward generator can distinguish similar images, effectively improving the agent's exploration efficiency.
The invention is mainly intended for simulation scenarios, chiefly video game scenes, robot simulators, and the like. Taking the MiniGrid games widely used in academia as an example, the invention can help an AI agent explore the environment efficiently in such scenes, thereby achieving improved performance. Furthermore, the proposed method belongs to the field of reinforcement learning and relies only on the basic assumption of reinforcement learning, namely the Markov Decision Process (MDP). Thus, the proposed method can be used in any scenario that can be modeled as a Markov decision process.
Initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data, wherein the method comprises the following steps:
S101, randomly sampling a simulation environment from the environment set via the simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs an action;
S102, the simulator takes the initial state and the output action as input and outputs the state at the next moment; the agent model obtains a reward and a termination identifier;
S103, the policy network of the agent model takes the next-moment state as input and outputs a new action;
S104, repeating the above until the termination condition is reached, whereupon the simulator randomly samples a simulation environment from the environment set for the agent model's next interaction.
Specifically, the agent interacts with the simulation environment and collects data. The interaction process is as follows. Step 1: the simulator randomly samples a simulation environment from the environment set and gives an initial state s_0; the policy network of the agent model takes s_0 as input and outputs action a_0. Step 2: the simulator takes s_0 and a_0 as input and outputs the next-moment state s_1, the reward r_1 obtained by the agent model, and a termination identifier d_1. Step 3: the policy network of the agent model takes s_1 as input and outputs a_1. Steps 2 and 3 are repeated until a set amount of data is collected. The data take the form of five-tuples (s_t, s_{t+1}, a_t, r_t, d_t). When the environment reaches the termination condition, the simulator randomly samples a simulation environment from the environment set for the agent's next interaction.
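A minimal sketch of this per-episode interaction, assuming a toy environment class in place of the patent's simulator (the `ToyEnv` class and its goal-reaching dynamics are illustrative assumptions):

```python
import random

class ToyEnv:
    """Minimal stand-in for one task in the environment set."""
    def __init__(self, goal):
        self.goal = goal
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += a
        done = self.s >= self.goal
        return self.s, (1.0 if done else 0.0), done

def run_episode(policy, env_set, max_steps=50):
    """Sample an environment from the set, roll out until termination,
    and return the (s_t, s_{t+1}, a_t, r_t, d_t) five-tuples."""
    env = random.choice(env_set)       # simulator resamples a task per episode
    s = env.reset()
    tuples = []
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, d = env.step(a)
        tuples.append((s, s_next, a, r, d))
        s = s_next
        if d:
            break
    return tuples
```

Resampling an environment at the start of each episode is exactly what makes this a C-MDP rather than a single fixed MDP.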
When initializing the model parameters and the simulation environment, the parameters of the agent's policy network and value network are initialized; the N(s_{t+1}, a_t) predictor, N(a_t) predictor, N(s_{t+1}) predictor, and N predictor are initialized; the data collection list is initialized; and the simulation environment is initialized. Note that initial state data (image data) are returned after each environment is initialized.
In one specific example, an agent model interacts with an environment, collecting data;
s2-1, 128 simulation environments are used in parallel.
S2-2, for one of the parallel environments, feeding the state data of the current environment into the current agent model's policy network to obtain the action output corresponding to the current state.
S2-3, repeating the S2-2 process for all environments.
S2-4, each environment receives the actions of the intelligent agent to perform one-step forward simulation, and returns the state data, the rewarding information and the termination identifier of the next step.
S2-5, repeating steps S2-2 to S2-4 128 times to obtain training data consisting of 128 trajectories of length 128, in the form of five-tuples (s_t, s_{t+1}, a_t, r_t, d_t). Note that when an environment's simulation ends, that environment is reset so that simulation can continue.
S2-6, storing the data into a data collection list.
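Steps S2-1 through S2-6 can be sketched as below, with 4 toy environments standing in for the patent's 128 parallel simulators (the `ToyEnv` dynamics are illustrative assumptions):

```python
class ToyEnv:
    """Minimal environment: walk right until position 3 is reached."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += a
        done = self.s >= 3
        return self.s, float(done), done

def collect_parallel(policy, n_envs=4, n_steps=4):
    """Step n_envs environments in lockstep (S2-2/S2-3), auto-resetting
    finished ones (S2-4), and append five-tuples to the data list (S2-6)."""
    envs = [ToyEnv() for _ in range(n_envs)]
    states = [e.reset() for e in envs]
    data = []
    for _ in range(n_steps):
        actions = [policy(s) for s in states]          # one action per env
        for i, (e, a) in enumerate(zip(envs, actions)):
            s_next, r, d = e.step(a)
            data.append((states[i], s_next, a, r, d))
            states[i] = e.reset() if d else s_next     # reset on termination
    return data
```

This yields `n_envs * n_steps` tuples per collection round, mirroring the 128-trajectory, length-128 batch described above.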
Calculating, from the collected data, an exploration reward based on the conditional state probability and the action probability comprises:
S201, pulling data from a data list of the collected related data;
S202, calculating, from the pulled data, the exploration reward based on the conditional state probability and the action probability.
In the present invention, for each five-tuple, the exploration reward b_t based on the conditional state probability and the action probability is computed from p_{π_mix}(s_{t+1}|a_t) and p_{π_mix}(a_t), where π_mix is the historical mixture policy, p_{π_mix}(s_{t+1}|a_t) is the probability of transitioning to s_{t+1} after the agent performs a_t under policy π_mix, and p_{π_mix}(a_t) is the probability that the agent performs a_t under policy π_mix. In the concrete calculation, the probabilities can be replaced by frequencies, i.e. p_{π_mix}(a_t) ≈ N(a_t)/N and, correspondingly, p_{π_mix}(s_{t+1}|a_t) ≈ N(s_{t+1}, a_t)/N(a_t), where N is the number of interactions of the current agent with the environment, N(a_t) is the number of times action a_t has been performed, and N(s_{t+1}, a_t) is the number of times performing a_t led to a transition to s_{t+1}.
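The frequency substitution can be implemented with simple counters. Since the displayed formula for b_t is not legible in this text, the way the two probabilities are combined below (a negative log of each) is an assumption, not the patent's exact definition:

```python
import math
from collections import Counter

class FrequencyBonus:
    """Frequency estimates from the text: p(a) ~ N(a)/N and
    p(s'|a) ~ N(s', a)/N(a).  Combining them as -log p(s'|a) - log p(a)
    is one plausible form of b_t, assumed here for illustration."""
    def __init__(self):
        self.n = 0                 # N: total interactions
        self.n_a = Counter()       # N(a_t)
        self.n_sa = Counter()      # N(s_{t+1}, a_t)

    def update(self, a, s_next):
        self.n += 1
        self.n_a[a] += 1
        self.n_sa[(s_next, a)] += 1

    def bonus(self, a, s_next):
        p_a = self.n_a[a] / self.n                       # N(a)/N
        p_s_given_a = self.n_sa[(s_next, a)] / self.n_a[a]  # N(s',a)/N(a)
        return -math.log(p_s_given_a) - math.log(p_a)
```

Rare transitions and rare actions both make the estimated probabilities small, so the bonus grows for novel behavior.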
Correcting the exploration reward according to the extreme-novelty state preference and updating the collected data includes:
S301, correcting the exploration reward based on the extreme-novelty state preference;
S302, updating the reward obtained by the corresponding agent model in the collected data, until all data in the data collection list are updated.
In the present invention, b_t is corrected based on the extreme-novelty state preference to obtain the corrected reward b̂_t. The correction involves an indicator function, the count N(s_{t+1}) of times state s_{t+1} has been visited, and a hyperparameter β ∈ [0, 1]. The physical meaning of the correction is to encourage the agent to visit states that have never been visited before.
The reward in each five-tuple (s_t, s_{t+1}, a_t, r_t, d_t) is then updated with b̂_t; in the invention this update of the exploration reward is repeated 128 × 128 times, until all data in the data collection list have been updated.
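The exact corrected formula is not legible in this text; one plausible reading, assumed here, is to keep the full bonus only when the indicator 1[N(s_{t+1}) = 0] fires (a never-visited state) and to scale it by β otherwise:

```python
class NoveltyCorrection:
    """Hypothetical extreme-novelty preference: full bonus b_t for
    first-visit states, bonus scaled by beta in [0, 1] for revisits."""
    def __init__(self, beta=0.5):
        self.beta = beta
        self.visits = {}           # N(s'): visit counts per state

    def correct(self, s_next, b):
        first_visit = self.visits.get(s_next, 0) == 0   # indicator 1[N(s')=0]
        self.visits[s_next] = self.visits.get(s_next, 0) + 1
        return b if first_visit else self.beta * b
```

Whatever the precise form, the effect described in the text is the same: never-visited states receive a strictly larger corrected bonus than previously seen ones.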
Updating the parameters of the agent model based on the updated collected data, with the policy network updated by policy gradient, comprises:
S401, pulling data from a data list of the collected related data;
S402, updating the parameters of the policy network and value network of the agent model using all data in the updated data list;
S403, updating the parameters of the agent model from the updated data via an actor-critic reinforcement learning algorithm, and updating the N(s_{t+1}, a_t) predictor, N(a_t) predictor, N(s_{t+1}) predictor, and N predictor.
In the invention, all five-tuples are collected, and the updated data are used to update the agent's parameters according to an actor-critic algorithm. The policy network is updated by policy gradient, computed as
∇_θ J(θ) = (1/M) Σ_t ∇_θ log p_θ(a_t|s_t) (r̂_t + γ V^π(s_{t+1}) - V^π(s_t)),
where the policy network is denoted π, θ is the corresponding neural network parameter, ∇_θ J(θ) is the policy gradient, V^π(s_t) is the value estimate of state s_t, p_θ(a_t|s_t) is the probability of selecting action a_t in state s_t, r̂_t is the updated reward, γ is the discount factor, and M is the size of the training data. The value network loss function is
L_V = (1/M) Σ_t (r̂_t + γ V^π(s_{t+1}) - V^π(s_t))².
Here the Actor and the Critic are two different neural networks: the Actor is used to predict the probability of each action, and the Critic predicts the value of the current state.
Combining policy gradient (Actor) and value-function approximation (Critic): the Actor selects actions based on probabilities, and the Critic (which may use Q-learning or another value-based method) estimates the value of each state. Subtracting the value of the current state from the bootstrapped value of the next state gives the TD error. The Critic tells the Actor how to update: if the TD error is positive, the probability of the chosen action is increased more strongly; if negative, the Actor's update magnitude is reduced. In short, the Critic scores the Actor's behavior, and the Actor adjusts its action probabilities according to the Critic's score.
The Actor refers to the policy function π_θ(a|s); it learns a policy that obtains as much return as possible.
The Critic refers to the value function V^π(s); it estimates the value function of the current policy, i.e., evaluates how good the Actor is.
With the help of the value function, the actor-critic algorithm can update parameters at every single step, without waiting until the end of the episode.
The actor-critic architecture combines Policy Gradients and value-based methods. By learning the relationship between the environment and rewards, the Critic can see the potential reward of the Actor's current state and thus guide the Actor's update at every step, whereas with pure Policy Gradients the Actor could only update once per episode. Single-step updates therefore learn faster than the conventional Policy Gradient.
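The single-step update described above can be sketched with a tabular softmax actor and a tabular critic, which are toy stand-ins for the patent's policy and value networks:

```python
import numpy as np

def actor_critic_step(theta, v, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """One single-step actor-critic update: the critic computes the TD
    error, updates its value table, and the actor takes a policy-gradient
    step scaled by that TD error (no need to wait for episode end)."""
    td_target = r + (0.0 if done else gamma * v[s_next])
    td_error = td_target - v[s]                  # Critic's score of the action
    v[s] += lr * td_error                        # critic (value) update
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    grad_log = -probs
    grad_log[a] += 1.0                           # grad of log pi(a|s) wrt theta[s]
    theta[s] += lr * td_error * grad_log         # actor update
    return td_error
```

A positive TD error raises the logit of the chosen action and lowers the others; a negative TD error does the reverse, exactly the behavior the text attributes to the Critic's feedback.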
Clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is completed, thereby performing deep reinforcement learning exploration of a Markov decision process, comprises the following steps:
S501, clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and value network of the agent model;
s502, continuously training the agent model until iteration is completed.
In the present invention, the above updates are performed based on the data, and the steps are repeated until the termination condition is reached.
Specifically, the data in the data collection list are cleared and the parameter update process of the agent model is repeated; every 10^3 parameter updates, a version is saved: the parameters of the policy network and value network of the agent, together with the parameters of the N(s_{t+1}, a_t) predictor, N(a_t) predictor, N(s_{t+1}) predictor, and N predictor.
Training of the agent continues until iteration is complete, with the total amount of collected data exceeding 5 × 10^7.
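The iteration logic above (clear the list each round, checkpoint periodically, stop once the data budget is exhausted) can be sketched as follows; the small budget and 10-round checkpoint interval are illustrative, where the patent uses 10^3 updates per saved version and a budget of over 5 × 10^7 transitions:

```python
def training_loop(collect, update, save, budget=1000, ckpt_every=10):
    """Repeat: collect data, update parameters, clear the list; save
    parameters periodically; stop once total collected data exceeds
    the budget."""
    total, rounds, data = 0, 0, []
    while total < budget:
        data.extend(collect())
        update(data)                  # agent / predictor parameter updates
        total += len(data)
        data.clear()                  # empty the data collection list
        rounds += 1
        if rounds % ckpt_every == 0:
            save(rounds)              # save policy/value/predictor parameters
    return total, rounds
```

The `collect`, `update`, and `save` callables are placeholders for the components defined in the preceding steps.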
The invention provides a deep reinforcement learning exploration method based on extreme novelty search, in which action entropy is incorporated into the exploration objective itself rather than serving only as a regularization term of the optimization objective of the agent's policy network; an extreme-novelty state preference is also introduced to encourage the agent to preferentially explore states it has never visited, which can significantly improve the exploration efficiency of the reinforcement learning agent in contextual Markov decision process environments.
Referring to FIG. 7, the invention also discloses a deep reinforcement learning exploration system based on extreme novelty search, the system comprising:
the data collection module 110 is used for initializing model parameters and a simulation environment, having the preset agent model interact with the simulation environment, and collecting data;
a search reward calculation module 120 for calculating a search reward based on the conditional state probability and the action probability based on the collected related data;
a reward update module 130 for modifying the exploration rewards according to extreme novel state preferences, updating the collected data;
the model parameter updating module 140 is configured to update parameters of the agent model based on the updated collected data, and update the policy network with a policy gradient;
the iteration module 150 is used for emptying the data collection list, saving model parameters, and continuing to train the agent model until iteration is completed, performing deep reinforcement learning exploration of the Markov decision process.
The data collection module randomly samples a simulation environment from the environment set via the simulator and gives an initial state; the policy network of the agent model takes the initial state as input and outputs an action;
the simulator takes the initial state and the output action as input and outputs the state at the next moment; the agent model obtains a reward and a termination identifier;
the policy network of the agent model takes the next-moment state as input and outputs a new action;
the above is repeated until the termination condition is reached, whereupon the simulator randomly samples a simulation environment from the environment set for the agent model's next interaction.
The exploration reward calculation module pulls data from the data list of the collected data;
the exploration reward based on the conditional state probabilities and the action probabilities is calculated from the pulled data.
The reward update module corrects the exploration reward according to the extreme-novelty state preference;
the reward obtained by the corresponding agent model in the collected data is updated until all data in the data collection list have been updated.
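One way this correction could work, under the assumption that "extreme novelty" means a large additive bonus for successor states never visited before (`beta` is a hypothetical hyperparameter, not taken from the patent):

```python
def correct_rewards(transitions, rewards, visited, beta=10.0):
    """Correct exploration rewards with an extreme-novelty state preference:
    transitions whose successor state has never been visited get a large
    bonus, and the rewards in the data list are overwritten in place.
    (Assumed form of the correction, for illustration only.)"""
    updated = []
    for (s, a, r, s2, done), r_explore in zip(transitions, rewards):
        bonus = beta if s2 not in visited else 0.0  # prefer never-visited states
        visited.add(s2)
        updated.append((s, a, r_explore + bonus, s2, done))
    return updated
```

The `visited` set persists across batches, so a state earns its bonus only the first time the agent ever reaches it.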
The model parameter update module pulls data from the data list of the collected data;
the parameters of the policy network and the value network of the agent model are updated using all the data in the updated data list;
the agent model parameters are updated from the updated data via an actor-critic reinforcement learning algorithm.
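A tabular actor-critic update along these lines can be sketched as follows, with dictionaries standing in for the patent's policy and value networks; `alpha` and `gamma` are assumed hyperparameters.

```python
import math
from collections import defaultdict

def softmax_probs(prefs):
    """Softmax over a dict of action preferences."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_update(transitions, theta, V, alpha=0.1, gamma=0.99):
    """One pass over the collected data: the critic V moves toward the TD
    target, and the actor's preferences theta take a policy-gradient step
    weighted by the TD error (the advantage estimate)."""
    for (s, a, r, s2, done) in transitions:
        target = r + (0.0 if done else gamma * V[s2])
        td_error = target - V[s]       # advantage estimate
        V[s] += alpha * td_error       # critic (value network) update
        probs = softmax_probs(theta[s])
        for b in theta[s]:             # actor (policy network) gradient step
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * td_error * grad
    return theta, V

theta = defaultdict(lambda: {0: 0.0, 1: 0.0})  # actor: per-state action preferences
V = defaultdict(float)                          # critic: per-state value estimates
actor_critic_update([(0, 1, 1.0, 0, True)], theta, V)
```

After a rewarded action, the actor's preference for that action rises and the critic's value estimate for the originating state increases.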
The iteration module clears the data in the data collection list, repeats the parameter update operation of the agent model, and saves the parameters of the policy network and the value network of the agent model;
training of the agent model continues until iteration is complete.
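The outer iteration described by this module reduces to the following loop skeleton, with trivial stub functions standing in for the collection and update steps of the preceding modules:

```python
def train(num_iterations=3):
    """Skeleton of the iteration module: collect, update, clear the data
    collection list, and save parameters, until iteration completes.
    The inner functions are illustrative stubs, not the patented steps."""
    def collect(data):
        data.append(("s", "a", 0.0))    # stub: one transition per iteration
        return data

    def update_params(data):
        return {"policy": len(data), "value": len(data)}  # stub update

    saved = []
    data_list = []
    for it in range(num_iterations):
        collect(data_list)                   # data collection
        params = update_params(data_list)    # reward computation and model update
        data_list.clear()                    # empty the data collection list
        saved.append(dict(params))           # save model parameters
    return saved

history = train(3)
```

Because the list is cleared after every update, each iteration's update sees only that iteration's freshly collected data.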
The invention provides a deep reinforcement learning exploration system based on extreme novelty search that incorporates action entropy directly into the exploration objective, rather than using it only as a regularization term of the optimization objective of the agent's policy network. An extreme-novelty state preference is further introduced to encourage the agent to preferentially explore states that have never been visited, which significantly improves the exploration efficiency of a reinforcement learning agent in Markov decision process environments.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 810, a communications interface 820, a memory 830, and a communication bus 840, where the processor 810, the communications interface 820, and the memory 830 communicate with one another through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the deep reinforcement learning exploration method based on extreme novelty search, the method comprising: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme-novelty state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, and updating the policy network with a policy gradient;
clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, where the computer program can be stored on a non-transitory computer-readable storage medium and, when executed by a processor, causes the computer to perform the deep reinforcement learning exploration method based on extreme novelty search provided by the above methods, the method comprising: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme-novelty state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, and updating the policy network with a policy gradient;
clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
In yet another aspect, the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep reinforcement learning exploration method based on extreme novelty search provided by the above methods, the method comprising: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme-novelty state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, and updating the policy network with a policy gradient;
clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. The deep reinforcement learning exploration method based on the extreme novelty search is characterized by comprising the following steps of:
initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme-novelty state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, and updating the policy network with a policy gradient;
clearing the data collection list, saving the model parameters, and continuing to train the agent model until iteration is complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
2. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said initializing model parameters and simulation environment, interacting a preset agent model with the simulation environment, collecting data, comprises:
randomly sampling a simulation environment from the environment set through a simulator and giving an initial state, wherein the policy network of the agent model takes the initial state as input and outputs a corresponding action;
the simulator takes the initial state and the output action as input and outputs the state at the next time step, and the agent model obtains a reward and a termination flag;
the policy network of the agent model takes the next-time-step state as input and outputs a new action;
and repeating the above steps until the termination condition is reached, with the simulator randomly sampling a new simulation environment from the environment set for the agent model's next interaction.
3. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said calculating an exploration reward based on the conditional state probability and the action probability from the collected data comprises:
pulling data from the data list of the collected data;
the exploration reward based on the conditional state probabilities and the action probabilities is calculated from the pulled data.
4. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said revising said exploration rewards according to extreme novelty state preferences, updating collected data, comprises:
revising the explore rewards based on the extreme novel state preferences;
updating rewards obtained by the corresponding agent model in the collected data until all data in the data collection list are updated.
5. The deep reinforcement learning exploration method based on extreme novelty search of claim 1, wherein said updating parameters of an agent model based on updated collected data, updating a policy network with a policy gradient, comprises:
pulling data from a data list of the collected related data;
updating the parameters of the policy network and the value network of the agent model using all the data in the updated data list;
and updating the parameters of the agent model from the updated data via an actor-critic reinforcement learning algorithm.
6. The method of claim 1, wherein the emptying the data collection list, saving model parameters, continuously training the artificial intelligence agent model until iteration is completed, and performing a deep reinforcement learning exploration of a markov decision process, comprising:
clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
training the agent model continues until the iteration is complete.
7. A deep reinforcement learning exploration system based on extreme novelty searching, the system comprising:
the data collection module is used for initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
the exploration reward calculation module is used for calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
the reward update module is used for correcting the exploration reward according to the extreme-novelty state preference and updating the collected data;
the model parameter updating module is used for updating the parameters of the agent model based on the updated collected data and updating the policy network with a policy gradient;
and the iteration module is used for emptying the data collection list, saving the model parameters, and continuing to train the artificial intelligence agent model until iteration is complete, to perform deep reinforcement learning exploration of the Markov decision process.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the extreme novelty search-based deep reinforcement learning exploration method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311264592.3A CN117150927B (en) | 2023-09-27 | 2023-09-27 | Deep reinforcement learning exploration method and system based on extreme novelty search |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117150927A true CN117150927A (en) | 2023-12-01 |
CN117150927B CN117150927B (en) | 2024-04-02 |
Family
ID=88910092
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311264592.3A Active CN117150927B (en) | 2023-09-27 | 2023-09-27 | Deep reinforcement learning exploration method and system based on extreme novelty search |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117150927B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113962390A (en) * | 2021-12-21 | 2022-01-21 | 中国科学院自动化研究所 | Method for constructing diversified search strategy model based on deep reinforcement learning network |
CN114004370A (en) * | 2021-12-28 | 2022-02-01 | 中国科学院自动化研究所 | Method for constructing regional sensitivity model based on deep reinforcement learning network |
CN114528766A (en) * | 2022-02-21 | 2022-05-24 | 浙江大学 | Multi-intelligent hybrid cooperative optimization method based on reinforcement learning |
Non-Patent Citations (1)
Title |
---|
PEI XU et al.: "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 37, 26 June 2023 (2023-06-26), pages 11717 - 11725 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||