CN117150927A - Deep reinforcement learning exploration method and system based on extreme novelty search

Deep reinforcement learning exploration method and system based on extreme novelty search

Info

Publication number
CN117150927A
CN117150927A (application CN202311264592.3A)
Authority
CN
China
Prior art keywords
data
exploration
reinforcement learning
updating
extreme
Prior art date
Legal status
Granted
Application number
CN202311264592.3A
Other languages
Chinese (zh)
Other versions
CN117150927B (en)
Inventor
路圣汉
Current Assignee
Beijing Hanbo Technology Co ltd
Original Assignee
Beijing Hanbo Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Hanbo Technology Co ltd filed Critical Beijing Hanbo Technology Co ltd
Priority to CN202311264592.3A priority Critical patent/CN117150927B/en
Publication of CN117150927A publication Critical patent/CN117150927A/en
Application granted granted Critical
Publication of CN117150927B publication Critical patent/CN117150927B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; database structures therefor; file system structures therefor, of structured data, e.g. relational data
    • G06F16/23 - Updating
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning

Abstract

The invention relates to the technical field of deep reinforcement learning and provides a deep reinforcement learning exploration method and system based on extreme novelty search. The method comprises the following steps: initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data; calculating an exploration reward based on the conditional state probability and the action probability from the collected data; correcting the exploration reward according to the extreme novel state preference and updating the collected data; updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient; and clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process. The invention solves the problem of low exploration efficiency of existing reinforcement learning agents in the Markov decision process.

Description

Deep reinforcement learning exploration method and system based on extreme novelty search
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to a deep reinforcement learning exploration method and system based on extreme novelty search.
Background
With the rapid development of deep reinforcement learning, the technology has achieved success in many fields; from Go to video games, deep reinforcement learning has demonstrated tremendous potential. How to explore effectively in environments with sparse reward signals is one of the key problems faced by model-free reinforcement learning. Although researchers have proposed many approaches to this problem, most of them are designed for reinforcement learning tasks under the singleton (single-instance) setting.
Under the singleton setting, the specific task the agent faces is the same in every episode and can be modeled as a Markov Decision Process (MDP). Recent studies have shown that agents trained under the singleton setting tend to overfit the task and may fail to generalize even to slightly different tasks. To address this generalization problem, researchers have proposed the procedurally-generated setting, under which the agent faces a large collection of similar tasks. At the beginning of each episode, the agent interacts with one task drawn from the collection. For example, in a navigation task the agent is required to find a target in a maze, but the maze layout is regenerated before each episode starts. Reinforcement learning tasks under the procedurally-generated setting are modeled as a Contextual Markov Decision Process (C-MDP).
Exploration in a C-MDP is very challenging. Extensive research has shown that the count-based exploration methods widely used in the singleton setting, as well as their variants, are not suitable for exploration in a C-MDP, because the agent is unlikely to encounter the same state in different episodes.
Disclosure of Invention
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search to solve the problem of low exploration efficiency of existing reinforcement learning agents in the Markov decision process.
The invention provides a deep reinforcement learning exploration method based on extreme novelty search, which comprises the following steps:
initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
correcting the exploration reward according to the extreme novel state preference, and updating the collected data;
updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient;
and clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process.
According to the deep reinforcement learning exploration method based on extreme novelty search, the model parameters and the simulation environment are initialized, the preset agent model interacts with the simulation environment, and data are collected, which comprises:
randomly sampling a simulation environment from the environment set through a simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs a related action;
the simulator takes the initial state and the output action as input and outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
the policy network of the agent model takes the next-moment state as input and outputs a new action;
and repeating the above steps until the termination condition is reached, the simulator then randomly sampling a simulation environment from the environment set for the agent model's next interaction.
According to the deep reinforcement learning exploration method based on extreme novelty search, the exploration reward based on the conditional state probability and the action probability is calculated from the collected data, which comprises:
pulling data from the data list of collected data;
calculating the exploration reward based on the conditional state probability and the action probability from the pulled data.
According to the deep reinforcement learning exploration method based on extreme novelty search, the exploration reward is corrected according to the extreme novel state preference and the collected data are updated, which comprises:
correcting the exploration reward based on the extreme novel state preference;
updating the reward obtained by the agent model in the corresponding collected data, until all data in the data collection list have been updated.
According to the deep reinforcement learning exploration method based on extreme novelty search, the parameters of the agent model are updated based on the updated collected data and the policy network is updated with a policy gradient, which comprises:
pulling data from the data list of collected data;
updating the parameters of the policy network and the value network of the agent model with all data in the updated data list;
and updating the parameters of the agent model from the updated data using an actor-critic reinforcement learning algorithm.
According to the deep reinforcement learning exploration method based on extreme novelty search, the data collection list is cleared, the model parameters are saved, and the agent model is continuously trained until the iterations are complete so as to perform deep reinforcement learning exploration of a Markov decision process, which comprises:
clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
continuing to train the agent model until the iterations are complete.
The invention also provides a deep reinforcement learning exploration system based on extreme novelty search, which comprises:
the data collection module, used for initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data;
the exploration reward calculation module, used for calculating an exploration reward based on the conditional state probability and the action probability from the collected data;
the reward updating module, used for correcting the exploration reward according to the extreme novel state preference and updating the collected data;
the model parameter updating module, used for updating the parameters of the agent model based on the updated collected data, the policy network being updated with a policy gradient;
and the iteration module, used for clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of the Markov decision process.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on the extreme novelty search according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep reinforcement learning exploration method based on extreme novelty search as described in any of the above.
The present invention also provides a computer program product comprising a computer program which when executed by a processor implements a deep reinforcement learning exploration method based on an extreme novelty search as described in any of the above.
The invention provides a deep reinforcement learning exploration method and system based on extreme novelty search in which the action entropy is taken into account in the exploration objective, rather than being used only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 2 is a second flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 3 is a third flow chart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 4 is a schematic flow diagram of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 5 is a schematic flow chart diagram of a deep reinforcement learning exploration method based on extreme novelty searching;
FIG. 6 is a flowchart of a deep reinforcement learning exploration method based on extreme novelty searching according to the present invention;
FIG. 7 is a schematic diagram of the modular connection of a deep reinforcement learning exploration system based on extreme novelty searching provided by the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
110: a data collection module; 120: a search reward calculation module; 130: a reward updating module; 140: a model parameter updating module; 150: an iteration module;
810: a processor; 820: a communication interface; 830: a memory; 840: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following describes a deep reinforcement learning exploration method based on extreme novelty search with reference to fig. 1-6, which comprises the following steps:
s100, initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
s200, calculating exploration rewards based on conditional state probability and action probability based on the collected related data;
s300, correcting the exploration rewards according to extreme novelty state preference, and updating collected data;
s400, updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
s500, emptying the data collection list, saving model parameters, continuously training an artificial intelligent agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
In the invention, corresponding attention maps can be generated from the data and discretized; a virtual reward is generated by a virtual reward generator based on the data and its corresponding attention map; the data are updated with the virtual reward; the parameters of the policy network and the value network of the agent are updated from the updated data; the parameters of the virtual reward generator are updated from the updated data; and the parameters of the encoder module and the attention module are updated from the updated data, so that the virtual reward generator can distinguish similar images and the exploration efficiency of the agent is effectively improved.
The invention is mainly applied to simulation scenarios, including video game scenarios, robot simulators, and the like. Taking the MiniGrid games widely used in academia as an example, the invention can help an AI agent explore the environment efficiently in such scenarios and thereby improve the achieved performance. Furthermore, the proposed method belongs to the field of reinforcement learning and relies only on its basic assumption, namely the Markov Decision Process (MDP). Therefore, the proposed method can be used in any scenario that can be modeled as a Markov decision process.
Initializing model parameters and a simulation environment, having a preset agent model interact with the simulation environment, and collecting data comprises the following steps:
S101, randomly sampling a simulation environment from the environment set through the simulator and giving an initial state; the policy network of the agent model takes the initial state as input and outputs a related action;
S102, the simulator takes the initial state and the output action as input and outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
S103, the policy network of the agent model takes the next-moment state as input and outputs a new action;
S104, the above steps are repeated until the termination condition is reached, and the simulator randomly samples a simulation environment from the environment set for the agent model's next interaction.
Specifically, the agent interacts with the simulation environment and collects data. The interaction process is as follows. Step 1: the simulator randomly samples a simulation environment from the environment set and gives an initial state s_0; the policy network of the agent model takes s_0 as input and outputs an action a_0. Step 2: the simulator takes s_0 and a_0 as input and outputs the next-moment state s_1, the reward r_1 obtained by the agent model, and the termination identifier d_1. Step 3: the policy network of the agent model takes s_1 as input and outputs a_1. Steps 2 and 3 are repeated until a set amount of data has been collected. Each datum has the form of a quintuple (s_t, s_{t+1}, a_t, r_t, d_t). When the environment reaches the termination condition, the simulator randomly samples a simulation environment from the environment set for the agent's next interaction.
When initializing the model parameters and the simulation environment, the parameters of the policy network and the value network of the agent are initialized; the P^{π_mix}(s_{t+1} | s_t, a_t) predictor, the P^{π_mix}(a_t | s_t) predictor, the N(s_{t+1}) predictor, and the N predictor are initialized; the data collection list is initialized; and the simulation environment is initialized. Note that the initial state data (image data) are returned after each environment is initialized.
In one specific example, an agent model interacts with an environment, collecting data;
s2-1, 128 simulation environments are used in parallel.
S2-2, for one of the parallel environments, sending the state data of the current environment into the current agent model strategy network to obtain action output corresponding to the current state.
S2-3, repeating the S2-2 process for all environments.
S2-4, each environment receives the actions of the intelligent agent to perform one-step forward simulation, and returns the state data, the rewarding information and the termination identifier of the next step.
S2-5, repeating steps S2-2 to S2-4 128 times, so that 128 trajectories of length 128 are obtained as training data, each datum being a quintuple (s_t, s_{t+1}, a_t, r_t, d_t); note that when an environment's simulation ends, the environment is reset so that the simulation can continue.
S2-6, storing the data into a data collection list.
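The parallel data-collection procedure S2-1 to S2-6 can be summarized in the following minimal sketch. It is an illustration only: the Gym-style environment API (reset/step), the PolicyNetwork interface with an act method, and the batch sizes are assumptions, not part of the original disclosure.

```python
# Minimal sketch of the parallel rollout collection (S2-1 to S2-6), assuming a
# Gym-style simulator API and a policy object exposing `act(state) -> action`.
# All class and function names here are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    s_t: object        # state at time t
    s_next: object     # state at time t+1
    a_t: int           # action taken
    r_t: float         # reward (later replaced by the corrected exploration reward)
    d_t: bool          # termination identifier

def collect_rollouts(envs, policy, num_steps: int = 128) -> List[Transition]:
    """Run `num_steps` steps in each parallel environment and return the
    collected quintuples (s_t, s_{t+1}, a_t, r_t, d_t)."""
    data: List[Transition] = []
    states = [env.reset() for env in envs]              # initial states from each sampled environment
    for _ in range(num_steps):
        for i, env in enumerate(envs):
            a_t = policy.act(states[i])                  # policy network maps the current state to an action
            s_next, r_t, done, _ = env.step(a_t)         # simulator returns next state, reward, termination flag
            data.append(Transition(states[i], s_next, a_t, r_t, done))
            states[i] = env.reset() if done else s_next  # reset and continue when an episode terminates
    return data
```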
Calculating the exploration reward based on the conditional state probability and the action probability from the collected data comprises:
S201, pulling data from the data list of collected data;
S202, calculating the exploration reward based on the conditional state probability and the action probability from the pulled data.
In the present invention, for each quintuple, the exploration reward b_t based on the conditional state probability and the action probability is computed.
Here π_mix is the historical mixture policy, P^{π_mix}(s_{t+1} | s_t, a_t) is the probability that, given policy π_mix, the agent transitions to s_{t+1} after performing a_t, and P^{π_mix}(a_t | s_t) is the probability that, given policy π_mix, the agent performs a_t. In a concrete computation these probabilities can be replaced by frequencies, i.e. P^{π_mix}(a_t | s_t) ≈ N(a_t)/N, where N is the number of interactions of the current agent with the environment, N(a_t) is the number of times action a_t has been executed, and N(s_{t+1}, a_t) is the number of times a transition to s_{t+1} followed the execution of a_t. Correspondingly, P^{π_mix}(s_{t+1} | s_t, a_t) ≈ N(s_{t+1}, a_t)/N(a_t).
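The frequency-based approximation above can be implemented with simple visitation counters, as in the sketch below. The exact closed form of b_t in the original filing is rendered as a formula image and is not recoverable from this text, so the expression inside exploration_bonus (negative log of the product of the two empirical probabilities) is an assumption, not the patented formula; states and actions are assumed to be hashable (e.g. tuple-encoded).

```python
import math
from collections import defaultdict

class CountModel:
    """Empirical frequency estimates of P(a_t | s_t) and P(s_{t+1} | s_t, a_t)
    under the historical mixture policy pi_mix, following the counts described above."""
    def __init__(self):
        self.N = 0                    # total number of agent-environment interactions
        self.N_a = defaultdict(int)   # N(a_t): times action a_t was executed
        self.N_sa = defaultdict(int)  # N(s_{t+1}, a_t): times a_t was followed by s_{t+1}

    def update(self, a_t, s_next):
        self.N += 1
        self.N_a[a_t] += 1
        self.N_sa[(s_next, a_t)] += 1

    def p_action(self, a_t) -> float:             # P^{pi_mix}(a_t | s_t) ~ N(a_t) / N
        return self.N_a[a_t] / max(self.N, 1)

    def p_transition(self, s_next, a_t) -> float:  # P^{pi_mix}(s_{t+1} | s_t, a_t) ~ N(s_{t+1}, a_t) / N(a_t)
        return self.N_sa[(s_next, a_t)] / max(self.N_a[a_t], 1)

def exploration_bonus(counts: CountModel, a_t, s_next, eps: float = 1e-8) -> float:
    # Assumed functional form: rarer action choices and rarer transitions yield a larger bonus.
    return -math.log(counts.p_transition(s_next, a_t) * counts.p_action(a_t) + eps)
```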
Correcting the exploration rewards according to extreme novel state preferences, updating the collected data, including:
s301, correcting exploration rewards based on extreme novel state preference;
s302, updating rewards obtained by the corresponding agent model in the collected data until all data in the data collection list are updated.
In the present invention, b_t is corrected based on the extreme novel state preference to obtain the corrected exploration reward b̃_t. Here 1[·] denotes the indicator function, N(s_{t+1}) is the number of times state s_{t+1} has been visited, and β ∈ [0,1] is a hyperparameter; the physical meaning of the correction is to encourage the agent to visit states that have never been visited before.
The reward in each quintuple (s_t, s_{t+1}, a_t, r_t, d_t) is then updated with the corrected exploration reward; in the present invention this update can be repeated 128 × 128 times, until all data in the data collection list have been updated.
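A sketch of this correction step is given below. Because the exact formula is not fully recoverable from the text, the sketch assumes one simple realization consistent with its stated meaning: a never-visited state keeps the full bonus, while a previously visited state has it damped by β. Both the damping rule and the default value of β are assumptions.

```python
from collections import defaultdict

state_visits = defaultdict(int)   # N(s_{t+1}): how many times each (hashable) state has been visited

def corrected_bonus(b_t: float, s_next, beta: float = 0.5) -> float:
    """Apply the extreme-novelty state preference to the exploration reward b_t.
    Assumed realization: states with N(s_{t+1}) == 0 keep the full bonus,
    previously visited states have it scaled by beta in [0, 1]."""
    never_visited = state_visits[s_next] == 0     # indicator 1[N(s_{t+1}) = 0]
    state_visits[s_next] += 1
    return b_t if never_visited else beta * b_t

# The corrected value replaces the reward stored in each quintuple
# (s_t, s_{t+1}, a_t, r_t, d_t) before the agent parameters are updated.
```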
Updating parameters of the agent model based on the updated collected data, the policy network updating with a policy gradient, comprising:
s401, pulling data from a data list of the collected related data;
s402, updating parameters of a strategy network and a value network of the intelligent agent model by using all data in the updated data list;
s403, updating parameters of the intelligent agent model through the updated data through an actor-critic reinforcement learning algorithm. UpdatingPredictor, & gt>Predictor, N(s) t+1 ) Predictor, N predictor.
In the invention, all quintuples are collected and the agent parameters are updated from the updated data according to an actor-critic algorithm. The policy network is updated with a policy gradient, which takes the standard actor-critic form
∇_θ J(θ) = (1/M) Σ_t ∇_θ log p_θ(a_t | s_t) · (R_t − V^π(s_t)),
where the policy network is denoted π, θ is the corresponding neural-network parameter, ∇_θ J(θ) is the policy gradient, V^π(s_t) is the value estimate of state s_t, p_θ(a_t | s_t) is the probability of selecting action a_t in state s_t, R_t is the return computed from the (corrected) rewards, and M is the size of the training data. The value-network loss function is the mean squared error between the return and the value estimate, L_V = (1/M) Σ_t (R_t − V^π(s_t))².
Here the Actor and the Critic are two different neural networks: the Actor predicts the probability of each action, while the Critic predicts the value of the current state.
The actor-critic approach combines the Policy Gradient (Actor) and Function Approximation (Critic) methods. The Actor selects an action according to a probability distribution, and the Critic (which may be value-based, e.g. trained with Q-learning) estimates the value of each state. The difference between the value of the next state and the value of the current state is the TD-error, which tells the Actor how strongly to update: if the TD-error is positive, the probability of the chosen action is increased more; if it is negative, the update amplitude of the Actor is reduced. In other words, the Critic scores the behavior produced by the Actor, and the Actor modifies the probability of selecting that behavior according to the Critic's score.
The Actor refers to the policy function π_θ(a|s), i.e. a policy is learned so as to obtain as much return as possible.
The Critic refers to the value function V^π(s); it estimates the value function of the current policy, that is, it evaluates how good the Actor is.
With the help of the value function, the actor-critic algorithm can update its parameters at every single step, without waiting until the end of the episode.
Because the Actor-Critic combines Policy Gradients with a value-based Critic, the Critic learns the relationship between the environment and the rewards and can see the potential reward of the Actor's current state, so it can guide the Actor to update at every step; with pure Policy Gradients, the Actor could only update once per episode. Single-step updates are therefore faster than the conventional Policy Gradient.
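The actor-critic update described above can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: a categorical policy head, a single optimizer holding both networks' parameters, and the stored (corrected) returns R_t used as targets; none of these details are specified in the filing.

```python
import torch
import torch.nn as nn

def actor_critic_update(policy_net: nn.Module, value_net: nn.Module,
                        optimizer: torch.optim.Optimizer,
                        states: torch.Tensor, actions: torch.Tensor,
                        returns: torch.Tensor) -> None:
    """One actor-critic update over a batch of M transitions.
    `returns` holds the (exploration-corrected) returns R_t; the optimizer is
    assumed to contain the parameters of both networks."""
    values = value_net(states).squeeze(-1)                  # V^pi(s_t): critic's value estimates
    advantages = returns - values.detach()                  # TD-error / advantage scaling the actor update

    log_probs = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)        # log p_theta(a_t | s_t)

    policy_loss = -(log_probs * advantages).mean()          # negative of the policy-gradient objective
    value_loss = (returns - values).pow(2).mean()           # squared error of the value network

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```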
Clearing the data collection list, saving the model parameters, and continuing to train the agent model until the iterations are complete, so as to perform deep reinforcement learning exploration of a Markov decision process, comprises the following steps:
S501, clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
S502, continuing to train the agent model until the iterations are complete.
In the present invention, the P^{π_mix}(s_{t+1} | s_t, a_t) and P^{π_mix}(a_t | s_t) predictors and the N(s_{t+1}) and N predictors are also updated from the data, and the above steps are repeated until the termination condition is reached.
Specifically, the data in the data collection list are cleared and the agent-model parameter update process is repeated 10^3 times; a version of the parameters of the policy network and the value network of the agent is then saved, together with the parameters of the P^{π_mix}(s_{t+1} | s_t, a_t) predictor, the P^{π_mix}(a_t | s_t) predictor, the N(s_{t+1}) predictor, and the N predictor.
Training of the agent continues until the iterations are complete, with the total amount of collected data exceeding 5 × 10^7 transitions.
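Putting the steps together, the outer training loop alternates rollout collection, reward correction, parameter updates, and periodic saving. The sketch below reuses the illustrative helpers from the earlier snippets; the checkpoint cadence and stopping criterion loosely follow the figures quoted above (a saved version after every 10^3 updates, roughly 5 × 10^7 transitions in total), but every function name remains an assumption. The policy object is assumed to expose both an act method for rollouts and a logits-producing forward pass for the update.

```python
import torch

def to_tensors(data):
    """Convert Transition quintuples to batched tensors (assumes numerically
    encodable states; real image observations would need an encoder).
    As a simplification, the corrected rewards are used directly as targets."""
    states = torch.tensor([tr.s_t for tr in data], dtype=torch.float32)
    actions = torch.tensor([tr.a_t for tr in data], dtype=torch.long)
    returns = torch.tensor([tr.r_t for tr in data], dtype=torch.float32)
    return states, actions, returns

def train(envs, policy, value_net, optimizer, counts,
          total_transitions: int = 50_000_000, save_every: int = 1_000):
    """Outer loop: collect data, compute and correct exploration rewards,
    update the networks, clear the buffer, and periodically save parameters."""
    collected, updates = 0, 0
    while collected < total_transitions:
        data = collect_rollouts(envs, policy)                      # S100: interact and collect
        for tr in data:
            counts.update(tr.a_t, tr.s_next)
            b_t = exploration_bonus(counts, tr.a_t, tr.s_next)     # S200: exploration reward
            tr.r_t = corrected_bonus(b_t, tr.s_next)               # S300: extreme-novelty correction
        actor_critic_update(policy, value_net, optimizer,
                            *to_tensors(data))                     # S400: actor-critic update
        collected += len(data)                                     # S500: clear buffer, save, iterate
        data.clear()
        updates += 1
        if updates % save_every == 0:
            torch.save({"policy": policy.state_dict(),
                        "value": value_net.state_dict()},
                       f"checkpoint_{updates}.pt")                 # illustrative checkpointing, not from the filing
```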
The invention provides a deep reinforcement learning exploration method based on extreme novelty search that takes the action entropy into account in the exploration objective, rather than using it only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Referring to FIG. 7, the invention also discloses a deep reinforcement learning exploration system based on extreme novelty search, the system comprising:
the data collection module 110 is used for initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
a search reward calculation module 120 for calculating a search reward based on the conditional state probability and the action probability based on the collected related data;
a reward update module 130 for modifying the exploration rewards according to extreme novel state preferences, updating the collected data;
the model parameter updating module 140 is configured to update parameters of the agent model based on the updated collected data, and update the policy network with a policy gradient;
the iteration module 150 is used for emptying the data collection list, saving model parameters, continuously training the artificial intelligent agent model until iteration is completed, and performing deep reinforcement learning exploration of the Markov decision process.
The data collection module randomly samples a simulation environment from the environment set through the simulator, gives an initial state, takes the initial state as input, and outputs related actions by the strategy network of the intelligent body model;
the simulator takes the initial state and the output related action as input, outputs the state at the next moment, and the intelligent body model obtains rewards and termination identifiers;
the strategy network of the intelligent agent model takes the next moment state as input and outputs new related actions;
and repeating the actions until reaching the termination condition, and randomly sampling a simulation environment from the environment set by the simulator to realize the next interaction of the intelligent body model.
The exploration rewards calculation module is used for pulling data from a data list of the collected related data;
the exploration rewards based on the conditional state probabilities and the action probabilities are calculated by pulling the data.
The reward updating module corrects the exploration rewards based on the extreme novel state preference;
and updates the rewards obtained by the agent model in the corresponding collected data, until all data in the data collection list have been updated.
The model parameter updating module is used for pulling data from the data list of collected data;
updating the parameters of the policy network and the value network of the agent model with all data in the updated data list;
and updating the parameters of the agent model from the updated data using an actor-critic reinforcement learning algorithm.
The iteration module is used for clearing the data in the data collection list, repeating the parameter update operation of the agent model, and saving the parameters of the policy network and the value network of the agent model;
training the agent model continues until the iteration is complete.
The invention provides a deep reinforcement learning exploration system based on extreme novelty search that takes the action entropy into account in the exploration objective, rather than using it only as a regularization term of the optimization objective of the agent's policy network; an extreme novel state preference is also introduced, which encourages the agent to preferentially explore states that have never been visited, and can thereby significantly improve the exploration efficiency of a reinforcement learning agent in a contextual Markov decision process environment.
Fig. 8 illustrates a physical structure diagram of an electronic device, as shown in fig. 8, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a deep reinforcement learning exploration method based on extreme novelty searches, the method comprising: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can perform a deep reinforcement learning exploration method based on extreme novelty search provided by the above methods, where the method includes: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
In yet another aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a deep reinforcement learning exploration method based on extreme novelty searching provided by the above methods, the method comprising: initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The deep reinforcement learning exploration method based on the extreme novelty search is characterized by comprising the following steps of:
initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
calculating an exploration reward based on the conditional state probability and the action probability based on the collected related data;
correcting the exploration rewards according to extreme novel state preferences, and updating collected data;
updating parameters of the intelligent agent model based on the updated collected data, and updating a strategy network by adopting strategy gradients;
and clearing the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of a Markov decision process.
2. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said initializing model parameters and simulation environment, interacting a preset agent model with the simulation environment, collecting data, comprises:
randomly sampling a simulation environment from the environment set through a simulator, giving an initial state, taking the initial state as input by a strategy network of the agent model, and outputting a related action;
the simulator takes the initial state and the output related action as input, outputs the state at the next moment, and the agent model obtains a reward and a termination identifier;
the strategy network of the agent model takes the next-moment state as input and outputs a new related action;
and repeating the above actions until reaching the termination condition, and randomly sampling a simulation environment from the environment set by the simulator to realize the next interaction of the agent model.
3. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said computing exploration rewards based on conditional state probabilities and action probabilities based on collected relevant data comprises:
pulling data from a data list of the collected related data;
the exploration rewards based on the conditional state probabilities and the action probabilities are calculated by pulling the data.
4. The extreme novelty search-based deep reinforcement learning exploration method of claim 1, wherein said revising said exploration rewards according to extreme novelty state preferences, updating collected data, comprises:
revising the exploration rewards based on the extreme novel state preferences;
updating rewards obtained by the corresponding agent model in the collected data until all data in the data collection list are updated.
5. The deep reinforcement learning exploration method based on extreme novelty search of claim 1, wherein said updating parameters of an agent model based on updated collected data, updating a policy network with a policy gradient, comprises:
pulling data from a data list of the collected related data;
updating parameters of a strategy network and a value network of the intelligent agent model by using all data in the updated data list;
and updating parameters of the intelligent agent model with the updated data by an actor-critic reinforcement learning algorithm.
6. The method of claim 1, wherein the emptying the data collection list, saving model parameters, continuously training the artificial intelligence agent model until iteration is completed, and performing a deep reinforcement learning exploration of a markov decision process, comprising:
clearing data in the data collection list, repeating the parameter updating operation of the agent model, and storing parameters of a strategy network and a value network of the agent model;
training the agent model continues until the iteration is complete.
7. A deep reinforcement learning exploration system based on extreme novelty searching, the system comprising:
the data collection module is used for initializing model parameters and a simulation environment, interacting a preset agent model with the simulation environment, and collecting data;
the exploration rewards calculation module is used for calculating exploration rewards based on the conditional state probability and the action probability based on the collected related data;
the reward updating module is used for correcting the exploration rewards according to the extreme novel state preference and updating the collected data;
the model parameter updating module is used for updating the parameters of the intelligent agent model based on the updated collected data, and updating the strategy network by adopting strategy gradients;
and the iteration module is used for emptying the data collection list, saving model parameters, continuously training the agent model until iteration is completed, and performing deep reinforcement learning exploration of the Markov decision process.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the deep reinforcement learning exploration method based on extreme novelty search of any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the extreme novelty search-based deep reinforcement learning exploration method of any of claims 1 to 6.
CN202311264592.3A 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search Active CN117150927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311264592.3A CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311264592.3A CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Publications (2)

Publication Number Publication Date
CN117150927A true CN117150927A (en) 2023-12-01
CN117150927B CN117150927B (en) 2024-04-02

Family

ID=88910092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311264592.3A Active CN117150927B (en) 2023-09-27 2023-09-27 Deep reinforcement learning exploration method and system based on extreme novelty search

Country Status (1)

Country Link
CN (1) CN117150927B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962390A (en) * 2021-12-21 2022-01-21 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114004370A (en) * 2021-12-28 2022-02-01 中国科学院自动化研究所 Method for constructing regional sensitivity model based on deep reinforcement learning network
CN114528766A (en) * 2022-02-21 2022-05-24 浙江大学 Multi-intelligent hybrid cooperative optimization method based on reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962390A (en) * 2021-12-21 2022-01-21 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114004370A (en) * 2021-12-28 2022-02-01 中国科学院自动化研究所 Method for constructing regional sensitivity model based on deep reinforcement learning network
CN114528766A (en) * 2022-02-21 2022-05-24 浙江大学 Multi-intelligent hybrid cooperative optimization method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PEI XU et al.: "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 26 June 2023 (2023-06-26), pages 11717-11725 *

Also Published As

Publication number Publication date
CN117150927B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112668128B (en) Method and device for selecting terminal equipment nodes in federal learning system
Stolle et al. Learning options in reinforcement learning
CN111191934A (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
CN111768028B (en) GWLF model parameter adjusting method based on deep reinforcement learning
CN104657626A (en) Method for establishing protein-protein interaction network by utilizing text data
Pan et al. Stochastic generative flow networks
CN115526317A (en) Multi-agent knowledge inference method and system based on deep reinforcement learning
CN112613608A (en) Reinforced learning method and related device
CN113947022B (en) Near-end strategy optimization method based on model
Yuan et al. Euclid: Towards efficient unsupervised reinforcement learning with multi-choice dynamics model
Hu et al. GLSO: grammar-guided latent space optimization for sample-efficient robot design automation
Wu et al. Models as agents: optimizing multi-step predictions of interactive local models in model-based multi-agent reinforcement learning
CN117150927B (en) Deep reinforcement learning exploration method and system based on extreme novelty search
Zhang et al. Brain-inspired experience reinforcement model for bin packing in varying environments
CN111190711B (en) BDD combined heuristic A search multi-robot task allocation method
CN113326884A (en) Efficient learning method and device for large-scale abnormal graph node representation
Qian et al. Deep learning for a low-data drug design system
CN116502779A (en) Traveling merchant problem generation type solving method based on local attention mechanism
CN115022231B (en) Optimal path planning method and system based on deep reinforcement learning
Ho et al. Adaptive communication for distributed deep learning on commodity GPU cluster
Fortier et al. Parameter estimation in Bayesian networks using overlapping swarm intelligence
Lu et al. Sampling diversity driven exploration with state difference guidance
Ueda et al. Particle filter on episode for learning decision making rule

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant