CN113962390A - Method for constructing diversified search strategy model based on deep reinforcement learning network - Google Patents

Method for constructing diversified search strategy model based on deep reinforcement learning network Download PDF

Info

Publication number
CN113962390A
Authority
CN
China
Prior art keywords
virtual
state
agent
search
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111565916.8A
Other languages
Chinese (zh)
Other versions
CN113962390B (en)
Inventor
黄凯奇
尹奇跃
张俊格
徐沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111565916.8A priority Critical patent/CN113962390B/en
Publication of CN113962390A publication Critical patent/CN113962390A/en
Application granted granted Critical
Publication of CN113962390B publication Critical patent/CN113962390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/10 - Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for constructing a diversified search strategy model based on a deep reinforcement learning network. Based on the setting of the weight of the virtual reward, different agents are made to visit different states: once a certain agent has fallen into a misleading reward, and another agent again visits the series of states leading to that misleading reward, the weight is negative, so the virtual reward signal obtained by the other agent is negative, forcing it to stop visiting the states that lead to the misleading reward. Different agents are thereby guaranteed to visit different sets of states, so that the updated search strategy model, once trained, can find the second target position corresponding to the global optimum. This effectively solves the technical problem in the prior art that the global optimum cannot be found when searching high-dimensional data because the search is trapped by a misleading reward, and reduces the probability that an agent falls into a local solution because of a misleading reward.

Description

Method for constructing diversified search strategy model based on deep reinforcement learning network
Technical Field
The disclosure relates to the field of deep reinforcement learning and the technical field of image processing, in particular to a method for constructing a model of diversified search strategies based on a deep reinforcement learning network.
Background
With the development of artificial intelligence technology, deep reinforcement learning methods have been proposed for decision making in complex scenes. Deep learning (DL) is a machine learning approach that performs representation learning on data. Reinforcement learning (RL) learns an optimal strategy by interacting with and exploring an unknown environment. Deep reinforcement learning (DRL) is an artificial intelligence method that combines the perception capability of deep learning with the decision-making capability of reinforcement learning; it can produce control directly from raw input information and is closer to the way humans think.
Deep reinforcement learning has become a popular way to train an agent to perform complex tasks. It trains the agent by maximizing a reward signal, and most of its successes have been achieved in scenes where the reward signal is well designed and sufficiently dense. In many environments, however, the reward signal is very sparse from the agent's point of view. When rewards are dense, the agent can easily find them by taking random actions; when rewards are sparse, they can hardly be expected to be found by random search, and without a reward signal the deep reinforcement learning algorithm cannot update its strategy. In sparse-reward scenarios the agent must therefore have the ability to explore, so the exploration problem in deep reinforcement learning has extremely important research and application value.
However, conventional exploration methods for deep reinforcement learning have difficulty handling misleading rewards in scenes with high-dimensional data input (such as environments whose states are images or high-dimensional vectors); a misleading reward prevents the agent from obtaining a higher long-term return and finally causes the agent to be trapped in a local solution.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, embodiments of the present disclosure provide a method for building a model of a diversified search strategy based on a deep reinforcement learning network.
In a first aspect, an embodiment of the present disclosure provides a method for building a diversified search strategy model based on a deep reinforcement learning network. The method comprises the following steps: acquiring search data obtained by a plurality of agents in an initialization state searching an image simulation environment, wherein the image simulation environment comprises: a first target position corresponding to a local optimum and a second target position corresponding to a global optimum; generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the position state information in the search data and an initialized virtual reward model, wherein the weight of the virtual reward is a negative value when the search data indicates that one of the plurality of agents is in the first target position; updating reward information in the search data according to the virtual rewards and the weights for the virtual rewards, and correspondingly updating the search strategy model of each agent and the virtual reward model; and continuing to train the updated search strategy model according to the updated search data and the virtual reward model until a training end condition is reached, the trained search strategy model being used as an image search model capable of locating the second target position.
According to an embodiment of the present disclosure, the virtual reward model comprises: a virtual reward generator and a discriminator; wherein the virtual reward generator is configured to encourage the agent to visit image position states with relatively few historical visits, and the discriminator is used to determine the probability of the plurality of agents visiting a specific image position state.
According to an embodiment of the present disclosure, the search data is a time-ordered sequence of data sets for each agent, and the data set at each time in the sequence comprises: the current state, the current search action for the current state, the next-moment state obtained by performing the current search action in the current state, and the current reward information. Wherein generating the corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model comprises: for the data set at each time, performing the following steps: inputting the next-moment state into the virtual reward generator and outputting the virtual reward corresponding to the next-moment state; inputting the next-moment state into the discriminator and outputting the probability of the next-moment state being visited by each agent; and generating the weight for the virtual reward according to the probability of the next-moment state being visited by the current agent and the average access probability.
According to an embodiment of the present disclosure, the total number of agents is N, and the probability of the next-moment state being visited by the current agent is expressed as q_φ(z | s_{t+1}), wherein z represents the number of the current agent, the value of z is 1, 2, 3, ..., N, and s_{t+1} represents the next-moment state; the average access probability is 1/N.

Wherein the weight α_t of the virtual reward satisfies an expression defined in terms of the probability q_φ(z | s_{t+1}) and the average access probability 1/N.
according to an embodiment of the present disclosure, the updating reward information in the search data according to the virtual reward and the weight for the virtual reward includes: performing weighted calculation on the virtual rewards and the weights aiming at the virtual rewards correspondingly to obtain virtual reward information; and adding and calculating the reward information and the virtual reward information in the search data to obtain updated reward information.
According to an embodiment of the present disclosure, updating the search strategy model of the agent comprises: taking the search data containing the updated reward information for each agent as the input of the current agent's search strategy model, and updating the parameters of the search strategy model based on an actor-critic algorithm in the deep reinforcement learning network. The search strategy model comprises a policy network and a value network; the input of the policy network is the current state and its output is the current search action for the current state, while the value network is used to predict, from the current state, the probability of completing the search task. Updating the parameters of the search strategy model comprises updating the parameters of the policy network and of the value network.
According to an embodiment of the present disclosure, the discriminator comprises a neural network model q_φ, and the virtual reward generator comprises: a target network g whose parameters are randomly initialized and then fixed, and a prediction network f whose parameters are trainable.

Wherein updating the virtual reward model comprises: taking the state information in the updated search data as the input of the discriminator and updating the parameters of the discriminator based on a first loss function; and taking the state information in the updated search data as the input of the virtual reward generator and updating the parameters of the virtual reward generator based on a second loss function.

Wherein the first loss function, denoted L_d, is the cross-entropy loss

L_d = -(1/M) Σ_{i=1}^{M} log q_φ(z_i | s_i),

wherein M represents the total number of training data, and the neural network model q_φ of the discriminator takes a state s as input and outputs the probability q_φ(z | s) that the state s belongs to the z-th agent, z taking the values 1, 2, 3, ..., N, with N the total number of agents;

wherein the second loss function, denoted L_f, is the prediction error of the prediction network with respect to the fixed target network

L_f = (1/M) Σ_{i=1}^{M} || f(s_i) - g(s_i) ||².
according to an embodiment of the present disclosure, the acquiring search data for searching the image simulation environment by a plurality of agents in the initialization state includes: a current state given by the image simulation environment for each agent of the plurality of agents in the initialization states t The current agent output and the current state are used as the input of the current agents t Corresponding search actiona t (ii) a The image simulation environment is based on the current states t And corresponding search actiona t Output the state of the next times t+1The reward information obtained by the current intelligent agentr t And termination identificationSymbold t (ii) a Iteration is carried out based on the time sequence to obtain a data group sequence distributed according to the time sequence for each agent, wherein the data group sequence is in a six-tuple form: (s t a t r t d t s t+1Z), wherein z represents the number of the agent, the value of z is 1,2,3, … …, N, and N represents the total number of the agent.
In a second aspect, embodiments of the present disclosure provide an electronic device. The electronic equipment comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; a memory for storing a computer program; and the processor is used for realizing the method for constructing the model of the diversified search strategy based on the deep reinforcement learning network when executing the program stored in the memory.
In a third aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for building a model of a diversified search strategy based on a deep reinforcement learning network as described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:
generating corresponding virtual rewards and weights for the virtual rewards for a plurality of agents according to the position state information in the search data and an initialized virtual reward model, wherein the weight of the virtual reward is a negative value when the search data indicates that one of the agents is located at the first target position corresponding to the local optimum. Based on this setting of the weight of the virtual reward, different agents are made to visit different states. Once a certain agent has fallen into a misleading reward (a misleading reward is obtained by visiting the state that produces it, for example the state corresponding to the first target position in the image simulation environment), and another agent again visits the series of states leading to that misleading reward, the weight is negative, so the virtual reward signal obtained by the other agent is negative, forcing it to stop visiting the series of states leading to the misleading reward. Different agents are thereby guaranteed to visit different sets of states, so that the updated search strategy model can, after training, find the second target position corresponding to the global optimum. This effectively solves the technical problem in the prior art that, when searching high-dimensional data (for example 3D image data, actual scene data, and the like), the global optimum cannot be found because the search is trapped by a misleading reward, and it reduces the probability that an agent falls into a local solution because of a misleading reward.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly described below; it is obvious that other drawings can be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 schematically illustrates a flow diagram of a method of building a model of a diversified search strategy based on a deep reinforcement learning network, in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates an image simulation environment according to an embodiment of the present disclosure, where (a) is a perspective view of a 3D image simulation environment and (b) is a top view of the 3D image simulation environment;
fig. 3 schematically shows a detailed implementation process diagram of step S110 according to an embodiment of the present disclosure;
FIG. 4 schematically shows the structure of the discriminator according to an embodiment of the present disclosure;
fig. 5 schematically illustrates the process of updating the reward information in step S120 and step S130 according to an embodiment of the disclosure;
FIG. 6A schematically shows the results of a target search according to the prior art;
FIG. 6B schematically shows the result of target search by the image search model constructed according to the method provided by the embodiment of the disclosure; and
fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
A first exemplary embodiment of the present disclosure provides a method of building a diversified search strategy model based on a deep reinforcement learning network.
Fig. 1 schematically shows a flowchart of a method of building a model of a diversified search strategy based on a deep reinforcement learning network according to an embodiment of the present disclosure.
Referring to fig. 1, a method for building a model of a diversified search strategy based on a deep reinforcement learning network according to an embodiment of the present disclosure includes the following steps: s110, S120, S130 and S140.
In step S110, search data for searching an image simulation environment by a plurality of agents in an initialization state is acquired, where the image simulation environment includes: a first target position corresponding to the local optimum and a second target position corresponding to the global optimum.
FIG. 2 schematically shows an image simulation environment according to an embodiment of the present disclosure, where (a) is a perspective view of a 3D image simulation environment and (b) is a top view of the 3D image simulation environment.
Referring to FIG. 2 (a) and (b), the image simulation environment is, for example, a 3D image simulation environment. The 3D image simulation environment may simulate a virtual environment, for example the environment in a game interface (such as a three-dimensional maze), or it may simulate a real environment (for example, a fire-rescue scene containing articles of different degrees of importance). The device where the agent is located, or the agent itself, can sense the surrounding environment (a real environment or the environment in a virtual interface) through sensors, and the image simulation environment is obtained by simulation from the sensed data.
In the 3D image simulation environment, two targets are included as an example; the specific number of each target is not limited. The targets are illustrated in FIG. 2(b) as five-pointed stars. One is the first target Goal1 corresponding to the local optimum, located at the first target position in the 3D image simulation environment; the other is the second target Goal2 corresponding to the global optimum, located at the second target position in the 3D image simulation environment. The first target position of the first target and the second target position of the second target may be static over time or may vary dynamically with time.
When the intelligent agent is in an initialization state, the parameters in the search strategy model of the intelligent agent are initialization values. In the embodiment of the present disclosure, the agent refers to a program or an entity capable of sensing the environment through the sensor and acting on the environment through the actuator, and may be, for example, an application program: taking the state as input and the action as output; it may also be an electronic device installed with the above application program, such as an intelligent robot (e.g. a search and rescue robot) with a sensor (for detecting the environment) or other intelligent devices.
The agent interacts with the image simulation environment: the agent inputs its current state in the image simulation environment into the initialized search strategy model and obtains as output the search action to be executed; the image simulation environment then derives, from the agent's current state (such as its current position) and search action, the agent's state at the next moment in the image simulation environment and the reward information. Iterating in time order yields search data for each of the plurality of agents, the search data at least comprising position state information (corresponding to the current state and the next-moment state described above) and reward information.
In step S120, corresponding virtual rewards and weights for the virtual rewards are generated for the plurality of agents according to the location state information in the search data and the initialized virtual reward model, wherein the weights for the virtual rewards are negative values when the search data indicates that one of the plurality of agents is in the first target location.
The parameters of the initialized virtual reward model (e.g., a neural network model) are initialization values. When the search data indicates that the agent is close to the first target position, the weight of the virtual reward is a negative value; setting the weight of the virtual reward in this way makes different agents visit different states. Once a particular agent has been trapped by a misleading reward (a misleading reward is obtained by visiting the state that produces it, e.g., the state corresponding to the first target position in the image simulation environment), then when another agent again visits the series of states leading to that misleading reward, the weight is negative, so the virtual reward signal obtained by the other agent is negative; the agent is thus forced to stop visiting the series of states leading to the misleading reward, which ensures that different agents visit different sets of states.
In step S130, the reward information in the search data is updated according to the virtual reward and the weight for the virtual reward, and the search policy model of the agent and the virtual reward model are updated accordingly.
Updating the reward information in the search data according to the virtual reward and the weight for the virtual reward, and correspondingly updating the agent's search strategy model and the virtual reward model, dynamically adjusts the search direction (one concrete embodiment of the search strategy) through the sign of the weight, so that the plurality of agents are not confined to the local solution. The virtual rewards and their weights guide the search direction on top of the original reward information; when the weight is negative, the virtual reward negatively regulates the reward information in the search data, so that the agent may adopt a strategy opposite to the movement strategy it previously adopted (for example, moving away from the first target position and gradually approaching the second target position instead of approaching the first target position).
In step S140, the updated search strategy model is continuously trained according to the updated search data and the virtual reward model until the training end condition is reached, and the trained search strategy model is used as an image search model that can be positioned to the second target position.
The training end condition includes: the data volume reaches the preset number, or the training time reaches the set value, and the like.
Based on the above steps S110-S140, corresponding virtual rewards and weights for the virtual rewards are generated for the plurality of agents according to the position state information in the search data and the initialized virtual reward model, and the weight of the virtual reward is a negative value when the search data indicates that one of the agents is located at the first target position corresponding to the local optimum. This setting of the weight of the virtual reward makes different agents visit different states: once a certain agent has fallen into a misleading reward, and another agent again visits the series of states leading to that misleading reward, the negative weight makes the virtual reward signal obtained by the other agent negative, forcing it to stop visiting those states. Different agents therefore visit different sets of states, and the updated search strategy model can, after training, find the second target position corresponding to the global optimum. This effectively solves the technical problem in the prior art that the global optimum cannot be found when searching high-dimensional data (such as 3D images) because the search is trapped by a misleading reward, and it reduces the probability that an agent falls into a local solution because of a misleading reward.
The specific implementation of the above steps will be described in detail below.
Fig. 3 schematically shows a detailed implementation process diagram of step S110 according to an embodiment of the present disclosure.
The total number of agents is N (N ≥ 2, N an integer). In FIG. 3, the current agent is assigned M (M ≥ 2, M an integer) image simulation environments, and only the interaction of one agent with its M image simulation environments (Cz1, Cz2, ..., CzM) is illustrated.
For example, referring to FIG. 3, the search data is a time-ordered sequence of data sets for each agent, and the data set at each time in the sequence comprises: the current state s_t, the current search action a_t for the current state, the next-moment state s_{t+1} obtained by performing the current search action in the current state, and the current reward information r_t. The search data of each agent carries the agent's serial number identifier and a termination identifier.
According to an embodiment of the present disclosure, referring to FIG. 3, acquiring in step S110 the search data obtained by the plurality of agents in the initialization state searching the image simulation environment comprises: for each agent of the plurality of agents in the initialization state, the current state s_t given by the image simulation environment (corresponding to the position of the agent in the image simulation environment) is used as the input of the current agent, and the current agent outputs the search action a_t corresponding to the current state s_t; the image simulation environment outputs, according to the current state s_t and the corresponding search action a_t, the next-moment state s_{t+1}, the reward information r_t obtained by the current agent, and the termination identifier d_t; and iterating based on the time sequence yields, for each agent, a time-ordered sequence of data sets, each data set being a six-tuple (s_t, a_t, r_t, d_t, s_{t+1}, z), wherein z represents the number of the agent, the value of z is 1, 2, 3, ..., N, and N represents the total number of agents.
The initial time t takes the value 0, corresponding, for example, to the state s_0 illustrated in FIG. 2(b).
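A minimal sketch of this data-collection step is given below, assuming a Gym-style environment interface (reset/step) and a policy object with an act method; these names are illustrative assumptions and not part of the patent:

```python
# Illustrative sketch of the six-tuple collection described above.
# The environment API (reset/step) and the policy.act call are assumptions,
# not part of the patent; any simulator returning image states would do.

def collect_search_data(envs, policy, agent_id, rollout_len=128):
    """Roll out one agent in its parallel image-simulation environments and
    return time-ordered six-tuples (s_t, a_t, r_t, d_t, s_{t+1}, z)."""
    data = []
    states = [env.reset() for env in envs]            # initial states s_0
    for t in range(rollout_len):
        for i, env in enumerate(envs):
            a_t = policy.act(states[i])                # search action for s_t
            s_next, r_t, d_t, _ = env.step(a_t)        # one-step simulation
            data.append((states[i], a_t, r_t, d_t, s_next, agent_id))
            states[i] = env.reset() if d_t else s_next  # reset finished envs
    return data
```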
The detailed implementation process of step S120 is described below with reference to fig. 4 and 5.
FIG. 4 schematically shows a schematic structural diagram of an arbiter according to an embodiment of the present disclosure; fig. 5 schematically shows an implementation process of updating the bonus information in step S120 and step S130 according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the virtual reward model comprises: a virtual reward generator and a discriminator. The virtual reward generator is configured to encourage the agent to visit image position states with a relatively small number of historical visits, and the discriminator is used to determine the probability of the plurality of agents visiting a specific image position state.
The discriminator takes a state as input and outputs the probability of that state being visited by each agent; the probability output for the z-th agent is proportional to the number of times the z-th agent has visited the state in the historical access data (state data). The virtual reward generator takes a state as input and outputs a virtual reward; the virtual reward is inversely proportional to the number of times all agents have visited the state in the historical access data (state data).
Referring to FIGS. 4 and 5, according to an embodiment of the present disclosure, the discriminator includes a neural network model 410, and the virtual reward generator includes two neural network models: a target network 510 whose parameters are randomly initialized and then fixed, and a prediction network 520 whose parameters are trainable.
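The two components can be sketched as follows in PyTorch, assuming the virtual reward generator follows the random-network-distillation pattern implied by the fixed random target network 510 and the trainable prediction network 520; the layer sizes, feature dimension and class names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """q_phi: maps a state to a probability distribution over the N agents."""
    def __init__(self, state_dim, num_agents):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_agents))

    def forward(self, state):                  # state: (batch, state_dim)
        return torch.softmax(self.net(state), dim=-1)

class VirtualRewardGenerator(nn.Module):
    """Fixed random target network and trainable prediction network; the
    virtual reward is their prediction error on a visited state."""
    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))
        for p in self.target.parameters():     # target stays at its random init
            p.requires_grad_(False)

    def forward(self, state):
        err = (self.predictor(state) - self.target(state)).pow(2).mean(dim=-1)
        return err                             # large for rarely visited states
```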
In the step S120, generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model includes: for each time-instant data set, the following substeps are performed: s121, S122 and S123.
In the substep S121, the next time state is input into the virtual bonus generator, and the virtual bonus corresponding to the next time state is output.
For example, referring to FIG. 5, the next-moment state s_{t+1} is input into the virtual reward generator with initialized parameters, processed by the target network 510 and the prediction network 520 in their initialized state, and the virtual reward is obtained as output; in FIG. 5 the virtual reward is denoted b_e.
In the substep S122, the next time state is input to the arbiter, and the probability of the next time state being accessed by each agent is output.
For example, referring to FIGS. 4 and 5, the next-moment state s_{t+1} is input into the neural network model 410 in the discriminator, and after processing by the neural network model 410 the probability of the next-moment state being visited by the current agent is obtained.
In sub-step S123, a weight for the virtual reward is generated according to the probability of the next-moment state being visited by the current agent and the average access probability; the weight is described in FIG. 5 as a diversity factor because it contributes to increasing the diversity of the search strategy.
According to an embodiment of the present disclosure, the total number of agents is N, and the probability of the next-moment state being visited by the current agent is expressed as q_φ(z | s_{t+1}), wherein z represents the number of the current agent, the value of z is 1, 2, 3, ..., N, and s_{t+1} represents the next-moment state; the average access probability is 1/N.

The weight α_t of the virtual reward satisfies expression (1), which is defined in terms of the probability q_φ(z | s_{t+1}) and the average access probability 1/N.
according to an embodiment of the present disclosure, in the step S130, updating reward information in the search data according to the virtual reward and the weight for the virtual reward includes the following sub-steps: s131 and S132.
In substep S131, the virtual reward is weighted by its corresponding weight to obtain the virtual reward information.

In FIG. 5 the virtual reward information is denoted b_de.

In substep S132, the current reward information in the search data and the virtual reward information are added to obtain the updated reward information.

In FIG. 5 the updated reward information is denoted r_de.
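The two substeps can be combined as in the hedged sketch below, which reuses the Discriminator and VirtualRewardGenerator classes sketched above; the concrete weight formula alpha_t = q_phi(z | s_{t+1}) - 1/N is only one reading consistent with the description of expression (1) and is an assumption, not the patent's exact formula:

```python
import torch

def reshape_reward(r_t, s_next, z, generator, discriminator, num_agents):
    """Update one transition's reward with the weighted virtual reward."""
    with torch.no_grad():
        b_t = generator(s_next)                 # virtual reward b_t
        q = discriminator(s_next)[..., z]       # P(state visited by agent z)
    alpha_t = q - 1.0 / num_agents              # assumed form of the diversity weight
    return r_t + alpha_t * b_t                  # updated reward information
```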
According to an embodiment of the present disclosure, in step S130, correspondingly updating the search strategy model of the agent comprises: taking the search data containing the updated reward information of each agent as the input of the current agent's search strategy model, and updating the parameters of the search strategy model based on an actor-critic algorithm in the deep reinforcement learning network.
The search strategy model comprises a strategy network and a value network, wherein the input of the strategy network is the current state, and the output of the strategy network is the current search action aiming at the current state; the value network is used for predicting the probability of completing the search task according to the current state. The updating the parameters of the search strategy model comprises: and updating the parameters of the policy network and the value network.
In one embodiment, the policy network is updated using the policy gradient, which satisfies the following expression:

∇_θ J^z = (1/M) Σ_t [ r_t + V(s_{t+1}) - V(s_t) ] ∇_θ log π_θ(a_t | s_t)    (2),

wherein ∇_θ J^z represents the policy gradient of the z-th agent, M represents the total number of training data, π represents the policy network, θ represents the network parameters, z represents the number of the current agent with the value of z being 1, 2, 3, ..., N, V(s_t) represents the value estimate of the current state s_t corresponding to the current time t, V(s_{t+1}) represents the value estimate of the next-moment state s_{t+1} corresponding to the next time t+1, r_t represents the current reward information, and π_θ(a_t | s_t) represents the probability of selecting the search action a_t in the current state s_t.

The loss function L_V of the above value network satisfies the following expression:

L_V = (1/M) Σ_t [ r_t + V(s_{t+1}) - V(s_t) ]²    (3).
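A compact PyTorch sketch of this update follows, assuming a one-step advantage r_t + gamma*V(s_{t+1}) - V(s_t) with a conventional discount factor gamma and a single shared optimizer over both networks; these details are assumptions where the description leaves them open:

```python
import torch

def actor_critic_update(policy_net, value_net, optimizer, batch, gamma=0.99):
    """One update of the z-th agent's policy and value networks on a batch of
    (s_t, a_t, r_updated, d_t, s_next) transitions with reshaped rewards."""
    s, a, r, d, s_next = batch
    v = value_net(s).squeeze(-1)
    with torch.no_grad():
        v_next = value_net(s_next).squeeze(-1) * (1.0 - d)
    advantage = r + gamma * v_next - v                      # one-step advantage
    log_prob = torch.log_softmax(policy_net(s), dim=-1).gather(
        1, a.long().unsqueeze(-1)).squeeze(-1)              # log pi_theta(a_t | s_t)
    policy_loss = -(advantage.detach() * log_prob).mean()   # cf. expression (2)
    value_loss = advantage.pow(2).mean()                    # cf. expression (3)
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
```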
according to an embodiment of the present disclosure, the discriminator includes a neural network model
Figure 631209DEST_PATH_IMAGE030
The virtual prize generator comprises: target network with randomly initialized parameters and fixed parameters
Figure 830109DEST_PATH_IMAGE031
And parametric trainable predictive networks
Figure 295726DEST_PATH_IMAGE032
In step S130, the updating the virtual bonus model includes: updating the parameters of the discriminator based on a first loss function by using the state information in the updated search data as the input of the discriminator; and updating the parameters of the virtual reward generator based on a second loss function by taking the state information in the updated search data as the input of the virtual reward generator.
Wherein the first loss function is expressed as
Figure 554669DEST_PATH_IMAGE033
Figure 122047DEST_PATH_IMAGE034
The following expression is satisfied:
Figure 859059DEST_PATH_IMAGE035
(4),
wherein M represents the total number of training data, the neural network model of the discriminator
Figure 444761DEST_PATH_IMAGE036
In the state ofsTo input, output the statesProbability of belonging to z-th agent
Figure 874606DEST_PATH_IMAGE037
Z takes the value of 1,2,3, … …, and N represents the total number of agents;
wherein the second loss function is expressed as
Figure 162236DEST_PATH_IMAGE038
Figure 702939DEST_PATH_IMAGE039
The following expression is satisfied:
Figure 408727DEST_PATH_IMAGE040
(5)。
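A hedged sketch of the two updates, reusing the Discriminator and VirtualRewardGenerator classes from the earlier sketch (the discriminator's pre-softmax scores are read from its inner net attribute, an implementation detail of that sketch):

```python
import torch.nn.functional as F

def update_virtual_reward_model(discriminator, generator, d_opt, g_opt,
                                states, agent_ids):
    """Update the discriminator with the cross-entropy loss (expression (4))
    and the prediction network of the virtual reward generator with the
    prediction-error loss (expression (5)) on the collected states."""
    # Discriminator: classify which agent produced each state.
    logits = discriminator.net(states)                   # pre-softmax scores
    d_loss = F.cross_entropy(logits, agent_ids.long())   # -(1/M) sum log q_phi(z|s)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: move the predictor toward the fixed random target network.
    g_loss = generator(states).mean()                    # mean prediction error
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```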
the implementation process of the method for building a model based on a deep reinforcement learning network according to the present disclosure is described below with reference to a specific example.
The method comprises the following steps:
step a, initializing an intelligent agent, an image simulation environment, a discriminator and a virtual reward generator.
Specifically, the parameters of the policy networks and value networks of 5 (an example of N) agents are initialized. 5 × 32 (an example of N × M) image simulation environments are initialized. The parameters of the neural network model in the discriminator are initialized. The parameters of the virtual reward generator, consisting of the target network and the prediction network, are initialized. A data collection list is initialized. It should be noted that after each image simulation environment is initialized, it returns its initial state data (image data), i.e. the data of step 0.
And b, interacting the intelligent agent with the environment and collecting search data.
Specifically, step b may be implemented with the following sub-steps:
sub-step b-1, using 5 x 32 image simulation environments in parallel, assigns 32 image simulation environments (which may be subsequently simply referred to as environments) to each agent.
And a substep b-2, for one environment in the parallel environments, sending the state data of the current environment into a search strategy network of a corresponding agent to obtain action output (output search action) corresponding to the current state.
Sub-step b-3, sub-step b-2 is performed for all contexts.
And a sub-step b-4, each environment receives the action of the corresponding agent to perform one-step forward simulation, and returns the state data of the next step, the reward information and the termination identifier.
Sub-step b-5: the above sub-steps b-2 to b-4 are repeated 128 times (the time-sequence length), yielding for the 160 environments search data in six-tuple form (s_t, a_t, r_t, d_t, s_{t+1}, z), i.e. 160 trajectories of track length 128; this search data is the training data, with t taking values from 0 to 127 (inclusive), i.e. 128 sets of training data per trajectory.
It is noted that during this period, whenever the simulation of an environment terminates, that environment is reset (reinitialized) and the simulation continues.
And a substep b-6 of storing the search data in a data collection list.
And c, generating a virtual reward signal.
Specifically, step c can be implemented by the following substeps:
substep c-1, pulls training data from the data collection list.
Sub-step c-2: for one piece of search data (s_t, a_t, r_t, d_t, s_{t+1}, z) in the above training data (with t fixed), the next-moment state s_{t+1} is sent into the virtual reward generator to obtain the virtual reward b_t.
And d, generating the virtual reward weight.
Specifically, step d can be implemented by the following substeps:
substep d-1, pulls training data from the data collection list.
Sub-step d-2: for one piece of search data (s_t, a_t, r_t, d_t, s_{t+1}, z) in the above training data (with t fixed), the next-moment state s_{t+1} is sent into the discriminator to obtain the probability q_φ(z | s_{t+1}) that this state was generated by agent z; the weight α_t of the virtual reward is then calculated according to formula (1) above.
Sub-step d-3: the reward signal is updated as r_t(after update) = r_t(before update) + α_t × b_t.
Sub-step d-4: the above sub-steps c-1, c-2, d-1, d-2 and d-3 are carried out for every piece of data in the data collection list, i.e. they are performed 160 × 128 times, until all search data (also described as training data) in the data collection list have been updated.
Since the weight of the virtual reward is a negative value when the search data indicates that one of the agents is at the first target position, suppose, for example, that the visit track of agent 2 among the 5 agents is S0, S1, S2, S3, S, where S denotes a misleading state. Once agent 2 has visited this state S, the weight corresponding to the virtual reward becomes negative for the other agents 1, 3, 4 and 5 when they subsequently visit these states, so they receive a negative reward; this forces agents 1, 3, 4 and 5 to adjust their search strategies and avoid visiting the series of states leading to state S.
And e, updating the model parameters.
Specifically, step e can be implemented by the following substeps:
substep e-1, pulls training data from the data collection list.
A substep e-2 of using all the data in the data collection list and updating parameters of the policy network and the value network corresponding to the agent according to the agent number z in the data; updating parameters of the discriminator through cross entropy loss; for updating the parameters of the virtual reward generator including a target network with randomly initialized parameters and fixed parameters and a prediction network with trainable parameters, the specific updating method may refer to the description of the foregoing embodiments, and details are not repeated here.
And f, emptying the data collection list and storing the model parameters.
Specifically, step f can be implemented by the following substeps:
substep f-1, emptying the data in the data collection list.
Sub-step f-2: the process of steps b-e above is repeated a predetermined number of times (e.g., 10³ times), completing the update of one version of the parameters; the parameters of the policy networks and value networks of all agents are saved, the parameters of the discriminator are saved, and the parameters of the target network and the prediction network in the virtual reward generator are saved.
And g, continuously training the agent until the iteration is completed.
Specifically, steps b-e are repeated until the total amount of collected data exceeds the preset data requirement (for example, 2 × 10⁸).
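The overall loop of steps b to g can be summarized by the following sketch, which reuses the illustrative helpers defined above; make_batch, stack_states_and_agent_ids, the optimizer layout and the agent objects are hypothetical placeholders rather than elements of the patent:

```python
# Pseudocode-style summary of steps b-g. collect_search_data, reshape_reward,
# actor_critic_update and update_virtual_reward_model are the illustrative
# helpers sketched earlier; make_batch and stack_states_and_agent_ids are
# hypothetical conversion helpers (lists of tuples -> tensors) left undefined.

def train(agents, env_groups, generator, discriminator, optimizers,
          target_transitions=2 * 10**8, rollout_len=128):
    collected = 0
    while collected < target_transitions:
        buffer = []                                                   # fresh list (step f)
        for z, (agent, envs) in enumerate(zip(agents, env_groups)):   # step b: interact
            buffer += collect_search_data(envs, agent.policy, z, rollout_len)
        collected += len(buffer)
        buffer = [(s, a,                                              # steps c-d: virtual
                   reshape_reward(r, s1, z, generator, discriminator, # rewards and weights
                                  len(agents)),
                   d, s1, z)
                  for (s, a, r, d, s1, z) in buffer]
        for z, agent in enumerate(agents):                            # step e: policy/value
            rows = [row[:5] for row in buffer if row[5] == z]
            actor_critic_update(agent.policy, agent.value,
                                optimizers['agents'][z], make_batch(rows))
        states, ids = stack_states_and_agent_ids(buffer)              # step e: reward model
        update_virtual_reward_model(discriminator, generator,
                                    optimizers['disc'], optimizers['gen'],
                                    states, ids)
        # step f/g: the buffer is discarded each round; parameters would be
        # saved periodically until the preset data amount is reached
```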
Compared with the existing deep reinforcement learning search method, the method for constructing the search model can solve the problem that misleading rewards are difficult to process in a scene of inputting high-dimensional data (images) by the existing method, and reduces the probability that the intelligent agent falls into local solution due to the misleading rewards.
The following description is made by combining actual results to compare the advantages of the method provided by the embodiment of the present disclosure compared with the conventional exploration method of deep reinforcement learning.
Referring to FIG. 2, the 3D image simulation environment is a game scene; the task of the game is for the agent to find a target, and once a target is found the game ends. The first target Goal1 in the 3D image simulation environment corresponds to a small reward, for example a reward value of 1 point, and the second target Goal2 corresponds to a large reward, for example a reward value of 10 points. The first target position of the first target Goal1 is near the initial position of the agent (initial state s_0), while the second target position of the second target Goal2 is farther from the initial position of the agent.
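A toy stand-in for this test scene, useful for reproducing the reward structure described above (a nearby misleading goal worth 1 point and a distant global goal worth 10 points), is sketched below; the grid layout and interface are assumptions:

```python
import numpy as np

class TwoGoalMaze:
    """Toy stand-in for the 3D test scene: a small goal near the start
    (misleading, +1) and a large goal far away (globally optimal, +10)."""
    MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}

    def __init__(self, size=10):
        self.size = size
        self.goal1 = (1, 1)      # first target: near the start, reward 1
        self.goal2 = (9, 9)      # second target: far from the start, reward 10

    def reset(self):
        self.pos = (0, 0)        # initial state s_0
        return np.array(self.pos, dtype=np.float32)

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos
        self.pos = (min(max(r + dr, 0), self.size - 1),
                    min(max(c + dc, 0), self.size - 1))
        if self.pos == self.goal1:
            reward, done = 1.0, True      # misleading local optimum
        elif self.pos == self.goal2:
            reward, done = 10.0, True     # global optimum
        else:
            reward, done = 0.0, False     # sparse reward elsewhere
        return np.array(self.pos, dtype=np.float32), reward, done, {}
```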
FIG. 6A schematically shows the result of searching for a target according to the prior art. Referring to FIG. 6A, when a conventional deep reinforcement learning method is used to search for a target in the 3D image simulation environment, the search finally locates, through environment perception and learning, the first target Goal1 corresponding to the locally optimal solution; that is, the prior art is trapped by the misleading reward.
FIG. 6B schematically shows the result of target search performed by the image search model constructed according to the method provided by the embodiment of the present disclosure. As shown in FIG. 6B, when the method provided by the embodiment of the present disclosure is used to search for a target in the 3D image simulation environment, two exploration paths are finally realized, corresponding internally to two search strategy networks. After one agent learns a search strategy that approaches the first target Goal1, as indicated by the blank arrow in FIG. 6B, the weight of the virtual reward is made negative, so other agents are penalized (the virtual reward information is negative) if they again learn a search strategy that approaches the first target Goal1. This forces the other agents to change their search strategies and adopt a strategy that moves away from the first target Goal1, enabling them to explore the environment further and learn a search strategy that approaches the second target Goal2, as indicated by the filled arrow.
The various solutions provided in the above embodiments of the present disclosure may be implemented in whole or in part in hardware, or in software modules running on one or more processors, or in a combination of them. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present disclosure. Embodiments of the present disclosure may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Programs implementing embodiments of the present disclosure may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
A second exemplary embodiment of the present disclosure provides an electronic device.
Fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
Referring to fig. 7, an electronic device 700 provided in the embodiment of the present disclosure includes a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704; a memory 703 for storing a computer program; the processor 701 is configured to implement the method for constructing the model of the diversified search strategy based on the deep reinforcement learning network as described above when executing the program stored in the memory.
A third exemplary embodiment of the present disclosure also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for building a model of a diversified search strategy based on a deep reinforcement learning network as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for constructing a model of a diversified search strategy based on a deep reinforcement learning network is characterized by comprising the following steps:
acquiring search data for searching an image simulation environment by a plurality of agents in an initialization state, wherein the image simulation environment comprises: a first target position corresponding to the local optimum and a second target position corresponding to the global optimum;
generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the position state information in the search data and the initialized virtual reward model, wherein the weights of the virtual rewards are negative values when the search data indicates that one of the plurality of agents is in the first target position;
updating reward information in the search data according to the virtual reward and the weight aiming at the virtual reward, and correspondingly updating a search strategy model of the intelligent agent and the virtual reward model;
and continuing training the updated search strategy model according to the updated search data and the virtual reward model until a training end condition is reached, and taking the trained search strategy model as an image search model which can be positioned to the second target position.
2. The method of claim 1, wherein the virtual reward model comprises: a virtual reward generator and a discriminator;

wherein the virtual reward generator is for encouraging the agent to visit image position states with a relatively small number of historical visits;

the discriminator is configured to determine probabilities of the plurality of agents visiting a particular image position state.
3. The method of claim 2, wherein the search data is a time-sequentially distributed sequence of data sets for each agent, the data sets at each time in the sequence of data sets comprising: the current state, aiming at the current searching action of the current state, aiming at the next moment state obtained after the current searching action is implemented on the current state, and the current reward information;
wherein the generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model comprises:
for each time data set, the following steps are carried out:
inputting the next time state into the virtual reward generator, and outputting to obtain a virtual reward corresponding to the next time state;
inputting the next time state into the discriminator, and outputting to obtain the probability of the next time state accessed by each agent; and
and generating the weight for the virtual reward according to the probability of the next-moment state being visited by the current agent and the average access probability.
4. The method of claim 3, wherein the total number of agents is N, and the probability of the next-moment state being visited by the current agent is expressed as q_φ(z | s_{t+1}), wherein z represents the number of the current agent, the value of z is 1, 2, 3, ..., N, and s_{t+1} represents the next-moment state; the average access probability is 1/N;

wherein the weight α_t of the virtual reward satisfies an expression defined in terms of the probability q_φ(z | s_{t+1}) and the average access probability 1/N.
5. the method of claim 3, wherein updating reward information in the search data based on the virtual reward and a weight for the virtual reward comprises:
performing weighted calculation on the virtual rewards and the weights aiming at the virtual rewards correspondingly to obtain virtual reward information;
and adding and calculating the reward information and the virtual reward information in the search data to obtain updated reward information.
6. The method of claim 1, wherein the correspondingly updating the search strategy model of the agent comprises:
taking the search data containing updated reward information for each agent as the input of the search strategy model of the current agent, and updating the parameters of the search strategy model based on an actor-critic algorithm in a deep reinforcement learning network;
the search strategy model comprises a strategy network and a value network, wherein the input of the strategy network is the current state, and the output of the strategy network is the current search action aiming at the current state; the value network is used for predicting the probability of completing the search task according to the current state;
updating the parameters of the search strategy model comprises: and updating the parameters of the policy network and the value network.
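One way to realize the actor-critic update of claim 6 in PyTorch is sketched below. The layer sizes, the advantage-weighted log-probability policy loss and the 0.5 value-loss coefficient are assumptions introduced for the sketch, not details taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchPolicyModel(nn.Module):
    """Policy network maps the current state to search-action preferences;
    value network scores the current state."""
    def __init__(self, state_dim: int = 64, n_actions: int = 4):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, n_actions))
        self.value = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))

def actor_critic_update(model: SearchPolicyModel, optimizer: torch.optim.Optimizer,
                        states: torch.Tensor, actions: torch.Tensor,
                        returns: torch.Tensor) -> None:
    """One gradient step on search data whose reward information has already
    been updated with the weighted virtual rewards (claim 5)."""
    logits = model.policy(states)                    # action preferences for the batch of states
    values = model.value(states).squeeze(-1)         # state-value estimates
    advantage = returns - values.detach()
    log_prob = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(log_prob * advantage).mean()
    value_loss = F.mse_loss(values, returns)
    loss = policy_loss + 0.5 * value_loss            # 0.5 is an assumed coefficient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()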
7. The method of claim 2, wherein the discriminator comprises a neural network model q; the virtual reward generator comprises: a target network f whose parameters are randomly initialized and then fixed, and a prediction network f' whose parameters are trainable;
wherein the updating the virtual reward model comprises:
updating the parameters of the discriminator based on a first loss function by taking the state information in the updated search data as the input of the discriminator;
wherein the first loss function L1 is computed over the M training data (its exact expression is given only as a formula image in the original publication), where M represents the total number of training data; the neural network model q of the discriminator takes a state s as input and outputs the probability that the state s belongs to the z-th agent, z taking the values 1, 2, 3, ..., N, with N representing the total number of agents;
updating the parameters of the virtual reward generator based on a second loss function by taking the state information in the updated search data as the input of the virtual reward generator;
wherein the second loss function L2 satisfies an expression defined in terms of the prediction network f' and the fixed target network f (given only as a formula image in the original publication).
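The two loss functions of claim 7 are available only as formula images. Standard stand-ins consistent with the claim's wording are a cross-entropy loss for the discriminator and a prediction-error loss for the generator; the sketch below uses those assumed forms.

import torch
import torch.nn.functional as F

def discriminator_loss(disc_logits: torch.Tensor, agent_ids: torch.Tensor) -> torch.Tensor:
    """Assumed first loss: average cross-entropy over the M states in the batch,
    training the discriminator to recognise which agent visited each state."""
    return F.cross_entropy(disc_logits, agent_ids)

def generator_loss(pred_features: torch.Tensor, target_features: torch.Tensor) -> torch.Tensor:
    """Assumed second loss: mean squared error between the trainable prediction
    network and the fixed random target network over the M states."""
    return F.mse_loss(pred_features, target_features.detach())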
8. The method of claim 1, wherein the obtaining of the search data generated by the plurality of agents searching the image simulation environment in the initialization state comprises:
for each agent of the plurality of agents in the initialization state, taking the current state s_t given by the image simulation environment as the input of the current agent, the current agent outputting the search action a_t corresponding to the current state s_t;
the image simulation environment, based on the current state s_t and the corresponding search action a_t, outputting the next-moment state s_{t+1}, the reward information r_t obtained by the current agent, and a termination identifier d_t;
iterating in time order to obtain, for each agent, a time-ordered sequence of data sets, each data set being a six-tuple of the form (s_t, a_t, r_t, d_t, s_{t+1}, z), where z represents the number of the agent, z takes the values 1, 2, 3, ..., N, and N represents the total number of agents.
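An illustrative rollout loop for claim 8, returning the time-ordered six-tuples. The reset/step interface is an assumption made for the sketch; the patent does not prescribe a specific simulator API.

def collect_episode(env, policy, agent_id: int, max_steps: int = 200):
    """Roll out one agent in the image simulation environment and return the
    six-tuples (s_t, a_t, r_t, d_t, s_{t+1}, z) in time order."""
    trajectory = []
    state = env.reset()                                  # initialization state
    for _ in range(max_steps):
        action = policy(state)                           # a_t from the search strategy model
        next_state, reward, done = env.step(action)      # s_{t+1}, r_t, termination identifier d_t
        trajectory.append((state, action, reward, done, next_state, agent_id))
        state = next_state
        if done:
            break
    return trajectory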
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method of any one of claims 1 to 8 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the method of any one of claims 1 to 8.
CN202111565916.8A 2021-12-21 2021-12-21 Method for constructing diversified search strategy model based on deep reinforcement learning network Active CN113962390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565916.8A CN113962390B (en) 2021-12-21 2021-12-21 Method for constructing diversified search strategy model based on deep reinforcement learning network

Publications (2)

Publication Number Publication Date
CN113962390A (en) 2022-01-21
CN113962390B (en) 2022-04-01

Family

ID=79473425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111565916.8A Active CN113962390B (en) 2021-12-21 2021-12-21 Method for constructing diversified search strategy model based on deep reinforcement learning network

Country Status (1)

Country Link
CN (1) CN113962390B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107263449A (en) * 2017-07-05 2017-10-20 中国科学院自动化研究所 Robot remote teaching system based on virtual reality
WO2020162211A1 (en) * 2019-02-06 2020-08-13 日本電信電話株式会社 Control device, control method and program
CN110882544A (en) * 2019-11-28 2020-03-17 网易(杭州)网络有限公司 Multi-agent training method and device and electronic equipment
US20210200163A1 (en) * 2019-12-13 2021-07-01 Tata Consultancy Services Limited Multi-agent deep reinforcement learning for dynamically controlling electrical equipment in buildings
CN111242443A (en) * 2020-01-06 2020-06-05 国网黑龙江省电力有限公司 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
CN112765723A (en) * 2020-12-10 2021-05-07 南京航空航天大学 Curiosity-driven hybrid power system deep reinforcement learning energy management method
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113592101A (en) * 2021-08-13 2021-11-02 大连大学 Multi-agent cooperation model based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU PEI ET AL: "Deep Reinforcement Learning with Part-aware Exploration Bonus in Video Games", 《IEEE TRANSACTIONS ON GAMES》 *
黄凯奇 (HUANG Kaiqi): "Human-Machine Confrontation Intelligence Technology" (人机对抗智能技术), 《SCIENTIA SINICA INFORMATIONIS》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115412401A (en) * 2022-08-26 2022-11-29 京东科技信息技术有限公司 Method and device for training virtual network embedding model and virtual network embedding
CN115412401B (en) * 2022-08-26 2024-04-19 京东科技信息技术有限公司 Method and device for training virtual network embedding model and virtual network embedding
CN115272541A (en) * 2022-09-26 2022-11-01 成都市谛视无限科技有限公司 Gesture generation method for driving intelligent agent to reach multiple target points
CN115272541B (en) * 2022-09-26 2023-01-03 成都市谛视无限科技有限公司 Gesture generation method for driving intelligent agent to reach multiple target points
CN117150927A (en) * 2023-09-27 2023-12-01 北京汉勃科技有限公司 Deep reinforcement learning exploration method and system based on extreme novelty search
CN117150927B (en) * 2023-09-27 2024-04-02 北京汉勃科技有限公司 Deep reinforcement learning exploration method and system based on extreme novelty search

Similar Documents

Publication Publication Date Title
CN113962390B (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
JP7159458B2 (en) Method, apparatus, device and computer program for scheduling virtual objects in a virtual environment
US11580378B2 (en) Reinforcement learning for concurrent actions
CN109511277B (en) Cooperative method and system for multi-state continuous action space
CN109952582A (en) A kind of training method, node, system and the storage medium of intensified learning model
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN114004370A (en) Method for constructing regional sensitivity model based on deep reinforcement learning network
CN113919482A (en) Intelligent agent training method and device, computer equipment and storage medium
CN115300910B (en) Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
EP4014162A1 (en) Controlling agents using causally correct environment models
CN111611703B (en) Sand table deduction method, device and equipment based on digital twin and storage medium
CN117540203A (en) Multi-directional course learning training method and device for cooperative navigation of clustered robots
Adamsson Curriculum learning for increasing the performance of a reinforcement learning agent in a static first-person shooter game
CN113139644B (en) Information source navigation method and device based on deep Monte Carlo tree search
WO2022167079A1 (en) An apparatus and method for training a parametric policy
KR20220090732A (en) Method and system for determining action of device for given state using model trained based on risk measure parameter
Beaulac et al. Narrow artificial intelligence with machine learning for real-time estimation of a mobile agent’s location using hidden Markov models
Picardi A comparison of Different Machine Learning Techniques to Develop the AI of a Virtual Racing Game
CN116520851B (en) Object trapping method and device
CN115826621B (en) Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN117556681B (en) Intelligent air combat decision method, system and electronic equipment
Elliott et al. Using supervised training signals of observable state dynamics to speed-up and improve reinforcement learning
CN113537318A (en) Robot behavior decision method and device simulating human brain memory mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant