CN113962390A

CN113962390A - Method for constructing diversified search strategy model based on deep reinforcement learning network

Info

Publication number: CN113962390A
Application number: CN202111565916.8A
Authority: CN
Inventors: 黄凯奇; 尹奇跃; 张俊格; 徐沛
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-01-21
Anticipated expiration: 2041-12-21
Also published as: CN113962390B

Abstract

The invention relates to a method for constructing a diversified search strategy model based on a deep reinforcement learning network, which is based on the setting of the weight of virtual rewards, and can ensure that different agents access different states, once a certain agent falls into misleading rewards, when other agents access a series of states leading to the misleading rewards again, because the weight is a negative value, the signals of the virtual rewards obtained by other agents are negative, the agents are forced to not access the series of states leading to the misleading rewards, thereby ensuring that different agents access different state sets, leading the updated search strategy model to be capable of finding a second target position corresponding to global optimality after being trained, effectively solving the technical problem that the global optimality cannot be searched due to the misleading rewards when searching high-dimensional data in the prior art, the probability that the agent falls into a local solution due to misleading rewards can be reduced.

Description

Method for constructing diversified search strategy model based on deep reinforcement learning network

Technical Field

The disclosure relates to the field of deep reinforcement learning and the technical field of image processing, in particular to a method for constructing a model of diversified search strategies based on a deep reinforcement learning network.

Background

With the development of artificial intelligence technology, when a decision is made in the face of a complex scene, a deep reinforcement learning method is provided. Deep Learning (DL) is a method for performing characterization Learning on data in machine Learning. Reinforcement Learning (RL) is an optimal strategy obtained by building an environment model and Learning while exploring an unknown environment. Deep Reinforcement Learning (DRL) is an artificial intelligence method that combines the perception capability of Deep learning and the decision-making capability of reinforcement learning, can be controlled directly according to input information, and is closer to the way of human thinking.

Deep reinforcement learning has become a popular way to train an agent to perform complex tasks. Deep reinforcement learning trains an agent by maximizing the reward signal. The success of deep reinforcement learning is mostly achieved in scenes where the reward signal is well-designed and sufficiently dense. However, in many environments, the reward signal is very sparse for the agent. In a scenario where the reward is dense, the agent may easily find the reward by taking random actions. However, in a scenario where the reward is sparse, it is difficult to expect the reward to be obtained by random search. And if the reward signal is not available, the deep reinforcement learning algorithm cannot update the strategy. In scenarios where the reward is sparse, the agent must have the ability to explore. Therefore, the exploration problem in the deep reinforcement learning has extremely important research and application values.

However, the conventional exploration method for deep reinforcement learning is difficult to handle the problem of misleading rewards in a scene (such as an environment in a state of an image and a high-dimensional vector) with high-dimensional data input, and the misleading rewards can prevent the agent from obtaining higher returns in a long term, which finally causes the agent to be trapped in local solution.

Disclosure of Invention

To solve the above technical problem or at least partially solve the above technical problem, embodiments of the present disclosure provide a method for building a model of a diversified search strategy based on a deep reinforcement learning network.

In a first aspect, an embodiment of the present disclosure provides a method for building a model based on a deep reinforcement learning network. The method comprises the following steps: acquiring search data for searching an image simulation environment by a plurality of agents in an initialization state, wherein the image simulation environment comprises: a first target position corresponding to the local optimum and a second target position corresponding to the global optimum; generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the position state information in the search data and an initialized virtual reward model, wherein the weights of the virtual rewards are negative values when the search data indicates that one of the plurality of agents is in the first target position; updating reward information in the search data according to the virtual reward and the weight aiming at the virtual reward, and correspondingly updating the search strategy model of the intelligent agent and the virtual reward model; and continuing training the updated search strategy model according to the updated search data and the virtual reward model until a training end condition is reached, wherein the trained search strategy model is used as an image search model capable of being positioned to the second target position.

According to an embodiment of the present disclosure, the virtual reward model includes: a virtual reward generator and an arbiter; wherein the virtual reward generator is configured to incentivize the agent to access image location states with relatively few historical access times; the discriminator is used for determining the probability of the plurality of agents visiting the specific image position state.

According to an embodiment of the present disclosure, the search data is a time-sequentially distributed data set sequence for each agent, and a data set at each time in the data set sequence includes: and the current state is the current searching action of the current state, and the current reward information is obtained by implementing the current searching action according to the current state and the next moment state. Wherein the generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model comprises: for each time data set, the following steps are carried out: inputting the next time state into the virtual reward generator, and outputting to obtain the virtual reward corresponding to the next time state; inputting the next time state into the discriminator, and outputting to obtain the probability of the next time state accessed by each agent; and generating a weight for the virtual reward according to the probability of access by the current agent at the next moment state and the average access probability.

According to an embodiment of the present disclosure, the total number of the agents isNThe probability of the next moment state being accessed by the current agent is expressed as

Wherein z represents the number of the current agent, the value of z is 1,2,3, … …,N；

representing the state at the next moment; the average access probability is 1N；

Wherein the weight of the virtual reward

The following expression is satisfied:

。

according to an embodiment of the present disclosure, the updating reward information in the search data according to the virtual reward and the weight for the virtual reward includes: performing weighted calculation on the virtual rewards and the weights aiming at the virtual rewards correspondingly to obtain virtual reward information; and adding and calculating the reward information and the virtual reward information in the search data to obtain updated reward information.

According to an embodiment of the present disclosure, the updating the search policy model of the agent includes: taking the search data containing updated reward information aiming at each agent as the input of a search strategy model of the current agent, and updating the parameters of the search strategy model based on an operator-critic algorithm in a deep reinforcement learning network; the search strategy model comprises a strategy network and a value network, wherein the input of the strategy network is the current state, and the output of the strategy network is the current search action aiming at the current state; the value network is used for predicting the probability of completing the search task according to the current state; the updating the parameters of the search strategy model comprises: and updating the parameters of the policy network and the value network.

According to an embodiment of the present disclosure, the discriminator includes a neural network model

The virtual prize generator comprises: target network with randomly initialized parameters and fixed parameters

And parametric trainable predictive networks

；

Wherein, updating the virtual reward model comprises:

updating the parameters of the discriminator based on a first loss function by using the state information in the updated search data as the input of the discriminator; updating the parameters of the virtual reward generator based on a second loss function by taking the state information in the updated search data as the input of the virtual reward generator;

wherein the first loss function is expressed as

，

The following expression is satisfied:

，

wherein M represents the total number of training data, the neural network model of the discriminator

In the state ofsTo input, output the statesProbability of belonging to z-th agent

Z takes the value of 1,2,3, … …, and N represents the total number of agents;

wherein the second loss function is expressed as

，

The following expression is satisfied:

。

according to an embodiment of the present disclosure, the acquiring search data for searching the image simulation environment by a plurality of agents in the initialization state includes: a current state given by the image simulation environment for each agent of the plurality of agents in the initialization states _tThe current agent output and the current state are used as the input of the current agents _tCorresponding search actiona _t(ii) a The image simulation environment is based on the current states _tAnd corresponding search actiona _tOutput the state of the next times _t+1The reward information obtained by the current intelligent agentr _tAnd termination identificationSymbold _t(ii) a Iteration is carried out based on the time sequence to obtain a data group sequence distributed according to the time sequence for each agent, wherein the data group sequence is in a six-tuple form: (s _t，a _t，r _t，d _t，s _t+1Z), wherein z represents the number of the agent, the value of z is 1,2,3, … …, N, and N represents the total number of the agent.

In a second aspect, embodiments of the present disclosure provide an electronic device. The electronic equipment comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; a memory for storing a computer program; and the processor is used for realizing the method for constructing the model of the diversified search strategy based on the deep reinforcement learning network when executing the program stored in the memory.

In a third aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for building a model of a diversified search strategy based on a deep reinforcement learning network as described above.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:

generating corresponding virtual rewards and weights for the virtual rewards for a plurality of agents according to position state information in search data and an initialized virtual reward model, wherein the weights of the virtual rewards are negative values when the search data indicates that one of the agents is located at a first target position corresponding to local optimum; based on the setting of the weight of the virtual reward, different agents can access different states, once a certain agent falls into the misleading reward (the mode of obtaining the misleading reward is to access the state of generating the misleading reward, for example, to access the state corresponding to the first target position in the image simulation environment), when other agents access a series of states of guiding the misleading reward again, because the weight is a negative value, the signals of the virtual reward obtained by the other agents are negative, the agents are forced to not access the series of states of guiding the misleading reward any more, so that different agents are ensured to access different state sets, the updated search strategy model can find the second target position corresponding to the global optimum after being trained, and the problem of high-dimensional data (for example, 3D image data, etc.) in the prior art is effectively solved, Actual scene data, etc.) to cause the problem that the global optimum cannot be searched due to the misleading reward when searching, and the probability that the intelligent agent falls into the local solution due to the misleading reward can be reduced.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 schematically illustrates a flow diagram of a method of building a model of a diversified search strategy based on a deep reinforcement learning network, in accordance with an embodiment of the present disclosure;

FIG. 2 schematically illustrates a schematic diagram of an image simulation environment, where (a) is a perspective schematic diagram of a 3D image simulation environment, and (b) is a top view of the 3D image simulation environment, according to an embodiment of the present disclosure;

fig. 3 schematically shows a detailed implementation process diagram of step S110 according to an embodiment of the present disclosure;

FIG. 4 schematically shows a schematic structural diagram of an arbiter according to an embodiment of the present disclosure;

fig. 5 schematically illustrates an implementation process of updating the bonus information in step S120 and step S130 according to an embodiment of the disclosure;

FIG. 6A schematically shows the results of a target search according to the prior art;

FIG. 6B schematically shows the result of target search by the image search model constructed according to the method provided by the embodiment of the disclosure; and

fig. 7 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

A first exemplary embodiment of the present disclosure provides a method of building a model based on a deep reinforcement learning network.

Fig. 1 schematically shows a flowchart of a method of building a model of a diversified search strategy based on a deep reinforcement learning network according to an embodiment of the present disclosure.

Referring to fig. 1, a method for building a model of a diversified search strategy based on a deep reinforcement learning network according to an embodiment of the present disclosure includes the following steps: s110, S120, S130 and S140.

In step S110, search data for searching an image simulation environment by a plurality of agents in an initialization state is acquired, where the image simulation environment includes: a first target position corresponding to the local optimum and a second target position corresponding to the global optimum.

FIG. 2 schematically shows a schematic diagram of an image simulation environment according to an embodiment of the present disclosure, where (a) is a perspective schematic diagram of a 3D image simulation environment, and (b) is a top view of the 3D image simulation environment.

Referring to fig. 2 (a) and (b), the image simulation environment is, for example, a 3D image simulation environment, and the 3D image simulation environment may be an environment simulating a virtual environment, for example, an environment simulating an environment in a game interface (for example, a three-dimensional maze), or a simulation environment simulating a real environment (for example, a rescue scene of an article with different degrees of importance in a fire). The device where the agent is located or the agent itself can sense the surrounding environment (real environment or environment in the virtual interface) through the sensor, and the image simulation environment is obtained through simulation by using the sensing data.

In the 3D image simulation environment, two objects are included as an example, and the specific number of each object is not limited. Objects are illustrated in fig. 2 (b) as five-pointed stars, one of which is a locally optimal corresponding first object Goal1, the first object Goal1 being located at a first object location in the 3D image simulation environment; another target is a second target Goal2 with a globally optimal correspondence, the second target Goal2 being located at a second target location in the 3D image simulation environment. The first target position of the first target and the second target position of the second target may be static (over time) or may be dynamically variable over time.

When the intelligent agent is in an initialization state, the parameters in the search strategy model of the intelligent agent are initialization values. In the embodiment of the present disclosure, the agent refers to a program or an entity capable of sensing the environment through the sensor and acting on the environment through the actuator, and may be, for example, an application program: taking the state as input and the action as output; it may also be an electronic device installed with the above application program, such as an intelligent robot (e.g. a search and rescue robot) with a sensor (for detecting the environment) or other intelligent devices.

The intelligent body interacts with the image simulation environment, namely the intelligent body inputs the current state of the intelligent body into the initialized search strategy model in the image simulation environment, the search action to be executed by the intelligent body is obtained through output, and the image simulation environment obtains the state of the intelligent body at the next moment in the image simulation environment and the reward information according to the current state (such as the current position) and the search action of the intelligent body. Iteration is performed based on time sequence, and search data of each of the plurality of agents is obtained, wherein the search data at least comprises position state information (which can correspond to the current state and the description of the state at the next moment) and reward information.

In step S120, corresponding virtual rewards and weights for the virtual rewards are generated for the plurality of agents according to the location state information in the search data and the initialized virtual reward model, wherein the weights for the virtual rewards are negative values when the search data indicates that one of the plurality of agents is in the first target location.

The parameters of the initialized virtual reward model (e.g., neural network model) are initialization values. When the search data indicates that the agent is close to the first target location, the weight of the virtual reward is a negative value; different agents can be made to access different states based on the setting of the weights of the virtual rewards. Once a particular agent has been trapped in a misleading reward (the manner in which the misleading reward is obtained is by accessing a state that produces the misleading reward, e.g., a particular agent accesses a state corresponding to a first target location in an image simulation environment), when another agent accesses a series of states that lead to the misleading reward again, the signals of the virtual rewards obtained by the remaining agents are negative because the weights are negative, and the agents are forced to no longer access the series of states that lead to the misleading reward, thereby ensuring that different agents access different sets of states.

In step S130, the reward information in the search data is updated according to the virtual reward and the weight for the virtual reward, and the search policy model of the agent and the virtual reward model are updated accordingly.

By updating the reward information in the search data according to the virtual reward and the weight aiming at the virtual reward, correspondingly updating the search strategy model of the agent and updating the virtual reward model, the dynamic adjustment of the search direction (one concrete embodiment of the search strategy) through the positive and negative of the weight is realized, so that a plurality of agents are not limited to the local solution. The virtual rewards and their weights are used to give guidance to the search direction based on the reward information (this reward information), and when the weights are negative, the virtual rewards play a role in negatively regulating the reward information in the search data, so that the agent may adopt a strategy (such as being away from the first target position and gradually approaching the second target position) opposite to the previously adopted movement strategy (such as being close to the first target position).

In step S140, the updated search strategy model is continuously trained according to the updated search data and the virtual reward model until the training end condition is reached, and the trained search strategy model is used as an image search model that can be positioned to the second target position.

The training end condition includes: the data volume reaches the preset number, or the training time reaches the set value, and the like.

Based on the steps S110-S140, generating corresponding virtual rewards and weights aiming at the virtual rewards for a plurality of intelligent agents according to the position state information in the search data and the initialized virtual reward model, wherein when the search data indicates that one of the intelligent agents is located at the first target position corresponding to local optimum, the weights of the virtual rewards are negative values; in this way, the setting of the weight based on the virtual reward enables different agents to access different states, and once a certain agent has been trapped in a misleading reward, when another agent accesses a series of states leading to the misleading reward again, because the weight is negative, the signals of the virtual prizes earned by the remaining agents are negative, forcing the agents to no longer access the series of states leading to the misleading prize, therefore, different agents can access different state sets, the updated search strategy model can find the second target position corresponding to the global optimum after being trained, the technical problem that the global optimum cannot be searched due to the fact that misleading rewards are trapped when high-dimensional data (such as 3D images) are searched in the prior art is effectively solved, and the probability that the agents are trapped in local solutions due to the misleading rewards can be reduced.

The specific implementation of the above steps will be described in detail below.

Fig. 3 schematically shows a detailed implementation process diagram of step S110 according to an embodiment of the present disclosure.

The total number of agents is N (N ≧ 2 and N is an integer), the current agent is illustrated in FIG. 3 as being assigned M (M ≧ 2 and M is an integer) image simulation environments, and only one of the agents is illustrated in FIG. 3 as being in interaction with CzM of the M image simulation environments (Cz 1, Cz2, … …, CzM).

For example, referring to fig. 3, the search data is a time-series data set sequence for each agent, and the data set at each time in the data set sequence includes: current states _tCurrent search action against current statea _tThe state at the next moment obtained by implementing the current search action according to the current states _t+1This time of reward informationr _t. The search data for each agent carries the agent's serial number identifier and termination identifier.

According to an embodiment of the present disclosure, referring to fig. 3, in step S110, acquiring search data for searching for an image simulation environment by a plurality of agents in an initialization state includes: a current state given by the image simulation environment for each agent of the plurality of agents in the initialization states _t(corresponding to the location of the agent in the image simulation environment) as the input to the current agent, the current agent output and the current states _tCorresponding search actiona _t(ii) a The image simulation environment is based on the current states _tAnd corresponding search actiona _tOutput the state of the next times _t+1The reward information obtained by the current intelligent agentr _tAnd a termination identifierd _t(ii) a Iteration is carried out based on the time sequence to obtain a data group sequence distributed according to the time sequence for each agent, wherein the data group sequence is in a six-tuple form: (s _t，a _t，r _t，d _t，s _t+1Z), wherein z represents the number of the agent, the value of z is 1,2,3, … …, N, and N represents the total number of the agent.

Where the initial time t takes a value of 0, such as illustrated in fig. 2 (b)s ₀Status.

The detailed implementation process of step S120 is described below with reference to fig. 4 and 5.

FIG. 4 schematically shows a schematic structural diagram of an arbiter according to an embodiment of the present disclosure; fig. 5 schematically shows an implementation process of updating the bonus information in step S120 and step S130 according to an embodiment of the disclosure.

According to an embodiment of the present disclosure, the virtual reward model includes: a virtual prize generator and an arbiter. Wherein the virtual reward generator is configured to incentivize the agent to access image location states with a relatively small number of historical accesses. The discriminator is used for determining the probability of the plurality of agents visiting the specific image position state.

The function of the discriminator comprises: the probability of a certain state being accessed by each agent is output by taking the state as an input, and the output probability of the z-th agent is proportional to the number of times the z-th agent accesses the state in the historical access data (state data). The role of the virtual prize generator includes: the virtual prize is output with a certain state as input. The virtual reward is inversely proportional to the number of times all agents in the historical access data (state data) have accessed the state.

Referring to fig. 4 and 5, according to an embodiment of the present disclosure, the discriminator includes a neural network model 410, and the virtual reward generator includes two neural network models, a target network 510 with randomly initialized parameters and fixed parameters, and a prediction network 520 with trainable parameters.

In the step S120, generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model includes: for each time-instant data set, the following substeps are performed: s121, S122 and S123.

In the substep S121, the next time state is input into the virtual bonus generator, and the virtual bonus corresponding to the next time state is output.

For example, referring to FIG. 5, the state at the next time will be describeds _t+1Inputting the data into a virtual reward generator with initialized parameters, processing the data by a target network 510 and a prediction network 520 in an initialized state, and outputting the data to obtain a virtual reward, which is shown in fig. 5b ^eTo illustrate the virtual reward.

In the substep S122, the next time state is input to the arbiter, and the probability of the next time state being accessed by each agent is output.

For example, referring to FIGS. 4 and 5, the state at the next time will be describeds _t+1The probability of the current agent accessing the current state at the next moment is obtained after the input of the probability into the neural network model 410 in the discriminator and the processing of the neural network model 410.

In sub-step S123, a weight for the virtual reward is generated according to the probability of access by the current agent in the next time state and the average access probability, and the weight is described as a diversity factor (diversity factor) in fig. 5 because the weight contributes to increase the diversity of the search policy.

Wherein z represents the number of the current agent, and the value of z is 1,2,3,……,N；

Representing the state at the next moment; the average access probability is 1N。

Wherein the weight of the virtual reward

The following expression is satisfied:

（1）。

according to an embodiment of the present disclosure, in the step S130, updating reward information in the search data according to the virtual reward and the weight for the virtual reward includes the following sub-steps: s131 and S132.

In the substep S131, the virtual bonus and the weight for the virtual bonus are weighted and calculated to obtain virtual bonus information.

As shown with reference to fig. 5, tob ^deTo illustrate virtual bonus information.

In the substep S132, the current reward information and the virtual reward information in the search data are summed up to obtain updated reward information.

As shown with reference to fig. 5, tor ^deTo illustrate the updated reward information.

According to an embodiment of the present disclosure, in the step S130, the updating the search policy model of the agent correspondingly includes: and updating the parameters of the search strategy model based on an operator-critic algorithm in the deep reinforcement learning network by taking the search data containing the updated reward information of each agent as the input of the search strategy model of the current agent.

The search strategy model comprises a strategy network and a value network, wherein the input of the strategy network is the current state, and the output of the strategy network is the current search action aiming at the current state; the value network is used for predicting the probability of completing the search task according to the current state. The updating the parameters of the search strategy model comprises: and updating the parameters of the policy network and the value network.

In one embodiment, for a policy network, the policy gradient is used for updating, and the policy gradient satisfies the following expression:

（2），

wherein the content of the first and second substances,

representing the policy gradient of the z-th agent, M representing the total number of training data, pi representing the policy network,θrepresenting network parameters, z representing the number of the current agent, the value of z being 1,2,3, … …,N，

represents the current state corresponding to the current time t

Is estimated by the value of (a) of (b),

indicates the state of the next time corresponding to the next time t +1

Is estimated by the value of (a) of (b),

the information of the reward at this time is shown,

indicating the current state

Selecting search actions

The probability of (c).

Loss function of the above value network

The following expression is satisfied:

（3）。

And parametric trainable predictive networks

。

In step S130, the updating the virtual bonus model includes: updating the parameters of the discriminator based on a first loss function by using the state information in the updated search data as the input of the discriminator; and updating the parameters of the virtual reward generator based on a second loss function by taking the state information in the updated search data as the input of the virtual reward generator.

Wherein the first loss function is expressed as

，

The following expression is satisfied:

（4），

wherein the second loss function is expressed as

，

The following expression is satisfied:

（5）。

the implementation process of the method for building a model based on a deep reinforcement learning network according to the present disclosure is described below with reference to a specific example.

The method comprises the following steps:

step a, initializing an intelligent agent, an image simulation environment, a discriminator and a virtual reward generator.

Specifically, parameters of a policy network and a value network of 5 (one example of N) agents are initialized. A 5 x 32 (an example of N x M) number of image simulation environments are initialized. Parameters of a neural network model in the arbiter are initialized. Parameters of a virtual reward generator comprised of a target network and a predictive network are initialized. A data collection list is initialized. It should be noted that after initialization of each image simulation environment, the initial state data (image data) is returned, i.e. step 0.

And b, interacting the intelligent agent with the environment and collecting search data.

Specifically, step b may be implemented with the following sub-steps:

sub-step b-1, using 5 x 32 image simulation environments in parallel, assigns 32 image simulation environments (which may be subsequently simply referred to as environments) to each agent.

And a substep b-2, for one environment in the parallel environments, sending the state data of the current environment into a search strategy network of a corresponding agent to obtain action output (output search action) corresponding to the current state.

Sub-step b-3, sub-step b-2 is performed for all contexts.

And a sub-step b-4, each environment receives the action of the corresponding agent to perform one-step forward simulation, and returns the state data of the next step, the reward information and the termination identifier.

The sub-step b-5, the above sub-steps b-2 to b-4 are repeated 128 times (time sequence length), 160 search data (with 128 time sequence track length) in six-tuple form can be obtaineds _t，a _t，r _t，d _t，s _t+1Z), the search data being training data,tthe value of (a) is 0-127 (inclusive), and 128 sets of training data are total.

It is noted that during this period, when the environment simulation is finished at a certain point, the reset environment (reinitialization) continues the simulation.

And a substep b-6 of storing the search data in a data collection list.

And c, generating a virtual reward signal.

Specifically, step c can be implemented by the following substeps:

substep c-1, pulls training data from the data collection list.

Substep c-2, searching data for one of the above training data (c)s _t，a _t，r _t，d _t，s _t+1Z) (the value of t is definite), and the state of the next moment is determineds _t+1Sending into a virtual reward generator to obtain a virtual rewardb _t。

And d, generating the virtual reward weight.

Specifically, step d can be implemented by the following substeps:

substep d-1, pulls training data from the data collection list.

Substep d-2, searching for data for one of the above training data: (s _t，a _t，r _t，d _t，s _t+1Z) (the value of t is definite), and the state of the next moment is determineds _t+1Sending the data to a discriminator to obtain the probability that the data is generated by the intelligent agent z

Then, the weight of the virtual reward is calculated according to the above formula (1)α _t。

Substep d-3, updating the reward signal:r _t(after update) =r _t(before updating) +α _t×b _t。

Substep d-4, which implements the above substeps c-1, c-2, d-1, d-2, d-3 for all data in the data collection list, i.e. for each data, 160 × 128 times are performed until all search data (or described as training data) in the data collection list are updated.

Wherein, since the weight of the virtual reward is a negative value when the search data indicates that one of the agents is in the first target position, for example, the visit track of agent 2 of 5 agents is S₀,S₁,S₂,S₃S, S indicates a misleading state, and when agent 2 accesses this state S, the weight corresponding to the virtual award is negative, and the other agents 1, 3, 4, and 5 access these states subsequentlyThe state is given a negative reward which forces the agents 1, 3, 4 and 5 to avoid accessing a series of states leading to state S by adjusting the search strategy.

And e, updating the model parameters.

Specifically, step e can be implemented by the following substeps:

substep e-1, pulls training data from the data collection list.

A substep e-2 of using all the data in the data collection list and updating parameters of the policy network and the value network corresponding to the agent according to the agent number z in the data; updating parameters of the discriminator through cross entropy loss; for updating the parameters of the virtual reward generator including a target network with randomly initialized parameters and fixed parameters and a prediction network with trainable parameters, the specific updating method may refer to the description of the foregoing embodiments, and details are not repeated here.

And f, emptying the data collection list and storing the model parameters.

Specifically, step f can be implemented by the following substeps:

substep f-1, emptying the data in the data collection list.

A substep f-2 of repeating the process of the above steps b-e a predetermined number of times (e.g., 10)³Second), finishing updating of a version parameter, and storing parameters of a policy network and a value network of all agents; saving parameters of the discriminator; parameters of the target network and the predicted network in the virtual reward generator are saved.

And g, continuously training the agent until the iteration is completed.

Specifically, steps b-e are repeated until the total collected data amount exceeds the requirement of the preset data amount (for example, 2 × 10)⁸）。

Compared with the existing deep reinforcement learning search method, the method for constructing the search model can solve the problem that misleading rewards are difficult to process in a scene of inputting high-dimensional data (images) by the existing method, and reduces the probability that the intelligent agent falls into local solution due to the misleading rewards.

The following description is made by combining actual results to compare the advantages of the method provided by the embodiment of the present disclosure compared with the conventional exploration method of deep reinforcement learning.

Referring to fig. 2, the 3D image simulation environment is a game scene, and the task of the game is to let the agent find a target, and once the target game is found, the game is ended. The first target Goal1 in the 3D image simulation environment corresponds to a small prize, for example, a prize value of 1 point, the second target Goal2 corresponds to a large prize, for example, a prize value of 10 points, and the first target position of the first target Goal1 is far from the initial position of the agent (initial state S)₀) Nearer, the second target location of the second target Goal2 is farther from the initial location of the agent.

Fig. 6A schematically shows the result of searching for an object according to the prior art, and referring to fig. 6A, a conventional deep reinforcement learning method is adopted to search for an object in a 3D image simulation environment, and as a result, the object is finally located to the first object Goal1 corresponding to the local optimal solution through environment perception and learning, and thus, the prior art is involved in misleading rewards.

Fig. 6B schematically shows the result of object search performed by the image search model constructed according to the method provided by the embodiment of the present disclosure, and as shown in fig. 6B, the method provided by the embodiment of the present disclosure is used to search for an object in the 3D image simulation environment, so as to finally realize 2 exploration paths, where the two search strategy networks correspond internally, after an agent learns a search strategy close to the first goal, as indicated by the blank arrow in fig. 6B, the weight of the virtual award will be made negative, thus causing other agents to be penalized (corresponding to a negative value of the virtual reward information) if they learn a search strategy close to the first Goal1 again, forcing other agents to change search strategies, enabling them to explore the environment further and learn a search strategy close to the second Goal2, indicated by the filled arrow, using a search strategy far from the first Goal.

The various solutions provided in the above embodiments of the present disclosure may be implemented in whole or in part in hardware, or in software modules running on one or more processors, or in a combination of them. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present disclosure. Embodiments of the present disclosure may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Programs implementing embodiments of the present disclosure may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

A second exemplary embodiment of the present disclosure provides an electronic device.

Referring to fig. 7, an electronic device 700 provided in the embodiment of the present disclosure includes a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704; a memory 703 for storing a computer program; the processor 701 is configured to implement the method for constructing the model of the diversified search strategy based on the deep reinforcement learning network as described above when executing the program stored in the memory.

A third exemplary embodiment of the present disclosure also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for building a model of a diversified search strategy based on a deep reinforcement learning network as described above.

The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for constructing a model of a diversified search strategy based on a deep reinforcement learning network is characterized by comprising the following steps:

acquiring search data for searching an image simulation environment by a plurality of agents in an initialization state, wherein the image simulation environment comprises: a first target position corresponding to the local optimum and a second target position corresponding to the global optimum;

generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the position state information in the search data and the initialized virtual reward model, wherein the weights of the virtual rewards are negative values when the search data indicates that one of the plurality of agents is in the first target position;

updating reward information in the search data according to the virtual reward and the weight aiming at the virtual reward, and correspondingly updating a search strategy model of the intelligent agent and the virtual reward model;

and continuing training the updated search strategy model according to the updated search data and the virtual reward model until a training end condition is reached, and taking the trained search strategy model as an image search model which can be positioned to the second target position.

2. The method of claim 1, wherein the virtual reward model comprises: a virtual reward generator and an arbiter;

wherein the virtual reward generator is for incentivizing the agent to access image location states with a relatively small number of historical accesses;

the arbiter is configured to determine probabilities of access to a particular image location state by the plurality of agents.

3. The method of claim 2, wherein the search data is a time-sequentially distributed sequence of data sets for each agent, the data sets at each time in the sequence of data sets comprising: the current state, aiming at the current searching action of the current state, aiming at the next moment state obtained after the current searching action is implemented on the current state, and the current reward information;

wherein the generating corresponding virtual rewards and weights for the virtual rewards for the plurality of agents according to the state information in the search data and the initialized virtual reward model comprises:

for each time data set, the following steps are carried out:

inputting the next time state into the virtual reward generator, and outputting to obtain a virtual reward corresponding to the next time state;

inputting the next time state into the discriminator, and outputting to obtain the probability of the next time state accessed by each agent; and

and generating the weight aiming at the virtual reward according to the probability of the current agent accessing the current agent at the next moment and the average access probability.

4. The method of claim 3, wherein the total number of agents isNThe probability of the next moment state being accessed by the current agent is expressed as

Wherein the weight of the virtual reward

The following expression is satisfied:

。

5. the method of claim 3, wherein updating reward information in the search data based on the virtual reward and a weight for the virtual reward comprises:

performing weighted calculation on the virtual rewards and the weights aiming at the virtual rewards correspondingly to obtain virtual reward information;

and adding and calculating the reward information and the virtual reward information in the search data to obtain updated reward information.

6. The method of claim 1, wherein the correspondingly updating the search strategy model of the agent comprises:

taking the search data containing updated reward information for each agent as the input of a search strategy model of the current agent, and updating the parameters of the search strategy model based on an operator-critic algorithm in a deep reinforcement learning network;

the search strategy model comprises a strategy network and a value network, wherein the input of the strategy network is the current state, and the output of the strategy network is the current search action aiming at the current state; the value network is used for predicting the probability of completing the search task according to the current state;

updating the parameters of the search strategy model comprises: and updating the parameters of the policy network and the value network.

7. The method of claim 2, wherein the arbiter comprises a neural network model

The virtual prize generator comprising: parameter(s)Randomly initialized and fixed parameter target network

And parametric trainable predictive networks

；

Wherein the updating the virtual reward model comprises:

updating the parameters of the discriminator based on a first loss function by taking the state information in the updated search data as the input of the discriminator;

wherein the first loss function is expressed as

，

The following expression is satisfied:

，

updating parameters of the virtual reward generator based on a second loss function by taking the state information in the updated search data as the input of the virtual reward generator;

wherein the second loss function is expressed as

，

The following expression is satisfied:

。

8. the method of claim 1, wherein the obtaining search data for searching the image simulation environment by a plurality of agents in an initialization state comprises:

a current state given by the image simulation environment for each agent of the plurality of agents in the initialization states _tAs input to the current agent, the current agent output and the current states _tCorresponding search actiona _t；

The image simulation environment is based on the current states _tAnd corresponding search actiona _tOutput the state of the next times _t+1The reward information obtained by the current intelligent agentr _tAnd a termination identifierd _t；

Iteration is carried out based on the time sequence to obtain a data group sequence distributed according to the time sequence for each agent, wherein the data group sequence is in a six-tuple form: (s _t，a _t，r _t，d _t，s _t+1Z), wherein z represents the number of the agent, the value of z is 1,2,3, … …, N, and N represents the total number of the agent.

9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method of any one of claims 1 to 8 when executing a program stored on a memory.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-8.