CN116128028A - Efficient deep reinforcement learning algorithm for continuous decision space combination optimization - Google Patents


Info

Publication number
CN116128028A
CN116128028A (application number CN202310191943.6A)
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
optimization
continuous decision
combination optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310191943.6A
Other languages
Chinese (zh)
Inventor
韩莉
丁南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202310191943.6A priority Critical patent/CN116128028A/en
Publication of CN116128028A publication Critical patent/CN116128028A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

An efficient deep reinforcement learning algorithm for continuous decision space combination optimization comprises the following steps. Step one: model the problem as a sequential decision problem and define the deep reinforcement learning framework elements required for the continuous decision combination optimization problem. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: obtain the optimal solution of the continuous decision combination optimization problem. The invention reduces the environment-interaction cost of the agent and is intended to solve the continuous decision space combination optimization problem in time-series tasks through effective optimal action space search in deep reinforcement learning and reward computation by probabilistic dynamic programming. On this basis, the invention has good application prospects.

Description

Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to an efficient deep reinforcement learning algorithm for continuous decision space combination optimization.
Background
During deep reinforcement learning, an agent (e.g., an intelligent robot) constantly interacts with the environment. After the agent obtains a state from the environment, it outputs an action (also called a decision) based on that state; the action is then executed in the environment, and the environment outputs the next state and the reward brought by the current action according to the action taken by the agent, so that the agent can obtain as much reward from the environment as possible. Deep reinforcement learning uses the strong representational capability of neural networks to fit the agent's policy model and value model, which greatly improves the ability to solve complex problems; it has made great progress on various intelligent decision-making problems in recent years and has become a rapidly developing branch of the field of artificial intelligence. The combination optimization problems involved in deep reinforcement learning are in most cases decision sequences, i.e., sequential decision problems, such as deciding in what order to visit each city in the TSP (traveling salesman problem) and deciding in what order to machine the workpieces in the job shop scheduling problem. The recurrent neural network inside a deep neural network can perform exactly this mapping from one sequence to another sequence, so using a recurrent neural network to solve the combination optimization problem directly is a feasible scheme. Another technical scheme adopts reinforcement learning: reinforcement learning is naturally suited to making sequential decisions, so it can directly solve the sequential decision problem inside the combination optimization problem; the technical difficulty is how to define the state and the reward.
Although deep learning has strong perceptual capability, it lacks decision-making capability; reinforcement learning, while having decision-making capability, handles perception problems poorly. Combining the two therefore makes their advantages complementary and provides a solution idea for the perception-and-decision problem of complex systems; this is why deep reinforcement learning has become so popular. Deep reinforcement learning combines the perception capability of deep learning with the decision capability of reinforcement learning and is an artificial intelligence method closer to the human way of thinking. Early reinforcement learning developed from trial-and-error learning and imitates the learning behaviors of humans and animals; such algorithms are only suitable for handling simple decision-making problems such as maze navigation and simple board games. RL (reinforcement learning) in the modern sense is a solution algorithm for optimal control problems, especially for sequential decision tasks in stochastic environments; in principle it requires the problem to have the Markov property, i.e., that the state be separable, which is the key to applying the Bellman principle. If a deep reinforcement learning method is used to solve a sequential decision combination optimization problem in a stochastic environment, the neural network model generally needs a large amount of interaction with the environment, which is extremely time-consuming and full of noise.
At present, traditional deep reinforcement learning algorithms have difficulty quickly solving complex long-horizon time-series tasks. For example, when the DQN algorithm is used to solve Atari 2600 games, the observations obtained by the agent during play are not independent and identically distributed: the previous frame and the next frame are strongly correlated, so the collected data are correlated time-series data that cannot satisfy the independent-and-identically-distributed assumption. Such games have relatively simple task scenarios, short decision horizons, small decision spaces, and low problem complexity; if DQN is applied to more complex time-series tasks, its performance drops noticeably.
Disclosure of Invention
In order to overcome the defects that existing deep reinforcement learning algorithms cannot effectively solve the continuous decision space combination optimization problem in time-series tasks and that technical limitations make continuous decision-making on time-series tasks inefficient, the invention provides an efficient deep reinforcement learning algorithm for continuous decision space combination optimization. The time-series task is modeled with a Markov decision process so that the state information is separated into an environment state and a decision state; at the same time, an effective reward-expectation calculation method is designed through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm, thereby providing favorable technical support for the development of deep reinforcement learning algorithms.
The technical scheme adopted for solving the technical problems is as follows:
An efficient deep reinforcement learning algorithm for continuous decision space combination optimization is characterized in that the algorithm flow framework is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward of the combination optimization task changes accordingly. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy; the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization, and the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can generate favorable behaviors and obtain the maximum reward of the combination optimization task. The method specifically comprises the following steps. Step one: model the problem as a sequential decision problem in deep reinforcement learning for the time-series task; to support the analysis of the large number of interactions between the neural network model and the environment, first define the deep reinforcement learning framework elements of the continuous decision combination optimization problem, the element definitions comprising states, actions, and rewards. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: optimize the network parameters through gradient updates so as to obtain the optimal expected return step by step and finally obtain the optimal solution of the continuous decision combination optimization problem.
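To make the A/B/C flow above concrete, the following Python sketch outlines one possible shape of the agent-environment interaction loop. The toy environment, the plain feed-forward policy, the Dirichlet sampling head (used here only so that the sampled weight action stays on the probability simplex), and all hyperparameters are illustrative assumptions, not the patented implementation; the sketch only mirrors the loop structure described in A to C.

```python
import torch
import torch.nn as nn


class ToyCombOptEnv:
    """Toy stand-in for the combination optimization environment (illustrative only)."""

    def __init__(self, n_tasks=5, horizon=20):
        self.n_tasks, self.horizon = n_tasks, horizon

    def reset(self):
        self.t = 0
        self.series = torch.randn(self.n_tasks)                     # limited time-series features
        weights = torch.full((self.n_tasks,), 1.0 / self.n_tasks)   # equal initial weights
        return torch.cat([self.series, weights])

    def step(self, action):
        reward = float(torch.dot(action, self.series))  # toy reward under the new weights
        self.t += 1
        self.series = torch.randn(self.n_tasks)
        done = self.t >= self.horizon
        return torch.cat([self.series, action]), reward, done


n_tasks = 5
env = ToyCombOptEnv(n_tasks)
policy = nn.Sequential(nn.Linear(2 * n_tasks, 32), nn.Tanh(), nn.Linear(32, n_tasks))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(3):                               # a few illustrative episodes
    state, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        logits = policy(state)
        dist = torch.distributions.Dirichlet(torch.softmax(logits, dim=-1) * 10 + 1e-3)
        action = dist.sample()                          # weight-adjustment action, sums to 1
        state, reward, done = env.step(action)
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # B: use the rewards of the trajectory to update the policy network by gradient descent.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)   # undiscounted returns per step
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```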
Further, in step A, the agent's policy network is composed of a recurrent neural network layer and a fully connected layer, and a long short-term memory (LSTM) network serves as the deep feature representation learning module of the policy network.
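A minimal sketch of a policy network of the kind described here (an LSTM feature module followed by a fully connected layer with a softmax output over the N combination weights) is shown below; the layer sizes and names are illustrative assumptions rather than the patent's specification.

```python
import torch
import torch.nn as nn


class LSTMPolicyNetwork(nn.Module):
    """Recurrent policy: LSTM feature extractor + fully connected head + softmax output."""

    def __init__(self, feature_dim: int, hidden_dim: int, n_tasks: int):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)  # deep feature module
        self.head = nn.Linear(hidden_dim, n_tasks)                      # fully connected layer

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, time, feature_dim) window of limited time-series data
        h_seq, _ = self.lstm(series)
        h_L = h_seq[:, -1, :]                        # last hidden state h_L(t)
        return torch.softmax(self.head(h_L), -1)     # action a(t): new weights, sum to 1


policy = LSTMPolicyNetwork(feature_dim=8, hidden_dim=32, n_tasks=5)
weights = policy(torch.randn(1, 16, 8))              # one window of 16 time steps
print(weights, weights.sum(dim=-1))                  # each row sums to 1
```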
Further, in step one, the state space is defined as s(t) = {s_e(t), s_w(t)}; the combined-optimization weight state involved in the state space is s_w(t) = (w_1(t), w_2(t), ..., w_N(t)) ∈ R^N; and the combination weights satisfy Σ_{i=1..N} w_i(t) = 1 at every moment.
Further, in step one, in the action space, the next state of the combination weight of task i is expressed as w_i(t+1) = a_i(t), the i-th component of the softmax-mapped output of the policy network at time t.
Further, in step one, the reward function is denoted r(t); the reward function of the task combination of the set N is expressed as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t.
Further, in step two, the training objective of the deep reinforcement learning is defined as the discounted return G(t) = Σ_{k≥0} γ^k R(t+k), where γ is the discount factor; the value function V(t) involved is expressed as V(t) = E[G(t) | s(t)].
Further, step three involves calculating the expectation J(θ) of the return G(t) of step two, J(θ) = E_{a(t)~π_θ(·|s(t))}[G(t)].
Further, in step four, the expected gradient estimator is derived; summing over time t gives ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where b(t) is a baseline.
The invention has the following beneficial effects: it directly derives the policy weights in the action space guided by maximum cumulative return, models the time-series task with a Markov decision process, and separates the state information into an environment state and a decision state so that the agent can reduce its environment-interaction cost; it also designs an effective reward-expectation calculation method through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm. The invention aims to solve the continuous decision space combination optimization problem in time-series tasks through effective optimal action space search in deep reinforcement learning and reward computation by probabilistic dynamic programming. On this basis, the invention has good application prospects.
Drawings
The invention will be further described with reference to the drawings and examples.
Fig. 1 is a block diagram of the architecture of the present invention.
FIG. 2 is a block flow diagram of an efficient deep reinforcement learning algorithm for continuous decision space combinatorial optimization in accordance with the present invention.
Detailed Description
As shown in Figs. 1 and 2, the algorithm flow framework of the efficient deep reinforcement learning algorithm for continuous decision space combination optimization is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward return of the combination optimization task changes accordingly; in particular, the agent's policy network consists of a recurrent neural network layer and a fully connected layer. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy, and the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization; at the same time, the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can produce favorable behaviors and obtain the maximum reward of the combination optimization task.
As shown in Figs. 1 and 2, the efficient deep reinforcement learning algorithm for continuous decision space combination optimization specifically comprises four steps. Step 1: in practical applications, modeling a time-series task in deep reinforcement learning usually requires modeling the problem as a sequential decision problem; since the neural network model must interact with the environment a large number of times, the deep reinforcement learning framework element definitions of the continuous decision combination optimization problem must be obtained first (the element definitions comprise states, actions, and rewards). Step 1 lays the foundation for step 2: because the invention uses deep reinforcement learning, the problem must be analyzed and summarized to abstract the corresponding element definitions, namely the state space, the action space, and the reward function, and their rationality and correctness must be ensured. (The policy network in deep reinforcement learning randomly samples an action from the action space according to the state at the current moment; the action interacts with the environment; the environment evaluates the action according to the defined reward function and returns the corresponding reward value; and the policy network receives the environment's evaluation of the action. This is the interaction process. The state is then updated at the next moment and the operation is repeated, giving a process of continuous decisions and continuous interaction.) In this step, the problem considered is a combination of N subtasks over a period T, each of which is a time-series task optimization problem, and there are N such identical problems in the period T. The core problem of the invention is to combine the N subtasks and optimize them jointly in a combination optimization manner, which requires assigning a dynamic weight to each subtask (the weight changes dynamically and represents the proportion of the subtask within the whole combination; the weights sum to 1, and the detailed concept is explained in the state space below). Each subtask generates a reward during the deep reinforcement learning optimization process, and the task goal of the combination optimization is to maximize the sum of the rewards.
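As a concrete illustration of the dynamic weights just described, the snippet below shows the equal initialization and the sum-to-one constraint; N = 5 is an arbitrary example value, not taken from the patent.

```python
import numpy as np

N = 5                               # number of subtasks (illustrative value)
w = np.full(N, 1.0 / N)             # weights start equal at the first time stamp
assert np.isclose(w.sum(), 1.0)     # the combination weights always sum to 1
```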
As shown in Figs. 1 and 2, in step 1 the state space is defined. (The state information of deep reinforcement learning represents the environment information perceived by the agent and the changes caused by the agent's actions; it is the basis on which the agent makes decisions and evaluates its long-term return, and the quality of the state design directly determines whether the deep reinforcement learning algorithm converges, how fast it converges, and its final performance. The state space is the set containing all the designed state information.) The state is written as s(t) = {s_e(t), s_w(t)}, where s_e(t) describes the agent's perception of the environment, including the features of the time-series task data, and the weight state of the data combination optimization is recorded as s_w(t) = (w_1(t), ..., w_N(t)), where w_i(t) is the combination weight of subtask i in the time-series task combination optimization problem. The combination weights are initialized to be equal at the beginning of time stamp t, and their sum is always 1, i.e., Σ_{i=1..N} w_i(t) = 1; s_w(t) ∈ R^N means that s_w(t) is an N-dimensional numerical vector. In the environment modeling of the invention, it is assumed that s_e(t) does not change with different actions, whereas s_w(t) changes according to the agent's policy actions.
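To make the separation between the static environment state s_e(t) and the dynamic weight state s_w(t) concrete, the following sketch shows one possible container for the state s(t); the class and field names (`CombOptState`, `env_features`, `weights`) are illustrative and not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class CombOptState:
    """State s(t) = {s_e(t), s_w(t)} of the combination optimization MDP."""
    env_features: np.ndarray   # s_e(t): perceived time-series features, unaffected by actions
    weights: np.ndarray        # s_w(t): current combination weights, shape (N,), sums to 1

    def __post_init__(self):
        assert np.isclose(self.weights.sum(), 1.0), "combination weights must sum to 1"
```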
In step 1, as shown in Figs. 1 and 2, the action space A is defined (the action space of the continuous decision combination optimization problem of the invention, comprising the set of all actions that can be sampled in the task). In this step, the action a(t) corresponds to the change of the combination weight state from s_w(t), determined at the beginning of time t, to s_w(t+1) at the end of time t. Because s_w(t) and w(t) must satisfy the constraint in the state space, this step uses a softmax function as the mapping in the last layer of the policy network; the softmax mapping guarantees that the weight proportions of all subtasks sum to 1 at every moment while satisfying the constraint on the LSTM output in the policy network. The softmax-mapped value of the LSTM output h_L(t) is taken as the action a(t) at time t, so for task i the next state of the combination weight can be expressed as w_i(t+1) = a_i(t) = softmax(h_L(t))_i.
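A minimal sketch of this softmax mapping from the LSTM output to the new combination weights is shown below; the tensor shapes and the value N = 5 are illustrative assumptions.

```python
import torch

def next_weights(h_L: torch.Tensor) -> torch.Tensor:
    """Map the LSTM output h_L(t) (shape (N,)) to the action a(t),
    which becomes the next combination-weight state s_w(t+1)."""
    a_t = torch.softmax(h_L, dim=-1)   # guarantees the N weights sum to 1
    return a_t

h_L = torch.randn(5)                   # illustrative LSTM output for N = 5 subtasks
w_next = next_weights(h_L)
print(w_next, w_next.sum())            # sums to 1 by construction
```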
As shown in Figs. 1 and 2, in step 1 the reward function is denoted r(t). (At each moment, the agent selects an action in the current state according to its policy; the environment responds to the action, transitions to the new state, and generates a reward signal, which is typically a numerical value computed by the reward function.) The reward function is the incentive that guides the agent to learn a favorable policy: the environment evaluates the action a(t) produced by the policy network through the reward function and generates the reward value corresponding to that action. The reward function of the task combination over the set of N subtasks can then be obtained as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t. The discounted sum of rewards, called the return, is the quantity the agent seeks to maximize during action selection.
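The return is this discounted sum of rewards; the helper below is a generic sketch of that computation (the function name and the discount value are illustrative, not from the patent).

```python
def discounted_return(rewards, gamma=0.99):
    """Return G(t) for every t, where G(t) = R(t) + gamma * G(t+1)."""
    G = 0.0
    returns = []
    for r in reversed(rewards):        # accumulate from the last step backwards
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```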
In step 2, as shown in Figs. 1 and 2, the continuous decision space combination optimization problem in the time-series task is modeled as a Markov decision process through the definitions of the actions and rewards of the deep reinforcement learning framework elements in step 1, and the training objective of the deep reinforcement learning is obtained. Specifically, the training objective is defined as the discounted return G(t) = R(t) + γ·G(t+1) = Σ_{k≥0} γ^k R(t+k), a variant of the recursive reward function of the Markov decision process (MDP), where γ is the discount factor; for simplicity, G(t) denotes the return and R(t) denotes the combined reward at time t. The value function V(t) can then be expressed as V(t) = E[G(t) | s(t)]. In this step, the problem must be modeled as a Markov decision process because reinforcement learning is the interaction between the agent and the environment: the agent takes an action after obtaining the state of the environment and returns the action taken to the environment; after receiving the agent's action, the environment enters the next state and generates a reward, which is passed back to the agent together with the next state. This interaction process can be represented by a Markov decision process, so the Markov decision process is the basic framework of reinforcement learning; without this step, deep reinforcement learning techniques could not be used to solve the optimization problem.
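For reference, the recursive structure that step 3 builds on can be written out explicitly. This is a standard derivation using the symbols above; the conditioning on s(t) and the use of the policy expectation (valid here because the weight transition is deterministic, P = 1) are assumptions made explicit rather than formulas reproduced from the patent figures.

```latex
% Return and Bellman expectation backup linking step 2 to step 3
\begin{aligned}
G(t) &= R(t) + \gamma\, G(t+1) \;=\; \sum_{k \ge 0} \gamma^{k} R(t+k), \\
V(t) &= \mathbb{E}\big[\, G(t) \mid s(t) \,\big]
      \;=\; \mathbb{E}_{a(t) \sim \pi_{\theta}(\cdot \mid s(t))}
        \big[\, R(t) + \gamma\, V(t+1) \,\big].
\end{aligned}
```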
In step 3, as shown in Figs. 1 and 2, the expected value of the training objective G(t) is calculated with a probabilistic dynamic programming algorithm. Specifically, given the state s(t) at the start of time t, the action policy is defined as π_θ(a(t) | s(t)), the output of the policy neural network, where θ denotes the parameters of the policy neural network; the probabilistic dynamic programming algorithm is then used to calculate the expectation J(θ) of G(t), J(θ) = E_{a(t)~π_θ(·|s(t))}[R(t) + γ·V(t+1)]. This formula is a variant of the Bellman expectation backup in which the state transition probability P(s) equals one, because the action a(t) deterministically transfers s_w(t) to s_w(t+1). In step 3, an effective reward-expectation calculation method is designed through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm provided by the invention. Traditional methods for calculating the reward expectation must repeatedly sample trajectories in order to fit the expected value, which is very time-consuming and greatly reduces algorithm efficiency; because the invention computes the expected reward through probabilistic dynamic programming, the efficiency of the technical scheme is embodied and the efficiency of the deep reinforcement learning algorithm is greatly improved.
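The sketch below illustrates the idea of computing the expected return by a backward dynamic programming pass over the policy's action distribution instead of sampling trajectories. It assumes a small discrete set of candidate actions with known per-action rewards at each step and, for simplicity, that the future value does not depend on which action is taken at the current step; the names `action_probs` and `action_rewards` are illustrative, and the real method operates on the patent's continuous weight actions.

```python
import numpy as np

def expected_return_dp(action_probs, action_rewards, gamma=0.99):
    """Expected return J = E[G(0)] computed backwards in time.

    action_probs[t][k]   -- probability pi_theta(a_k | s(t)) of candidate action k at step t
    action_rewards[t][k] -- reward R(t) obtained when action k is taken at step t
    The state transition is deterministic (P = 1), so the expectation is over actions only.
    """
    T = len(action_probs)
    V_next = 0.0                                   # V(T) = 0 beyond the horizon
    for t in reversed(range(T)):
        probs = np.asarray(action_probs[t])
        rewards = np.asarray(action_rewards[t])
        # Bellman expectation backup: V(t) = sum_k pi(a_k | s(t)) * (R_k(t) + gamma * V(t+1))
        V_next = float(np.sum(probs * (rewards + gamma * V_next)))
    return V_next                                  # J(theta) = V(0)

# Toy usage: 3 time steps, 2 candidate actions per step
probs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]
rews  = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
print(expected_return_dp(probs, rews, gamma=0.9))
```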
In step 4, the network parameters are optimized through gradient updates so as to obtain the optimal expected return step by step and finally the optimal solution of the continuous decision combination optimization problem. The main objective of this step is to train the policy network by choosing the optimal parameters θ that achieve the optimal expected return E(G(t) | π, θ), i.e., the expected value of G(t) mentioned above; the parameters are then updated so that the expectation of G(t) grows larger and larger and the overall return G(t) is further improved. With η denoting the learning rate, the update is θ ← θ + η·∇_θ J(θ). Since the action-space transformation above is an exponential (softmax) function, the expected gradient estimator can be derived; summing over time t gives ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where R(t) is the observed reward and G(t) is the return of the action trajectory, which may have a large variance, so this step subtracts a baseline b(t) from the policy gradient to reduce the variance. The role of step 4 is that the technical scheme of the invention can derive an expected gradient estimator, optimize the network parameters of the policy neural network through the gradient-update optimization algorithm, and finally, through continuous optimization, obtain the optimal expected value, i.e., the optimal solution of the problem.
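A minimal PyTorch-style sketch of one such gradient step with a baseline is shown below. It is written as a generic policy-gradient update rather than the patent's exact estimator; the function name, the simple mean-return baseline, and the toy usage with a single parameter vector are illustrative assumptions.

```python
import torch

def policy_gradient_step(optimizer, log_probs, returns):
    """One update that maximizes E[sum_t log pi(a(t)|s(t)) * (G(t) - b)].

    log_probs -- list of log pi_theta(a(t)|s(t)) tensors collected over an episode
    returns   -- list of returns G(t), one per time step
    """
    returns = torch.tensor(returns, dtype=torch.float32)
    baseline = returns.mean()                             # b: constant baseline to cut variance
    advantages = returns - baseline
    loss = -(torch.stack(log_probs) * advantages).sum()   # negative because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a single learnable parameter vector standing in for the policy network
theta = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.Adam([theta], lr=1e-2)
logp = [torch.log_softmax(theta, dim=0)[i] for i in (0, 2, 1)]   # pretend sampled actions
print(policy_gradient_step(opt, logp, returns=[2.62, 1.8, 2.0]))
```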
As shown in Figs. 1 and 2, the technical solution of the invention uses the widely used LSTM network as the Policy Network; specifically, the time-series state is provided to the LSTM, all weights are trained in the LSTM layer using the same structure, and the weights are shared during parameter optimization. In general, the technical scheme provided by the invention separates the environment into a static environment state and a dynamic weight state, which differs from the traditional method (assigning weights after prediction). At the same time, by designing a probabilistic dynamic programming method that requires no trajectory sampling, the least efficient process in deep reinforcement learning, namely estimating the reward expectation from agent interaction, is overcome, so that the efficiency of the deep reinforcement learning algorithm is greatly improved and a good optimization effect is achieved.
It should be understood by those skilled in the art that although the present disclosure describes embodiments, each embodiment does not contain only a single independent technical solution; the description is presented in this way only for clarity. Those skilled in the art should take the disclosure as a whole, and the technical solutions in the embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art. The scope of the present disclosure is defined by the claims.

Claims (8)

1. An efficient deep reinforcement learning algorithm for continuous decision space combination optimization, characterized in that the algorithm flow framework is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward of the combination optimization task changes accordingly. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy; the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization, and at the same time the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can produce favorable behaviors and thereby obtain the maximum reward of the combination optimization task. The method specifically comprises the following steps. Step one: model the problem as a sequential decision problem in deep reinforcement learning for the time-series task; to support the analysis of the large number of interactions between the neural network model and the environment, first define the deep reinforcement learning framework elements of the continuous decision combination optimization problem, the element definitions comprising states, actions, and rewards. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: optimize the network parameters through gradient updates so as to obtain the optimal expected return step by step and finally obtain the optimal solution of the continuous decision combination optimization problem.
2. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in A, the agent's policy network consists of a recurrent neural network layer and a fully connected layer, and a long short-term memory network is used as the deep feature representation learning module of the policy network.
3. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one the state space is s(t) = {s_e(t), s_w(t)}, the combined-optimization weight state involved in the state space is s_w(t) = (w_1(t), ..., w_N(t)) ∈ R^N, and the combination weights satisfy Σ_{i=1..N} w_i(t) = 1.
4. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one, in the action space, the next state of the combination weight of task i is expressed as w_i(t+1) = a_i(t), the i-th component of the softmax-mapped output of the policy network at time t.
5. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one the reward function is denoted r(t), and the reward function of the task combination of the set N is expressed as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t.
6. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step two the training objective of the deep reinforcement learning is defined as the discounted return G(t) = Σ_{k≥0} γ^k R(t+k), where γ is the discount factor, and the value function V(t) involved is expressed as V(t) = E[G(t) | s(t)].
7. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein step three involves calculating the expectation J(θ) of the return G(t) of step two, J(θ) = E_{a(t)~π_θ(·|s(t))}[G(t)].
8. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step four the expected gradient estimator is derived, and the summation over time t is ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where b(t) is a baseline.
CN202310191943.6A 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization Pending CN116128028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310191943.6A CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310191943.6A CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Publications (1)

Publication Number Publication Date
CN116128028A 2023-05-16

Family

ID=86308181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310191943.6A Pending CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Country Status (1)

Country Link
CN (1) CN116128028A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116485039B (en) * 2023-06-08 2023-10-13 中国人民解放军96901部队 Impact sequence intelligent planning method based on reinforcement learning


Similar Documents

Publication Publication Date Title
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Gronauer et al. Multi-agent deep reinforcement learning: a survey
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
CN112364984A (en) Cooperative multi-agent reinforcement learning method
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
CN112613608A (en) Reinforced learning method and related device
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN113947022B (en) Near-end strategy optimization method based on model
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
Tong et al. Enhancing rolling horizon evolution with policy and value networks
Atashbar et al. AI and macroeconomic modeling: Deep reinforcement learning in an RBC model
CN116757249A (en) Unmanned aerial vehicle cluster strategy intention recognition method based on distributed reinforcement learning
CN116663637A (en) Multi-level agent synchronous nesting training method
CN115273502B (en) Traffic signal cooperative control method
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN116456480A (en) Multi-agent collaborative decision-making method based on deep reinforcement learning under communication resource limitation
Yuan Intrinsically-motivated reinforcement learning: A brief introduction
Li et al. SparseMAAC: Sparse attention for multi-agent reinforcement learning
Zheng et al. Green Simulation Based Policy Optimization with Partial Historical Trajectory Reuse
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination