CN116128028A - Efficient deep reinforcement learning algorithm for continuous decision space combination optimization - Google Patents


Info

Publication number
CN116128028A
CN116128028A (application number CN202310191943.6A)
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
optimization
continuous decision
combination optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310191943.6A
Other languages
Chinese (zh)
Inventor
韩莉
丁南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202310191943.6A priority Critical patent/CN116128028A/en
Publication of CN116128028A publication Critical patent/CN116128028A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

An efficient deep reinforcement learning algorithm for continuous decision space combination optimization comprises the following steps. Step one: model the problem as a sequential decision problem and define the deep reinforcement learning framework elements required for the continuous decision combination optimization problem. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: obtain the optimal solution of the continuous decision combination optimization problem. The invention reduces the environment-interaction cost of the agent and is intended to solve the continuous decision space combination optimization problem in time-series tasks through effective optimal action space search in deep reinforcement learning and reward computation by probabilistic dynamic programming. On this basis, the invention has good application prospects.

Description

Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to an efficient deep reinforcement learning algorithm for continuous decision space combination optimization.
Background
During deep reinforcement learning, an agent (e.g., an intelligent robot) constantly interacts with the environment. After the agent obtains a state from the environment, it outputs an action (also called a decision) based on that state; the action is then executed in the environment, and the environment outputs the next state and the reward brought by the current action according to the action taken by the agent, so that the agent can obtain as much reward from the environment as possible. Deep reinforcement learning uses the strong representational capability of neural networks to fit the agent's policy model and value model, which greatly improves the ability to solve complex problems; it has made great progress on various intelligent decision-making problems in recent years and has become a rapidly developing branch of the field of artificial intelligence. The combination optimization problems involved in deep reinforcement learning are in most cases decision sequences, i.e., sequential decision problems, such as deciding in what order to visit each city in the TSP (traveling salesman problem) and deciding in what order to machine the workpieces in the job shop scheduling problem. The recurrent neural network inside a deep neural network can perform exactly this mapping from one sequence to another sequence, so using a recurrent neural network to solve the combination optimization problem directly is a feasible scheme. Another technical scheme adopts reinforcement learning: reinforcement learning is naturally suited to making sequential decisions, so it can directly solve the sequential decision problem inside the combination optimization problem; the technical difficulty is how to define the state and the reward.
Although deep learning has strong perceptual capability, it lacks decision-making capability; reinforcement learning, while having decision-making capability, handles perception problems poorly. Combining the two therefore makes their advantages complementary and provides a solution idea for the perception-and-decision problem of complex systems; this is why deep reinforcement learning has become so popular. Deep reinforcement learning combines the perception capability of deep learning with the decision capability of reinforcement learning and is an artificial intelligence method closer to the human way of thinking. Early reinforcement learning developed from trial-and-error learning and imitates the learning behaviors of humans and animals; such algorithms are only suitable for handling simple decision-making problems such as maze navigation and simple board games. RL (reinforcement learning) in the modern sense is a solution algorithm for optimal control problems, especially for sequential decision tasks in stochastic environments; in principle it requires the problem to have the Markov property, i.e., that the state be separable, which is the key to applying the Bellman principle. If a deep reinforcement learning method is used to solve a sequential decision combination optimization problem in a stochastic environment, the neural network model generally needs a large amount of interaction with the environment, which is extremely time-consuming and full of noise.
At present, traditional deep reinforcement learning algorithms have difficulty quickly solving complex long-horizon time-series tasks. For example, when the DQN algorithm is used to solve Atari 2600 games, the observations obtained by the agent during play are not independent and identically distributed: the previous frame and the next frame are strongly correlated, so the collected data are correlated time-series data that cannot satisfy the independent-and-identically-distributed assumption. Such games have relatively simple task scenarios, short decision horizons, small decision spaces, and low problem complexity; if DQN is applied to more complex time-series tasks, its performance drops noticeably.
Disclosure of Invention
In order to overcome the defects that existing deep reinforcement learning algorithms cannot effectively solve the continuous decision space combination optimization problem in time-series tasks and that technical limitations make continuous decision-making on time-series tasks inefficient, the invention provides an efficient deep reinforcement learning algorithm for continuous decision space combination optimization. The time-series task is modeled with a Markov decision process so that the state information is separated into an environment state and a decision state; at the same time, an effective reward-expectation calculation method is designed through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm, thereby providing favorable technical support for the development of deep reinforcement learning algorithms.
The technical scheme adopted for solving the technical problems is as follows:
An efficient deep reinforcement learning algorithm for continuous decision space combination optimization is characterized in that the algorithm flow framework is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward of the combination optimization task changes accordingly. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy; the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization, and the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can generate favorable behaviors and obtain the maximum reward of the combination optimization task. The method specifically comprises the following steps. Step one: model the problem as a sequential decision problem in deep reinforcement learning for the time-series task; to support the analysis of the large number of interactions between the neural network model and the environment, first define the deep reinforcement learning framework elements of the continuous decision combination optimization problem, the element definitions comprising states, actions, and rewards. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: optimize the network parameters through gradient updates so as to obtain the optimal expected return step by step and finally obtain the optimal solution of the continuous decision combination optimization problem.
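To make the A/B/C flow above concrete, the following Python sketch outlines one possible shape of the agent-environment interaction loop. The toy environment, the plain feed-forward policy, the Dirichlet sampling head (used here only so that the sampled weight action stays on the probability simplex), and all hyperparameters are illustrative assumptions, not the patented implementation; the sketch only mirrors the loop structure described in A to C.

```python
import torch
import torch.nn as nn


class ToyCombOptEnv:
    """Toy stand-in for the combination optimization environment (illustrative only)."""

    def __init__(self, n_tasks=5, horizon=20):
        self.n_tasks, self.horizon = n_tasks, horizon

    def reset(self):
        self.t = 0
        self.series = torch.randn(self.n_tasks)                     # limited time-series features
        weights = torch.full((self.n_tasks,), 1.0 / self.n_tasks)   # equal initial weights
        return torch.cat([self.series, weights])

    def step(self, action):
        reward = float(torch.dot(action, self.series))  # toy reward under the new weights
        self.t += 1
        self.series = torch.randn(self.n_tasks)
        done = self.t >= self.horizon
        return torch.cat([self.series, action]), reward, done


n_tasks = 5
env = ToyCombOptEnv(n_tasks)
policy = nn.Sequential(nn.Linear(2 * n_tasks, 32), nn.Tanh(), nn.Linear(32, n_tasks))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(3):                               # a few illustrative episodes
    state, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        logits = policy(state)
        dist = torch.distributions.Dirichlet(torch.softmax(logits, dim=-1) * 10 + 1e-3)
        action = dist.sample()                          # weight-adjustment action, sums to 1
        state, reward, done = env.step(action)
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # B: use the rewards of the trajectory to update the policy network by gradient descent.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)   # undiscounted returns per step
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```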
Further, in step A, the agent's policy network is composed of a recurrent neural network layer and a fully connected layer, and a long short-term memory (LSTM) network serves as the deep feature representation learning module of the policy network.
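A minimal sketch of a policy network of the kind described here (an LSTM feature module followed by a fully connected layer with a softmax output over the N combination weights) is shown below; the layer sizes and names are illustrative assumptions rather than the patent's specification.

```python
import torch
import torch.nn as nn


class LSTMPolicyNetwork(nn.Module):
    """Recurrent policy: LSTM feature extractor + fully connected head + softmax output."""

    def __init__(self, feature_dim: int, hidden_dim: int, n_tasks: int):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)  # deep feature module
        self.head = nn.Linear(hidden_dim, n_tasks)                      # fully connected layer

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, time, feature_dim) window of limited time-series data
        h_seq, _ = self.lstm(series)
        h_L = h_seq[:, -1, :]                        # last hidden state h_L(t)
        return torch.softmax(self.head(h_L), -1)     # action a(t): new weights, sum to 1


policy = LSTMPolicyNetwork(feature_dim=8, hidden_dim=32, n_tasks=5)
weights = policy(torch.randn(1, 16, 8))              # one window of 16 time steps
print(weights, weights.sum(dim=-1))                  # each row sums to 1
```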
Further, in step one, the state space is defined as s(t) = {s_e(t), s_w(t)}; the combined-optimization weight state involved in the state space is s_w(t) = (w_1(t), w_2(t), ..., w_N(t)) ∈ R^N; and the combination weights satisfy Σ_{i=1..N} w_i(t) = 1 at every moment.
Further, in step one, in the action space, the next state of the combination weight of task i is expressed as w_i(t+1) = a_i(t), the i-th component of the softmax-mapped output of the policy network at time t.
Further, in step one, the reward function is denoted r(t); the reward function of the task combination of the set N is expressed as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t.
Further, in step two, the training objective of the deep reinforcement learning is defined as the discounted return G(t) = Σ_{k≥0} γ^k R(t+k), where γ is the discount factor; the value function V(t) involved is expressed as V(t) = E[G(t) | s(t)].
Further, step three involves calculating the expectation J(θ) of the return G(t) of step two, J(θ) = E_{a(t)~π_θ(·|s(t))}[G(t)].
Further, in step four, the expected gradient estimator is derived; summing over time t gives ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where b(t) is a baseline.
The invention has the following beneficial effects: it directly derives the policy weights in the action space guided by maximum cumulative return, models the time-series task with a Markov decision process, and separates the state information into an environment state and a decision state so that the agent can reduce its environment-interaction cost; it also designs an effective reward-expectation calculation method through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm. The invention aims to solve the continuous decision space combination optimization problem in time-series tasks through effective optimal action space search in deep reinforcement learning and reward computation by probabilistic dynamic programming. On this basis, the invention has good application prospects.
Drawings
The invention will be further described with reference to the drawings and examples.
Fig. 1 is a block diagram of the architecture of the present invention.
FIG. 2 is a block flow diagram of an efficient deep reinforcement learning algorithm for continuous decision space combinatorial optimization in accordance with the present invention.
Detailed Description
As shown in Figs. 1 and 2, the algorithm flow framework of the efficient deep reinforcement learning algorithm for continuous decision space combination optimization is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward return of the combination optimization task changes accordingly; in particular, the agent's policy network consists of a recurrent neural network layer and a fully connected layer. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy, and the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization; at the same time, the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can produce favorable behaviors and obtain the maximum reward of the combination optimization task.
As shown in Figs. 1 and 2, the efficient deep reinforcement learning algorithm for continuous decision space combination optimization specifically comprises four steps. Step 1: in practical applications, modeling a time-series task in deep reinforcement learning usually requires modeling the problem as a sequential decision problem; since the neural network model must interact with the environment a large number of times, the deep reinforcement learning framework element definitions of the continuous decision combination optimization problem must be obtained first (the element definitions comprise states, actions, and rewards). Step 1 lays the foundation for step 2: because the invention uses deep reinforcement learning, the problem must be analyzed and summarized to abstract the corresponding element definitions, namely the state space, the action space, and the reward function, and their rationality and correctness must be ensured. (The policy network in deep reinforcement learning randomly samples an action from the action space according to the state at the current moment; the action interacts with the environment; the environment evaluates the action according to the defined reward function and returns the corresponding reward value; and the policy network receives the environment's evaluation of the action. This is the interaction process. The state is then updated at the next moment and the operation is repeated, giving a process of continuous decisions and continuous interaction.) In this step, the problem considered is a combination of N subtasks over a period T, each of which is a time-series task optimization problem, and there are N such identical problems in the period T. The core problem of the invention is to combine the N subtasks and optimize them jointly in a combination optimization manner, which requires assigning a dynamic weight to each subtask (the weight changes dynamically and represents the proportion of the subtask within the whole combination; the weights sum to 1, and the detailed concept is explained in the state space below). Each subtask generates a reward during the deep reinforcement learning optimization process, and the task goal of the combination optimization is to maximize the sum of the rewards.
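As a concrete illustration of the dynamic weights just described, the snippet below shows the equal initialization and the sum-to-one constraint; N = 5 is an arbitrary example value, not taken from the patent.

```python
import numpy as np

N = 5                               # number of subtasks (illustrative value)
w = np.full(N, 1.0 / N)             # weights start equal at the first time stamp
assert np.isclose(w.sum(), 1.0)     # the combination weights always sum to 1
```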
As shown in Figs. 1 and 2, in step 1 the state space is defined. (The state information of deep reinforcement learning represents the environment information perceived by the agent and the changes caused by the agent's actions; it is the basis on which the agent makes decisions and evaluates its long-term return, and the quality of the state design directly determines whether the deep reinforcement learning algorithm converges, how fast it converges, and its final performance. The state space is the set containing all the designed state information.) The state is written as s(t) = {s_e(t), s_w(t)}, where s_e(t) describes the agent's perception of the environment, including the features of the time-series task data, and the weight state of the data combination optimization is recorded as s_w(t) = (w_1(t), ..., w_N(t)), where w_i(t) is the combination weight of subtask i in the time-series task combination optimization problem. The combination weights are initialized to be equal at the beginning of time stamp t, and their sum is always 1, i.e., Σ_{i=1..N} w_i(t) = 1; s_w(t) ∈ R^N means that s_w(t) is an N-dimensional numerical vector. In the environment modeling of the invention, it is assumed that s_e(t) does not change with different actions, whereas s_w(t) changes according to the agent's policy actions.
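To make the separation between the static environment state s_e(t) and the dynamic weight state s_w(t) concrete, the following sketch shows one possible container for the state s(t); the class and field names (`CombOptState`, `env_features`, `weights`) are illustrative and not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class CombOptState:
    """State s(t) = {s_e(t), s_w(t)} of the combination optimization MDP."""
    env_features: np.ndarray   # s_e(t): perceived time-series features, unaffected by actions
    weights: np.ndarray        # s_w(t): current combination weights, shape (N,), sums to 1

    def __post_init__(self):
        assert np.isclose(self.weights.sum(), 1.0), "combination weights must sum to 1"
```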
In step 1, as shown in Figs. 1 and 2, the action space A is defined (the action space of the continuous decision combination optimization problem of the invention, comprising the set of all actions that can be sampled in the task). In this step, the action a(t) corresponds to the change of the combination weight state from s_w(t), determined at the beginning of time t, to s_w(t+1) at the end of time t. Because s_w(t) and w(t) must satisfy the constraint in the state space, this step uses a softmax function as the mapping in the last layer of the policy network; the softmax mapping guarantees that the weight proportions of all subtasks sum to 1 at every moment while satisfying the constraint on the LSTM output in the policy network. The softmax-mapped value of the LSTM output h_L(t) is taken as the action a(t) at time t, so for task i the next state of the combination weight can be expressed as w_i(t+1) = a_i(t) = softmax(h_L(t))_i.
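A minimal sketch of this softmax mapping from the LSTM output to the new combination weights is shown below; the tensor shapes and the value N = 5 are illustrative assumptions.

```python
import torch

def next_weights(h_L: torch.Tensor) -> torch.Tensor:
    """Map the LSTM output h_L(t) (shape (N,)) to the action a(t),
    which becomes the next combination-weight state s_w(t+1)."""
    a_t = torch.softmax(h_L, dim=-1)   # guarantees the N weights sum to 1
    return a_t

h_L = torch.randn(5)                   # illustrative LSTM output for N = 5 subtasks
w_next = next_weights(h_L)
print(w_next, w_next.sum())            # sums to 1 by construction
```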
As shown in Figs. 1 and 2, in step 1 the reward function is denoted r(t). (At each moment, the agent selects an action in the current state according to its policy; the environment responds to the action, transitions to the new state, and generates a reward signal, which is typically a numerical value computed by the reward function.) The reward function is the incentive that guides the agent to learn a favorable policy: the environment evaluates the action a(t) produced by the policy network through the reward function and generates the reward value corresponding to that action. The reward function of the task combination over the set of N subtasks can then be obtained as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t. The discounted sum of rewards, called the return, is the quantity the agent seeks to maximize during action selection.
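The return is this discounted sum of rewards; the helper below is a generic sketch of that computation (the function name and the discount value are illustrative, not from the patent).

```python
def discounted_return(rewards, gamma=0.99):
    """Return G(t) for every t, where G(t) = R(t) + gamma * G(t+1)."""
    G = 0.0
    returns = []
    for r in reversed(rewards):        # accumulate from the last step backwards
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```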
In step 2, as shown in Figs. 1 and 2, the continuous decision space combination optimization problem in the time-series task is modeled as a Markov decision process through the definitions of the actions and rewards of the deep reinforcement learning framework elements in step 1, and the training objective of the deep reinforcement learning is obtained. Specifically, the training objective is defined as the discounted return G(t) = R(t) + γ·G(t+1) = Σ_{k≥0} γ^k R(t+k), a variant of the recursive reward function of the Markov decision process (MDP), where γ is the discount factor; for simplicity, G(t) denotes the return and R(t) denotes the combined reward at time t. The value function V(t) can then be expressed as V(t) = E[G(t) | s(t)]. In this step, the problem must be modeled as a Markov decision process because reinforcement learning is the interaction between the agent and the environment: the agent takes an action after obtaining the state of the environment and returns the action taken to the environment; after receiving the agent's action, the environment enters the next state and generates a reward, which is passed back to the agent together with the next state. This interaction process can be represented by a Markov decision process, so the Markov decision process is the basic framework of reinforcement learning; without this step, deep reinforcement learning techniques could not be used to solve the optimization problem.
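For reference, the recursive structure that step 3 builds on can be written out explicitly. This is a standard derivation using the symbols above; the conditioning on s(t) and the use of the policy expectation (valid here because the weight transition is deterministic, P = 1) are assumptions made explicit rather than formulas reproduced from the patent figures.

```latex
% Return and Bellman expectation backup linking step 2 to step 3
\begin{aligned}
G(t) &= R(t) + \gamma\, G(t+1) \;=\; \sum_{k \ge 0} \gamma^{k} R(t+k), \\
V(t) &= \mathbb{E}\big[\, G(t) \mid s(t) \,\big]
      \;=\; \mathbb{E}_{a(t) \sim \pi_{\theta}(\cdot \mid s(t))}
        \big[\, R(t) + \gamma\, V(t+1) \,\big].
\end{aligned}
```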
In step 3, as shown in Figs. 1 and 2, the expected value of the training objective G(t) is calculated with a probabilistic dynamic programming algorithm. Specifically, given the state s(t) at the start of time t, the action policy is defined as π_θ(a(t) | s(t)), the output of the policy neural network, where θ denotes the parameters of the policy neural network; the probabilistic dynamic programming algorithm is then used to calculate the expectation J(θ) of G(t), J(θ) = E_{a(t)~π_θ(·|s(t))}[R(t) + γ·V(t+1)]. This formula is a variant of the Bellman expectation backup in which the state transition probability P(s) equals one, because the action a(t) deterministically transfers s_w(t) to s_w(t+1). In step 3, an effective reward-expectation calculation method is designed through probabilistic dynamic programming, so that, by avoiding trajectory-sampling-based techniques, the agent can collect environment feedback more effectively in the learning paradigm provided by the invention. Traditional methods for calculating the reward expectation must repeatedly sample trajectories in order to fit the expected value, which is very time-consuming and greatly reduces algorithm efficiency; because the invention computes the expected reward through probabilistic dynamic programming, the efficiency of the technical scheme is embodied and the efficiency of the deep reinforcement learning algorithm is greatly improved.
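The sketch below illustrates the idea of computing the expected return by a backward dynamic programming pass over the policy's action distribution instead of sampling trajectories. It assumes a small discrete set of candidate actions with known per-action rewards at each step and, for simplicity, that the future value does not depend on which action is taken at the current step; the names `action_probs` and `action_rewards` are illustrative, and the real method operates on the patent's continuous weight actions.

```python
import numpy as np

def expected_return_dp(action_probs, action_rewards, gamma=0.99):
    """Expected return J = E[G(0)] computed backwards in time.

    action_probs[t][k]   -- probability pi_theta(a_k | s(t)) of candidate action k at step t
    action_rewards[t][k] -- reward R(t) obtained when action k is taken at step t
    The state transition is deterministic (P = 1), so the expectation is over actions only.
    """
    T = len(action_probs)
    V_next = 0.0                                   # V(T) = 0 beyond the horizon
    for t in reversed(range(T)):
        probs = np.asarray(action_probs[t])
        rewards = np.asarray(action_rewards[t])
        # Bellman expectation backup: V(t) = sum_k pi(a_k | s(t)) * (R_k(t) + gamma * V(t+1))
        V_next = float(np.sum(probs * (rewards + gamma * V_next)))
    return V_next                                  # J(theta) = V(0)

# Toy usage: 3 time steps, 2 candidate actions per step
probs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]
rews  = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
print(expected_return_dp(probs, rews, gamma=0.9))
```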
In step 4, the network parameters are optimized through gradient updates so as to obtain the optimal expected return step by step and finally the optimal solution of the continuous decision combination optimization problem. The main objective of this step is to train the policy network by choosing the optimal parameters θ that achieve the optimal expected return E(G(t) | π, θ), i.e., the expected value of G(t) mentioned above; the parameters are then updated so that the expectation of G(t) grows larger and larger and the overall return G(t) is further improved. With η denoting the learning rate, the update is θ ← θ + η·∇_θ J(θ). Since the action-space transformation above is an exponential (softmax) function, the expected gradient estimator can be derived; summing over time t gives ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where R(t) is the observed reward and G(t) is the return of the action trajectory, which may have a large variance, so this step subtracts a baseline b(t) from the policy gradient to reduce the variance. The role of step 4 is that the technical scheme of the invention can derive an expected gradient estimator, optimize the network parameters of the policy neural network through the gradient-update optimization algorithm, and finally, through continuous optimization, obtain the optimal expected value, i.e., the optimal solution of the problem.
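A minimal PyTorch-style sketch of one such gradient step with a baseline is shown below. It is written as a generic policy-gradient update rather than the patent's exact estimator; the function name, the simple mean-return baseline, and the toy usage with a single parameter vector are illustrative assumptions.

```python
import torch

def policy_gradient_step(optimizer, log_probs, returns):
    """One update that maximizes E[sum_t log pi(a(t)|s(t)) * (G(t) - b)].

    log_probs -- list of log pi_theta(a(t)|s(t)) tensors collected over an episode
    returns   -- list of returns G(t), one per time step
    """
    returns = torch.tensor(returns, dtype=torch.float32)
    baseline = returns.mean()                             # b: constant baseline to cut variance
    advantages = returns - baseline
    loss = -(torch.stack(log_probs) * advantages).sum()   # negative because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a single learnable parameter vector standing in for the policy network
theta = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.Adam([theta], lr=1e-2)
logp = [torch.log_softmax(theta, dim=0)[i] for i in (0, 2, 1)]   # pretend sampled actions
print(policy_gradient_step(opt, logp, returns=[2.62, 1.8, 2.0]))
```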
As shown in Figs. 1 and 2, the technical solution of the invention uses the widely used LSTM network as the Policy Network; specifically, the time-series state is provided to the LSTM, all weights are trained in the LSTM layer using the same structure, and the weights are shared during parameter optimization. In general, the technical scheme provided by the invention separates the environment into a static environment state and a dynamic weight state, which differs from the traditional method (assigning weights after prediction). At the same time, by designing a probabilistic dynamic programming method that requires no trajectory sampling, the least efficient process in deep reinforcement learning, namely estimating the reward expectation from agent interaction, is overcome, so that the efficiency of the deep reinforcement learning algorithm is greatly improved and a good optimization effect is achieved.
It should be understood by those skilled in the art that although the present disclosure describes embodiments, each embodiment does not contain only a single independent technical solution; the description is presented in this way only for clarity. Those skilled in the art should take the disclosure as a whole, and the technical solutions in the embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art. The scope of the present disclosure is defined by the claims.

Claims (8)

1. An efficient deep reinforcement learning algorithm for continuous decision space combination optimization, characterized in that the algorithm flow framework is as follows. A: the DRL agent takes limited time-series data and the combination weight state as input and generates a weight-adjustment operation, so that the reward of the combination optimization task changes accordingly. B: after receiving the corresponding action, the environment generates the reward corresponding to each policy; the rewards of multiple action trajectories are used to update the parameters of the policy network through gradient descent optimization, and at the same time the environment feeds the combination state back to the DRL agent for subsequent training. C: the DRL agent learns from the evaluations of multiple interactions with the environment so that the policy network can produce favorable behaviors and thereby obtain the maximum reward of the combination optimization task. The method specifically comprises the following steps. Step one: model the problem as a sequential decision problem in deep reinforcement learning for the time-series task; to support the analysis of the large number of interactions between the neural network model and the environment, first define the deep reinforcement learning framework elements of the continuous decision combination optimization problem, the element definitions comprising states, actions, and rewards. Step two: through the deep reinforcement learning framework element definitions of step one, model the continuous decision space combination optimization problem in the time-series task as a Markov decision process and obtain the training objective of the deep reinforcement learning. Step three: calculate the expected value of the training objective G(t) with a probabilistic dynamic programming algorithm. Step four: optimize the network parameters through gradient updates so as to obtain the optimal expected return step by step and finally obtain the optimal solution of the continuous decision combination optimization problem.
2. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in A, the agent's policy network consists of a recurrent neural network layer and a fully connected layer, and a long short-term memory network is used as the deep feature representation learning module of the policy network.
3. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one the state space is s(t) = {s_e(t), s_w(t)}, the combined-optimization weight state involved in the state space is s_w(t) = (w_1(t), ..., w_N(t)) ∈ R^N, and the combination weights satisfy Σ_{i=1..N} w_i(t) = 1.
4. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one, in the action space, the next state of the combination weight of task i is expressed as w_i(t+1) = a_i(t), the i-th component of the softmax-mapped output of the policy network at time t.
5. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step one the reward function is denoted r(t), and the reward function of the task combination of the set N is expressed as R(t) = Σ_{i=1..N} r_i(t), where r_i(t) is the reward of subtask i at time t.
6. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step two the training objective of the deep reinforcement learning is defined as the discounted return G(t) = Σ_{k≥0} γ^k R(t+k), where γ is the discount factor, and the value function V(t) involved is expressed as V(t) = E[G(t) | s(t)].
7. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein step three involves calculating the expectation J(θ) of the return G(t) of step two, J(θ) = E_{a(t)~π_θ(·|s(t))}[G(t)].
8. The efficient deep reinforcement learning algorithm for continuous decision space combination optimization according to claim 1, wherein in step four the expected gradient estimator is derived, and the summation over time t is ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a(t) | s(t)) · (G(t) - b(t))], where b(t) is a baseline.
CN202310191943.6A 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization Pending CN116128028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310191943.6A CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310191943.6A CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Publications (1)

Publication Number Publication Date
CN116128028A 2023-05-16

Family

ID=86308181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310191943.6A Pending CN116128028A (en) 2023-03-02 2023-03-02 Efficient deep reinforcement learning algorithm for continuous decision space combination optimization

Country Status (1)

Country Link
CN (1) CN116128028A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116485039B (en) * 2023-06-08 2023-10-13 中国人民解放军96901部队 Impact sequence intelligent planning method based on reinforcement learning


Similar Documents

Publication Publication Date Title
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Gronauer et al. Multi-agent deep reinforcement learning: a survey
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
CN112364984A (en) Cooperative multi-agent reinforcement learning method
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
CN112613608A (en) Reinforced learning method and related device
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN113947022B (en) Near-end strategy optimization method based on model
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
Tong et al. Enhancing rolling horizon evolution with policy and value networks
Atashbar et al. AI and macroeconomic modeling: Deep reinforcement learning in an RBC model
CN116757249A (en) Unmanned aerial vehicle cluster strategy intention recognition method based on distributed reinforcement learning
CN116663637A (en) Multi-level agent synchronous nesting training method
CN115273502B (en) Traffic signal cooperative control method
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN116456480A (en) Multi-agent collaborative decision-making method based on deep reinforcement learning under communication resource limitation
Yuan Intrinsically-motivated reinforcement learning: A brief introduction
Li et al. SparseMAAC: Sparse attention for multi-agent reinforcement learning
Zheng et al. Green Simulation Based Policy Optimization with Partial Historical Trajectory Reuse
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination