CN116596059A - Multi-agent reinforcement learning method based on priority experience sharing - Google Patents

Multi-agent reinforcement learning method based on priority experience sharing

Info

Publication number
CN116596059A
CN116596059A
Authority
CN
China
Prior art keywords
experience
priority
sample
agent
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310456116.5A
Other languages
Chinese (zh)
Inventor
郭鹏骏
赵岭忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202310456116.5A priority Critical patent/CN116596059A/en
Publication of CN116596059A publication Critical patent/CN116596059A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method based on priority experience sharing, which comprises the following steps: step 1, experience collection; step 2, priority calculation; and step 3, policy updating. The method uses a priority mechanism to evaluate the experience data of different agents and raises the sampling probability of the experience data that contributes most to the agents' policy learning, thereby addressing the poor learning performance caused by low-quality training samples.

Description

Multi-agent reinforcement learning method based on priority experience sharing
Technical Field
The invention relates to the technical field of multi-agent reinforcement learning, in particular to a multi-agent reinforcement learning method based on priority experience sharing.
Background
Human society has always developed by sharing experience. In modern society, people can acquire knowledge in many ways, such as reading books, watching videos and attending training. Even so, experience sharing remains a central component of learning, because it not only lets us understand knowledge more deeply but also helps us apply it better. Experience sharing is typically achieved by exchanging knowledge with peers and teachers; through this exchange we learn from other people's experiences and ideas and draw wisdom from them, which both speeds up the learning process and deepens our understanding, so that knowledge is applied more effectively.
Experience sharing is a form of information pooling that can effectively improve the learning efficiency and performance of multi-agent reinforcement learning. In classical reinforcement learning methods such as MADDPG and COMA, the policy optimization of each agent during training is independent: every agent has its own experience pool, and the overall learning efficiency is limited by the agent that learns most slowly. Learning speeds may differ between agents, which limits the learning efficiency of the whole system, because each agent updates its policy only from its own experience data, and differences between the experience data of different agents can slow some agents down and thereby affect the learning efficiency of the whole system. Experience sharing methods such as the SEAC algorithm let each agent update its own policy and value function using the experience data of other agents, realizing knowledge sharing and exchange. This approach helps the agents explore the environment and optimize decisions, thereby improving the return and performance of the overall system. However, experience sharing also has adverse effects. First, the distribution of the sample data is critical: a sparse reward environment is one in which agents rarely receive feedback rewards, with no reward information available most of the time. If the sample data contains a large amount of low-reward or redundant information, experience sharing may reduce the learning efficiency of the whole system, because low-reward or redundant information can interfere with the agents' learning: the agents cannot obtain effective feedback and reward signals and therefore cannot update their policies and value functions correctly.
Disclosure of Invention
The invention aims to solve the prior-art problem that, when a multi-agent system in a sparse reward environment stores experience and then shares it for policy updates, the low probability of obtaining useful information degrades the learning efficiency of the agent system, and provides a multi-agent reinforcement learning method based on priority experience sharing. The method uses a priority mechanism to evaluate the experience data of different agents and raises the sampling probability of the experience data that contributes most to the agents' policy learning, thereby addressing the poor learning performance caused by low-quality training samples.
The technical solution for achieving the aim of the invention is as follows:
a multi-agent reinforcement learning method based on priority experience sharing comprises the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
The specific process of experience collection in step 1 is as follows:
First, each agent observes the environment state and takes an action to explore; second, the experience data of each agent undergoes priority calculation, which judges the value of the experience, and the experience is stored in the shared experience pool; finally, experiences are sampled from the shared pool according to the sampling probabilities, and the parameters of the deep neural networks are updated for the next round of actual decision making. Each agent selects actions with its policy network and outputs them to the environment for execution, iterating until a preset number of training steps is reached or learning converges, while collecting experience from the state, action and reward information.
The specific process of priority calculation in step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
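For illustration, with hypothetical values r_i = 1, γ = 0.99, max_{a'} Q'(s_{i+1}, a') = 2, Q(s_i, a_i) = 1.5 and ε = 0.01, formula (1) gives p_i = |1| + |1 + 0.99·2 - 1.5| + 0.01 = 1 + 1.48 + 0.01 = 2.49; a larger reward or TD error raises the priority accordingly.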
the structure of the priority experience sharing algorithm is as follows: firstly, an agent observes an environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
The specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
Compared with the prior art, the present technical solution has the following characteristics:
1. The priority calculation method of the present solution considers two aspects: the reward of the experience sample and the importance of the sample to the agent.
2. Through random probability sampling of experiences, the sampling probability of high-contribution samples is increased while a degree of randomness is preserved to avoid overfitting.
3. Priorities are continuously updated during policy learning and refreshed in the shared experience pool, improving the accuracy of the experience sample weights.
Drawings
FIG. 1 is a diagram of a multi-agent priority experience sharing framework;
FIG. 2 is a flow chart of a multi-agent priority experience sharing task;
FIG. 3 is a diagram of the internal network architecture of the algorithm.
Detailed Description
The invention will now be described in further detail with reference to the drawings and specific examples, which are not intended to limit the invention thereto.
Examples:
a multi-agent reinforcement learning method based on priority experience sharing comprises the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
The specific process of experience collection in step 1 is as follows:
During multi-agent reinforcement learning training, each agent acts according to its own policy and generates corresponding trajectory data, and then performs policy updates using the data it has generated. In a multi-agent system, however, each agent explores and learns according to its own policy, and different agents may learn at different rates, so the overall learning efficiency of the system is limited by the slowest-learning agent. Experience sharing can improve the overall learning efficiency of a multi-agent system: the agents can cooperate to achieve common or individual goals, and in the process each agent can benefit from the experience of the other agents, accelerating learning and improving efficiency;
The basic idea of experience sharing is to store the experiences collected by the different agents in the environment in a unified experience pool, as shown in FIG. 1, and then to randomly draw batches of samples from the pool for training the neural networks; this improves sample diversity and avoids the negative effect that consecutive, highly correlated state sequences have on network training;
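As a minimal, non-authoritative sketch of such a unified pool, the shared storage can be pictured as below; the class name SharedReplayBuffer, the default capacity and the tuple layout are illustrative assumptions rather than details given in the patent:

```python
import numpy as np

class SharedReplayBuffer:
    """Centralized experience pool shared by all agents (illustrative sketch)."""

    def __init__(self, capacity=100000):
        self.capacity = capacity
        # each entry: (agent_id, state, action, reward, next_state, priority)
        self.storage = []

    def add(self, agent_id, state, action, reward, next_state, priority):
        """Store one prioritized transition; drop the oldest one when full."""
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append((agent_id, state, action, reward, next_state, priority))

    def priorities(self):
        """Return the current priority of every stored transition."""
        return np.array([entry[-1] for entry in self.storage], dtype=np.float64)

    def __len__(self):
        return len(self.storage)
```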
the specific flow of experience collection is shown in figure 2: firstly, each intelligent agent observes the environmental state in the environment to make action for exploration; secondly, carrying out priority calculation on experience data of each intelligent agent, judging the experience value of each intelligent agent and storing the experience value into a shared experience pool; finally, experience sampling is carried out in a shared experience pool according to probability sampling, and parameters of the deep neural network are updated for actual decision making in the next step; each agent selects actions according to the strategy network, outputs the actions to the environment for execution, iterates until the preset training step number or learning convergence is reached, and then collects experience according to the state, the actions and the rewarding information. The specific process of the priority calculation in the step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
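A minimal sketch of how formula (1) could be evaluated for a single transition is given below; the function name compute_priority and the convention of passing Q(s_i, a_i) and the vector of target-network Q values as plain numbers are assumptions made for illustration, not the patent's implementation:

```python
import numpy as np

def compute_priority(reward, q_sa, target_q_next, gamma=0.99, eps=0.01):
    """Priority of one experience sample following formula (1):
    p_i = |r_i| + |r_i + gamma * max_a' Q'(s_{i+1}, a') - Q(s_i, a_i)| + eps.

    `q_sa` is Q(s_i, a_i) from the current network and `target_q_next` is the
    vector of target-network Q values over all next-state actions.
    """
    td_error = reward + gamma * np.max(target_q_next) - q_sa
    return abs(reward) + abs(td_error) + eps
```

For example, compute_priority(1.0, 1.5, np.array([0.5, 2.0])) returns 1 + |1 + 0.99·2 - 1.5| + 0.01 = 2.49.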
the structure of the priority experience sharing algorithm is shown in fig. 3, firstly, an agent observes the environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates the experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
The specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
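The probability sampling described above can be sketched as follows; α = 0.6 is an assumed illustrative value, since the patent text does not prescribe a particular α:

```python
import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6):
    """Draw sample indices with P(i) = p_i^alpha / sum_k p_k^alpha.

    alpha = 0 reduces to uniform sampling; larger alpha favours
    high-priority samples while keeping every sample reachable.
    """
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()
    # sampling with replacement keeps the call valid for any batch size
    return np.random.choice(len(priorities), size=batch_size, p=probs)
```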
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
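The policy update algorithm itself is not reproduced in the patent text (it is given in the drawings), so the loop below is only a hedged sketch of the steps described above, reusing the SharedReplayBuffer and sample_indices sketches: sample by priority, compute TD errors, update the critic and actor, and write refreshed priorities back to the shared pool. The agent interface (critic_td_error, update_critic, update_actor, recompute_priority) is assumed for illustration and stands in for whatever actor-critic implementation is used:

```python
def shared_update_step(buffer, agents, batch_size=64, alpha=0.6):
    """One priority-shared training round (illustrative sketch only)."""
    if len(buffer) < batch_size:
        return  # not enough shared experience collected yet

    idx = sample_indices(buffer.priorities(), batch_size, alpha)
    batch = [buffer.storage[i] for i in idx]

    for agent in agents:
        td_errors = agent.critic_td_error(batch)   # TD error of the value network
        agent.update_critic(td_errors)             # corresponding critic update
        agent.update_actor(batch)                  # Q values guide the policy network update

    # recompute priorities with the freshly updated networks and write them
    # back to the shared pool; keeping only the last agent's estimate is a
    # simplification made for this sketch
    for i, (agent_id, s, a, r, s_next, _) in zip(idx, batch):
        new_p = agents[-1].recompute_priority(s, a, r, s_next)
        buffer.storage[i] = (agent_id, s, a, r, s_next, new_p)
```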

Claims (4)

1. A multi-agent reinforcement learning method based on priority experience sharing, characterized by comprising the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
2. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of experience collection in step 1 is as follows:
First, each agent observes the environment state and takes an action to explore; second, the experience data of each agent undergoes priority calculation, which judges the value of the experience, and the experience is stored in the shared experience pool; finally, experiences are sampled from the shared pool according to the sampling probabilities, and the parameters of the deep neural networks are updated for the next round of actual decision making. Each agent selects actions with its policy network and outputs them to the environment for execution, iterating until a preset number of training steps is reached or learning converges, while collecting experience from the state, action and reward information.
3. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of priority calculation in step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
the structure of the priority experience sharing algorithm is as follows: firstly, an agent observes an environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
4. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
CN202310456116.5A 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing Pending CN116596059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310456116.5A CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310456116.5A CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Publications (1)

Publication Number Publication Date
CN116596059A true CN116596059A (en) 2023-08-15

Family

ID=87599908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310456116.5A Pending CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Country Status (1)

Country Link
CN (1) CN116596059A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407514A (en) * 2023-11-28 2024-01-16 星环信息科技(上海)股份有限公司 Solution plan generation method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111858009B (en) Task scheduling method of mobile edge computing system based on migration and reinforcement learning
CN109886343B (en) Image classification method and device, equipment and storage medium
CN113570039B (en) Block chain system based on reinforcement learning optimization consensus
CN113191484A (en) Federal learning client intelligent selection method and system based on deep reinforcement learning
CN110691422A (en) Multi-channel intelligent access method based on deep reinforcement learning
CN113568727B (en) Mobile edge computing task allocation method based on deep reinforcement learning
CN111695690A (en) Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN111832627A (en) Image classification model training method, classification method and system for suppressing label noise
CN113434212A (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN106815782A (en) A kind of real estate estimation method and system based on neutral net statistical models
CN112717415B (en) Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game
CN116596059A (en) Multi-agent reinforcement learning method based on priority experience sharing
CN113487039B (en) Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN116448117A (en) Path planning method integrating deep neural network and reinforcement learning method
CN116992779B (en) Simulation method and system of photovoltaic energy storage system based on digital twin model
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
CN116341605A (en) Grey wolf algorithm hybrid optimization method based on reverse learning strategy
CN112215412A (en) Dissolved oxygen prediction method and device
Jiang et al. Action candidate based clipped double q-learning for discrete and continuous action tasks
CN117010482A (en) Strategy method based on double experience pool priority sampling and DuelingDQN implementation
CN114866272B (en) Multi-round data delivery system of true value discovery algorithm in crowd-sourced sensing environment
CN116361639A (en) Self-adaptive federal learning method suitable for artificial intelligence internet of things heterogeneous system
CN113890653B (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115618241A (en) Task self-adaption and federal learning method and system for edge side vision analysis
Yu et al. Historical best Q-networks for deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination