CN110945542A - Multi-agent deep reinforcement learning agent method based on smart power grid - Google Patents

Multi-agent deep reinforcement learning agent method based on smart power grid

Info

Publication number
CN110945542A
CN110945542A (application CN201880000858.4A); granted publication CN110945542B
Authority
CN
China
Prior art keywords
agent
reinforcement learning
power
action
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880000858.4A
Other languages
Chinese (zh)
Other versions
CN110945542B (en)
Inventor
侯韩旭 (Hou Hanxu)
郝建业 (Hao Jianye)
杨耀东 (Yang Yaodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Publication of CN110945542A publication Critical patent/CN110945542A/en
Application granted granted Critical
Publication of CN110945542B publication Critical patent/CN110945542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention is applicable to the technical field of power automation control and provides a multi-agent deep reinforcement learning agent method based on a smart grid, which comprises the following steps: S1, calculating the action standard value corresponding to the current state according to the reward obtained by the selected action, and updating the parameters of the neural network; S2, establishing an "external competition, internal cooperation" multi-agent system according to the types of the consumers and producers; S3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents. The input layer of the neural network can accept the feature values characterizing the state directly, whereas a Q-table requires the feature values to be discretized to reduce the state space.

Description

Multi-agent deep reinforcement learning agent method based on smart power grid Technical Field
The invention belongs to the technical field of power automation control, and particularly relates to a multi-agent deep reinforcement learning agent method based on a smart grid.
Background
A smart grid uses a series of digital communication technologies to modernize the power grid[1][2]. A country's economy, national defense and even the safety of its residents all depend on the reliability of the power grid. In actual operation, a smart grid not only allows users to select suitable power packages in real time, but also actively allocates power resources to achieve balanced supply. The grid can adjust to and feed back market fluctuations in real time, provides bidirectional information and communication services and comprehensive awareness of grid conditions, and is an important component of 21st-century modernization.
Previously, power grid technology was primarily designed to supply power unidirectionally from large centralized power plants to distributed consumers, such as homes and industrial facilities. More recently, a popular research topic in smart grids has been predicting users' power demand so that electricity prices and bidding strategies can be adjusted in advance to maximize the agent's profit[3]. The agent mechanism is another core element of smart grid design: through agents, the smart grid coordinates local producers, local consumers, large power plants and other participants, and applies market adjustment mechanisms to achieve a win-win outcome. One key issue is achieving bidirectional communication between the grid and local wind and solar producers; Reddy et al.[4] first proposed using a reinforcement learning framework to design agents for the local grid as a solution to this problem. A key element of the reinforcement learning framework is the state space, and in [4] strategies are learned from manually constructed features, which limits the number of economic signals an agent can accommodate and its ability to absorb new signals when the environment changes. Reinforcement learning has been applied in e-commerce to solve many practical problems, mainly by learning optimal strategies through the agent's interaction with the environment; for example, Pardoe et al.[5] proposed a data-driven approach to designing electronic auctions based on reinforcement learning. In the electric power field, reinforcement learning has been used to study wholesale market trading strategies[6] or to help build physical control systems. Examples in electricity wholesale include [7], which mainly studies bidding strategies in wholesale electricity auctions, and Ramavajjala et al.[8], who study Next State Policy Iteration (NSPI) as an extension of Least Squares Policy Iteration (LSPI)[9] and show its benefit on the pre-delivery commitment problem in wind power generation. Physical control applications of reinforcement learning include load and frequency control of the power grid and autonomous monitoring applications[10]. However, most previous work on power grid agents idealizes the grid environment: on one hand, highly simplified settings are used to simulate the complex operating mechanisms of a real grid; on the other hand, the information provided by the environment is highly abstracted when the algorithm is designed, so many important details are lost, leading to inaccurate decisions.
On the other hand, customers in the smart grid exhibit a variety of power consumption or production patterns, which indicates that different pricing strategies are needed for different types of customers. Following this idea, a retail agent can be regarded as a multi-agent system in which each internal agent is responsible for pricing one class of power consumers or producers. For example, Wang et al. assign an independent pricing agent to each customer in their agent framework[23]. However, the authors run independent reinforcement learning processes for the different customers and treat the profit of the whole agent as the immediate reward of each internal agent. This does not distinguish each internal agent's individual contribution to the agent's profit and therefore does not motivate the agents to learn the best strategies.
Unlike traditional machine learning, reinforcement learning gradually learns, through constant interaction with the environment, a strategy that maximizes the cumulative reward[14]. Reinforcement learning mimics the human cognitive process and is studied in many disciplines, such as game theory and cybernetics. Reinforcement learning lets an agent learn a strategy from the environment, which is typically modeled as a Markov Decision Process (MDP)[15], and many algorithms in this setting use dynamic programming techniques[16][17][18].
The basic reinforcement learning model includes:
a set of environment and agent states S = {s1, s2, …, sn};
a set of agent actions A = {a1, a2, …, an};
a transition function between states δ(s, a) → s′;
a reward function r(s, a).
In many works, if the agent is assumed to be able to observe the current environmental state, the setting is said to be fully observable; otherwise it is partially observable. A reinforcement learning agent communicates with the environment at discrete time steps. As shown in Fig. 1, at each time t the agent obtains an observation that generally includes the reward r_t of that time step, then selects an action a_t from the available actions; this action acts on the environment, which under the action reaches a new state s_{t+1}, the agent obtains the reward r_{t+1} of the new time step, and the process repeats. In interacting with the environment, the reinforcement learning agent gradually learns a policy π: S → A that maximizes the cumulative reward. To learn a near-optimal policy, the agent must adjust its strategy over a long period of time. The basic setting and learning process of reinforcement learning are thus well suited to the power grid domain.
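To make this interaction loop concrete, here is a minimal Python sketch of the agent-environment cycle; the environment, its states, actions and rewards are illustrative placeholders chosen by us, not the grid market defined later.

```python
import random

class ToyEnvironment:
    """Placeholder environment: states, transition delta(s, a) -> s' and reward r(s, a) are illustrative only."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # toy transition and reward
        self.state = (self.state + action) % 5
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward

def random_policy(state, actions):
    # stands in for the learned policy pi: S -> A
    return random.choice(actions)

env = ToyEnvironment()
actions = [0, 1, 2]
state = env.reset()
for t in range(10):                         # discrete time steps
    action = random_policy(state, actions)
    next_state, reward = env.step(action)   # environment reaches s_{t+1}, returns the reward
    state = next_state
```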
Regarding how the optimal policy is found, we introduce the value function method. The value function approach tries to find a policy that maximizes the return by maintaining estimates of the expected returns of some policies. To formally define optimality, we define the value of a policy π as:
V^π(s) = E[R | s, π]    (2-1)
where R denotes the random return obtained by following policy π from the initial state s. Define V*(s) as the maximum possible value of V^π(s):
V*(s) = max_π V^π(s)    (2-2)
A policy that achieves these optimal values in every state is called an optimal policy. Although the state values suffice to define optimality, it is also useful to define action values. Given a state s, an action a and a policy π, the action value of the pair (s, a) under policy π is defined as:
Q^π(s, a) = E[R | s, a, π]    (2-3)
where R denotes the cumulative reward obtained by first taking action a in state s and then following policy π. From MDP theory, given the Q values of the optimal policy, we can always act optimally by simply selecting the action with the highest value in each state. The action value function of the optimal policy is denoted Q*. Knowing the optimal action values is sufficient to know how to act optimally.
When the transition function and reward function of the environment are both unknown, we can use Q-learning to update the action value function:
Q_t(s, a) ← (1 − α_t) Q_{t−1}(s, a) + α_t [r_t + γ max_{a′} Q_{t−1}(s′, a′)]    (2-4)
where α_t is the learning rate, r_t is the reward at the current time, and γ is the discount factor. The current action value Q_t(s, a) is updated once after each interaction with the environment: part of the previous value of Q for that state and action is retained, Q(s, a) is re-estimated from the reward obtained at the current time and the new state reached, and the two parts are combined into the new action value for that time.
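A minimal tabular sketch of update (2-4); the learning-rate, discount and exploration values here are illustrative choices of ours, not the experimental settings.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                      # Q[(state, action)] -> estimated action value

def q_update(s, a, r, s_next, actions):
    """One application of Q_t(s,a) <- (1-alpha)*Q_{t-1}(s,a) + alpha*[r + gamma*max_a' Q_{t-1}(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def epsilon_greedy(s, actions):
    """Explore with probability epsilon, otherwise pick the action with the largest stored value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# example usage with symbolic states
q_update("s0", 1, 0.5, "s1", actions=[0, 1, 2])
```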
An artificial neural network is a computational model used in machine learning, computer science and other research fields[19][20]. Artificial neural networks are based on a large number of interconnected elementary units, artificial neurons.
Generally, the artificial neurons of adjacent layers are connected to each other; signals are fed in at the first (input) layer and read out at the last (output) layer. Current deep learning programs typically have thousands to millions of neural nodes and millions of connections. The goal of artificial neural networks is to solve problems in a human-like manner, although some kinds of neural networks are more abstract. The "network" in a neural network refers to the connections between the artificial neurons of different layers. A typical artificial neural network is defined by three types of parameters:
the connection mode of different layer neurons;
the weights on these connections, which can be updated in a later learning process;
the activation function that converts a neuron's weighted input into its output activation.
Mathematically, the function f(x) represented by a neural network is defined as a composition of other functions g_i(x), which is conveniently represented as a network structure with arrows depicting the dependencies between variables. One widely used form is the nonlinear weighted sum:
f(x) = K(∑_i w_i g_i(x))    (2-5)
where K is an activation function. The most important property of the activation function is that it provides a smooth transition as the input changes, i.e., a small change in input produces a small change in output. In this way the inputs are continually adjusted, according to the weights on the connections, until the final output is formed. Such an output is usually not yet the result we want, so the neural network also needs to learn; indeed, the most attractive property of neural networks is the possibility of learning. Given a specific task to be learned and a class F of candidate functions, learning means using a series of observations to find a function f* in F that solves the task. To this end we define a loss function C: F → ℝ such that, for the optimal function f*, no other solution has a smaller loss value than f*:
C(f*) ≤ C(f) for all f ∈ F    (2-6)
the loss function is an important concept of learning, which is a measure of the distance of a particular solution from the optimal solution. The learning process searches the solution space of the problem to find the function with the smallest loss function value. For application problems where the solution needs to be found in the data, the loss must be a function of these actually observed samples. The loss function is usually defined as a statistic, since generally only statistically observed samples can be evaluated. Therefore, for the problem of finding the model function f, it is the minimization of the loss function C ═ E [ (f (x) -y)2]Where the data pairs (x, y) come from certain distributions D. In practical applications, we usually have only N limited samples, so we can only minimize
Figure PCTCN2018093753-APPB-000003
Figure PCTCN2018093753-APPB-000004
Thus, the loss function is minimized over some samples of the data, rather than over the theoretical distribution of the entire data set. When the sample-based loss function values are minimized, we find the optimal parameters of the neural network over these samples.
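As a small illustration of minimizing the sample-based loss Ĉ rather than the theoretical expectation, the sketch below fits a linear model f(x) = w·x by gradient descent on N synthetic observations; the data and step size are placeholders of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=(N, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = x @ true_w + 0.1 * rng.normal(size=N)    # observed samples (x_i, y_i)

w = np.zeros(3)
lr = 0.05
for _ in range(500):
    pred = x @ w
    grad = 2.0 / N * x.T @ (pred - y)        # gradient of (1/N) * sum (f(x_i) - y_i)^2
    w -= lr * grad

empirical_loss = np.mean((x @ w - y) ** 2)   # the minimized sample-based loss C_hat
```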
Q-network: since a neural network can be used as a function approximator, the Q function in reinforcement learning can also be fitted with a neural network[21][22]. A Q-table stores the values of state-action pairs; with a Q-network there is no need to consider discretizing the state space, as the feature values representing a state are fed directly into the network and the parameters of the network fit the Q-value function, which naturally solves the problem of an infinite state space. However, unlike traditional neural network applications, reinforcement learning does not have samples available from the start: new rewards and observations are obtained through continual interaction with the environment, and reinforcement learning also has no sample labels to judge whether the model's outputs are accurate. If, however, we set aside the traditional application of neural networks and consider only the network's own function-fitting ability, the neural network can be treated, like a Q-table, as a tool for storing Q(s, a); each time the agent interacts with the environment, it can update the parameters of the neural network just as it would update a Q-table, so that the Q(s, a) the network outputs approaches the value currently considered correct.
We now consider how to design the input, output and loss function of a Q-network so that it is functionally identical to a Q-table. First, the input is still the state S, but instead of discretizing the state space from infinite to finite as in traditional reinforcement learning, each feature representing the state can be used directly as an input to the neural network. Similarly, just as a Q-table stores for each state a row of values representing the estimated cumulative reward of each action in that state, each node of the neural network's output layer represents an action, and the output value of each node is the estimated cumulative value Q(S, a_i) of that action in the input state S. By designing the input and output layers in this way, the neural network realizes the function of storing Q(s, a). Meanwhile, the parameters of the network must be updated, and by the definition of the loss function there is no existing label y_i for an input state; however, following the Q-learning update rule for action values, we can update the parameters of the network based on the Q(s, a) already stored in it and the reward r_t at the current time. For example, at time t the agent is in state s_t; after it selects action a_t according to the policy, it enters the next state s_{t+1} and obtains reward r_t. Now, when we update the parameters of the network, we want Q(s_t, a_t) to be updated towards the update target r_t + max_{a′} Q(s′, a′) used in Q-learning:
C = [Q_t(s_t, a_t) − (r_t + max_{a′} Q_{t−1}(s_{t+1}, a′))]²    (2-8)
so that the action value at the current time approaches this update target. Likewise, a learning rate is set for the update. Thus, the process of using a Q-network to store and update action values is the same as using a Q-table directly; the only difference is that the input layer of the neural network can accept the feature values characterizing the state directly, whereas a Q-table requires the feature values to be discretized to reduce the state space.
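A minimal PyTorch sketch of loss (2-8) for a single transition; the network width, optimizer and learning rate are illustrative choices of ours and are not specified in the text.

```python
import torch
import torch.nn as nn

n_features, n_actions = 24, 6               # e.g. state features and price actions
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
gamma = 0.95

def q_network_update(s_t, a_t, r_t, s_next):
    """Move the stored Q(s_t, a_t) towards the Q-learning target r_t + gamma * max_a' Q(s_{t+1}, a')."""
    with torch.no_grad():
        target = r_t + gamma * q_net(s_next).max()
    q_sa = q_net(s_t)[a_t]
    loss = (q_sa - target) ** 2              # squared error, as in loss (2-8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# example usage with random feature vectors standing in for states
s_t, s_next = torch.randn(n_features), torch.randn(n_features)
q_network_update(s_t, a_t=2, r_t=1.0, s_next=s_next)
```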
Reference to the literature
[1] M. Amin and B. Wollenberg. Toward a smart grid: Power delivery for the 21st century. IEEE Power and Energy Magazine, 3(5):34-41, 2005.
[2] C. Gellings, M. Samotyj, and B. Howe. The future's power delivery system. IEEE Power and Energy Magazine, 2(5):40-48, 2004.
[3]Wang X,Zhang M,Ren F.Load Forecasting in a Smart Grid through Customer Behaviour Learning Using L1-Regularized Continuous Conditional Random Fields[C].Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems.International Foundation for Autonomous Agents and Multiagent Systems,2016:817-826.
[4]Reddy P P,Veloso M M.Strategy learning for autonomous agents in smart grid markets[J].2011.
[5]Pardoe D,Stone P,Saar-Tsechansky M,et al.Adaptive Auction Mechanism Design and the Incorporation of Prior Knowledge[J].INFORMS Journal on Computing,2010,22(3):353-370.
[6]Babic J,Podobnik V.An analysis of power trading agent competition 2014[M].Agent-Mediated Electronic Commerce.Designing Trading Strategies and Mechanisms for Electronic Markets.Springer International Publishing,2014:1-15.
[7]Petrik M,Taylor G,Parr R,et al.Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes[J].Computer Science,2010.
[8]Ramavajjala V,Elkan C.Policy iteration based on a learned transition model[C].European Conference on Machine Learning and Knowledge Discovery in Databases.Springer-Verlag,2012:211-226.
[9]Lagoudakis M G,Parr R.Least-squares policy iteration[M].JMLR.org,2003.
[10]Venayagamoorthy G K.Potentials and promises of computational intelligence for smart grids[C].Power & Energy Society General Meeting,2009.PES'09.IEEE.IEEE,2009:1-6.
[11] Wikipedia [EB/OL]. https://en.wikipedia.org/wiki/Smart_grid
[12] EPRI [EB/OL]. https://www.epri.com/#/about/epri
[13]Kintner-Meyer M C,Chassin D P,Kannberg L D,et al.GridWise:The benefits of a transformed energy system[J].Pacific Northwest National Laboratory under contract with the United States Department of Energy,2008:25.
[14]Sutton R S,Barto A G.Reinforcement learning:An introduction[M].Cambridge:MIT press,1998.
[15]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[16]Lewis F L,Vrabie D.Reinforcement learning and adaptive dynamic programming for feedback control[J].IEEE circuits and systems magazine,2009,9(3).
[17]Busoniu L,Babuska R,De Schutter B,et al.Reinforcement learning and dynamic programming using function approximators[M].CRC press,2010.
[18]Szepesvári C,Kioloa M.Reinforcement learning:dynamic programming[J].University of Alberta,MLSS,2008,8.
[19] Wikipedia [EB/OL]. https://en.wikipedia.org/wiki/Artificial_neural_network
[20]Wang S C.Artificial neural network[M].Interdisciplinary computing in java programming.Springer US,2003:81-100.
[21]Mnih V,Kavukcuoglu K,Silver D,et al.Playing Atari with Deep Reinforcement Learning[J].Computer Science,2013.
[22]Huang B Q,Cao G Y,Guo M.Reinforcement learning neural network to the problem of autonomous mobile robot obstacle avoidance[C].Machine Learning and Cybernetics,2005.Proceedings of 2005 International Conference on.IEEE,2005,1:85-89.
[23] DoE [EB/OL]. http://www.eia.doe.gov, 2010.
[24]Olfati-Saber R,Fax J A,Murray R M.Consensus and cooperation in networked multi-agent systems[J].Proceedings of the IEEE,2007,95(1):215-233.
[25]Ferber J.Multi-agent systems:an introduction to distributed artificial intelligence[M].Reading:Addison-Wesley,1999.
[26]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[27]Tan M.Multi-agent reinforcement learning:Independent vs.cooperative agents[C].Proceedings of the tenth international conference on machine learning.1993:330-337.
[28]Wiering M.Multi-agent reinforcement learning for traffic light control[C].ICML.2000:1151-1158.
[29]Hernández L,Baladron C,Aguiar J M,et al.A multi-agent system architecture for smart grid management and forecasting of energy demand in virtual power plants[J].IEEE Communications Magazine,2013,51(1):106-113.
[30]Niu D,Wang Y,Wu D D.Power load forecasting using support vector machine and ant colony optimization[J].Expert Systems with Applications,2010,37(3):2531-2539.
[31]Li H Z,Guo S,Li C J,et al.A hybrid annual power load forecasting model based on generalized regression neural network with fruit fly optimization algorithm[J].Knowledge-Based Systems,2013,37:378-387.
[32]Gong S,Li H.Dynamic spectrum allocation for power load prediction via wireless metering in smart grid[C].Information Sciences and Systems(CISS),2011 45th Annual Conference on.IEEE,2011:1-6
[33]Xishun Wang,Minjie Zhang,and Fenghui Ren.A hybrid-learning based broker model for strategic power trading in smart grid markets.Knowledge-Based Systems,119,2016.
[34]Electricity consumption in a sample of london households,2015.https://data.london.gov.uk/dataset/smartmeter-energyuse-data-in-london-households.
[35]S Hochreiter and J Schmidhuber.Long short-term memory.Neural Computation,9(8):1735–1780, 1997.
Disclosure of Invention
The invention aims to provide a multi-agent deep reinforcement learning agent method based on a smart grid, in order to solve the problem that the agent's state space is infinite.
The invention is realized in such a way that a multi-agent deep reinforcement learning agent method based on a smart grid comprises the following steps:
s1, calculating the corresponding action standard value under the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
s2, establishing a multi-agent of 'external competition and internal cooperation' according to the types of the consumers and the producers;
s3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents, the formula of which is:
Figure PCTCN2018093753-APPB-000005
Figure PCTCN2018093753-APPB-000006
wherein C represents the category of the consumers and P represents the category of the producers,
Figure PCTCN2018093753-APPB-000007
denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, κ_{t,P} denotes the amount of power generated by a producer of a given category at time t, and
Figure PCTCN2018093753-APPB-000008
is the share of the imbalance charge attributed to the individual internal agent when computing its profit.
The further technical scheme of the invention is as follows: the step S1 further includes the following steps:
s11, initializing parameters of the neural network;
s12, initializing the state value at the beginning of each period in the operation period;
s13, selecting the state value by using the probability or selecting the maximum action value in the current state;
s14, executing the selected action value and entering the next state after obtaining the reward;
s15, calculating the standard value corresponding to the current state and updating the neural network parameters so that the stored Q(s_t, a_t) approaches y_t.
The further technical scheme of the invention is as follows: in step S15, the action values are stored in the network parameters; each time a new state is entered, its feature values only need to be fed into the neural network in order, and the action with the largest Q(s, a) value can be selected from the output layer of the neural network as the next action to perform.
The further technical scheme of the invention is as follows: the step S2 includes the following steps:
s21, classifying the consumers according to the power consumption difference;
and S22, classifying the producers according to the actual power generation situation.
The further technical scheme of the invention is as follows: in step S3, through the reward function, each agent considers both its own benefit and the overall benefit when selecting an action.
The further technical scheme of the invention is as follows: the consumers are divided into daytime consumers and all-day consumers according to the situation of power consumption.
The further technical scheme of the invention is as follows: the producer is divided into a whole-day generator and a daytime generator according to the actual power generation condition.
The invention has the beneficial effects that: the input layer of the neural network may accept direct input of values of features that characterize the state, while the Q-table needs to discretize the feature values to reduce the state space.
Drawings
Fig. 1 is a schematic diagram of a classic scenario of reinforcement learning.
FIG. 2 is a schematic diagram, provided by an embodiment of the present invention, of a neural network with one hidden layer: neurons in the first layer pass data through synapses to neurons in the second layer, and neurons in the second layer pass data through synapses to neurons in the third layer; the synapses store parameters, called weights, that manipulate the data in the computation.
Fig. 3 is a schematic diagram of a proxy framework.
Fig. 4 is a schematic diagram of the recurrent (LSTM-based) DQN.
Fig. 5 is a graph of the revenue of each agent in each of the 20 experiments.
FIG. 6 is a schematic diagram of the revenue of each agent in each of the 20 experiments in the multi-type user environment.
[Amended 03.01.2019 under Rule 91] FIG. 7 is a first graph of electricity usage for different types of users.
FIG. 8 is a schematic diagram of the agents' revenue over the evaluation period.
[Amended 03.01.2019 under Rule 91] FIG. 9 is a second graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 10 is a third graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 11 is a fourth graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 12 is a fifth graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 13 is a sixth graph of electricity usage for different types of users.
Detailed Description
This work improves the agent's negotiation algorithm in two respects. First, it solves the problem that the agent's state space is infinite. Second, the local environment is modified slightly to make the scenario more realistic, and correspondingly an "external competition, internal cooperation" multi-agent design is proposed to make the agent more competitive. Finally, we introduce real electricity usage data and, with the help of some sequence-modeling techniques, help our agent framework learn effective pricing strategies in the more complex environment.
The smart grid is set up as a local market similar to [4]; the specific types of consumers and producers and the way electricity is produced and used are designed only in the second improvement. In the local market there are consumers who consume power and small producers who generate power, and several agents buy and sell power between the consumers and small producers. Agents are needed because it is inconvenient for small producers and consumers to coordinate directly; through the intermediate link of the agent, electricity users can buy and sell power conveniently, resources can be coordinated better, and the supply-demand balance of power resources is guaranteed. Concretely, every hour each agent issues a contract to the producers and consumers, all users choose among the contracts of the different agents, and at that moment each agent learns the contract prices of the other agents and how many producers and consumers have chosen its own contract. Based on the contract subscriptions and the other agents' contract prices, the agent then adjusts its contract price for the next time step to maximize its profit. Thus each hour serves as the basic unit of time: the agent interacts with the environment once and the users subscribe to contracts.
An imbalance in power supply and demand occurs when the producers and consumers subscribing to the agent's contracts require different amounts of power. In that case we do not cover the shortfall through the wholesale market; instead, a penalty fee is charged to the agent for causing the supply-demand imbalance. Next, we define this local market more precisely. First, for electricity prices, we set the price range to 0.01 to 0.20[23], with a minimum price step of 0.01. Each agent B_k (k = 1, 2, …, K) posts two bids at time t: a price offered to consumers, p_{t,C}^{Bk}, and a price offered to producers, p_{t,P}^{Bk}. In addition, at each time t agent B_k knows the number of consumers and producers subscribing to it, N_{t,C}^{Bk} and N_{t,P}^{Bk}. For convenience, we assume that each consumer consumes κ_{t,C} units of power at each time t and each producer generates κ_{t,P} units. Finally, we set the imbalance cost per unit of power at time t to φ_t. The reward of agent B_k at time t can then be computed as:
Figure PCTCN2018093753-APPB-000013
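The reward formula itself appears only as an image in the original document; the sketch below assumes the form implied by the definitions above — revenue from subscribed consumers, minus payments to subscribed producers, minus the imbalance penalty — and all function and variable names are ours.

```python
def broker_reward(p_consumer, p_producer, n_consumers, n_producers,
                  kappa_c, kappa_p, phi):
    """Reward of agent B_k at time t under the assumed form:
    consumer revenue - producer payments - imbalance penalty."""
    demand = n_consumers * kappa_c           # power sold to subscribed consumers
    supply = n_producers * kappa_p           # power bought from subscribed producers
    revenue = p_consumer * demand
    cost = p_producer * supply
    imbalance_penalty = phi * abs(demand - supply)
    return revenue - cost - imbalance_penalty

# example at the experimental scale: 1000 consumers x 10 units, 100 producers x 100 units
r = broker_reward(0.13, 0.10, 1000, 100, 10, 100, 0.1)
```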
thus, we have generally defined the basic operation of the local market. In the following, we first explain the definitions of the two status indicators. The first is the PriceRangeStatus (PRS) index to determine if the market is reasonable, and if the market is reasonable as seen by an agent, then it must satisfy:
Figure PCTCN2018093753-APPB-000014
wherein, muLIs a subjective value that indicates the agent's desire for marginal benefit in the market. At the same time, the user can select the desired position,
Figure PCTCN2018093753-APPB-000015
Figure PCTCN2018093753-APPB-000016
wherein, BLRepresenting the agent itself. The second indicator, PortfolioStatus (PS), indicates whether the agent itself has achieved supply-demand balance. Next, we set several actions operating on the price as a set of actions that all agents can choose to take.
A = {Maintain, Lower, Raise, Revert, Inline, MinMax}
Through these actions at time t, each agent sets its prices for the next time step, p_{t+1,C}^{Bk} and p_{t+1,P}^{Bk}:
● Maintain keeps the prices of the previous time step;
● Lower decreases both the producer and consumer prices at time t by 0.01;
● Raise increases both the producer and consumer prices at time t by 0.01;
● Revert moves each price by 0.01 towards the midpoint price,
Figure PCTCN2018093753-APPB-000019
● Inline sets the new producer and consumer prices to
Figure PCTCN2018093753-APPB-000020
Figure PCTCN2018093753-APPB-000021
● MinMax sets the new producer and consumer prices to
Figure PCTCN2018093753-APPB-000022
Figure PCTCN2018093753-APPB-000023
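A sketch of applying the price actions under the stated price range [0.01, 0.20] and step 0.01. Only the four fully specified actions are implemented; Inline and MinMax depend on competitor-price formulas that appear as images in the original, and taking Revert's midpoint as the midpoint of the agent's own two prices is our assumption.

```python
P_MIN, P_MAX, STEP = 0.01, 0.20, 0.01

def clamp(p):
    """Keep a price inside the allowed range and on the 0.01 grid."""
    return max(P_MIN, min(P_MAX, round(p, 2)))

def apply_action(action, p_c, p_p):
    """Return the next consumer/producer prices for the fully specified actions.
    Inline and MinMax need competitor prices (formulas are images in the original) and are omitted."""
    if action == "Maintain":
        return p_c, p_p
    if action == "Lower":
        return clamp(p_c - STEP), clamp(p_p - STEP)
    if action == "Raise":
        return clamp(p_c + STEP), clamp(p_p + STEP)
    if action == "Revert":
        mid = (p_c + p_p) / 2.0              # assumed midpoint of the agent's own two prices
        move = lambda p: clamp(p - STEP if p > mid else p + STEP if p < mid else p)
        return move(p_c), move(p_p)
    raise ValueError(f"unsupported or competitor-dependent action: {action}")

print(apply_action("Raise", 0.13, 0.10))     # -> (0.14, 0.11)
```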
To set up competitors for comparison and verification, we design several fixed-policy agents. The balancing strategy attempts to reduce the supply imbalance by adjusting the producer and consumer contract prices: it raises both contract prices when it sees excess demand and lowers both when it sees excess supply. The greedy strategy attempts to maximize profit by increasing its profit margin, i.e., widening the difference between the consumer and producer contract prices whenever the market is rational. Both strategies can be characterized as adaptive, in that they react to market and portfolio conditions, but they do not learn from the past. We also design two non-adaptive agents: a fixed-strategy agent that always maintains a certain price, and a random agent that randomly selects one of the six actions at each price adjustment.
TABLE 3-1 Balancing Algorithm
Figure PCTCN2018093753-APPB-000024
TABLE 3-2 greedy Algorithm
Figure PCTCN2018093753-APPB-000025
The first improvement changes the storage structure of Q-learning from a Q-table to a Q-network; the method is otherwise identical to Q-learning, i.e., the internal parameters of the storage structure are updated after each interaction with the environment. Later work also considers an experience replay mechanism.
TABLE 3-3Q-learning Algorithm Using Q-network
Figure PCTCN2018093753-APPB-000026
The first row of the algorithm initializes the parameters of the neural network. The second row indicates that the experiment runs for M episodes; the third row initializes the state at the beginning of each episode; the fifth and sixth rows select a random action with a certain probability and otherwise select the action with the largest action value in the current state. The seventh row executes the selected action, obtains the reward and enters the next state. The eighth row computes the standard value of the selected action in the current state, and the ninth row updates the parameters of the neural network according to the value computed in the eighth row, so that the stored Q(s_t, a_t) approaches y_t. Because the action values are stored in the network parameters, whenever a new state is entered we only need to feed its feature values into the neural network, and the action with the largest Q(s, a) value can be read off the output layer as the next action to execute.
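A compact sketch of the training loop of Table 3-3 as described above, with ε-greedy selection and a squared-error update towards y_t; the environment interface (reset/step returning a feature vector and a reward), the network width and the optimizer settings are placeholders of ours.

```python
import random
import torch
import torch.nn as nn

class DummyMarketEnv:
    """Stand-in environment: returns random 24-feature states and rewards."""
    def reset(self):
        return torch.randn(24).tolist()
    def step(self, action):
        return torch.randn(24).tolist(), random.random()

def train_q_network(env, n_features, n_actions, episodes=200, steps=240,
                    gamma=0.95, epsilon=0.1, lr=1e-3):
    q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))            # row 1: initialize parameters
    opt = torch.optim.RMSprop(q_net.parameters(), lr=lr)
    for episode in range(episodes):                            # row 2: M episodes
        s = torch.as_tensor(env.reset(), dtype=torch.float32)  # row 3: initial state
        for t in range(steps):
            if random.random() < epsilon:                      # rows 5-6: epsilon-greedy
                a = random.randrange(n_actions)
            else:
                a = int(q_net(s).argmax())
            s_next, r = env.step(a)                            # row 7: act, get reward, next state
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            with torch.no_grad():
                y_t = r + gamma * q_net(s_next).max()          # row 8: standard (target) value
            loss = (q_net(s)[a] - y_t) ** 2                    # row 9: move Q(s_t, a_t) towards y_t
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
    return q_net

q_net = train_q_network(DummyMarketEnv(), n_features=24, n_actions=6, episodes=2, steps=10)
```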
In addition to using a neural network to store the action values of reinforcement learning, the multi-agent-based intelligent negotiation algorithm also considers the more realistic situation in which there are multiple types of consumers, while small producers fall into the two cases of wind power and solar power. To study this general phenomenon, we modify the environment accordingly. First, consumers are divided into two types: ordinary users who consume no power at night, and users who consume power all day. Then, according to the actual situation, producers are divided into two types: wind generators, which can generate power all day, and solar generators, which generate power only in the daytime. There are thus four types of users in the grid environment, and the original approach in which the intelligent agent adjusts a single price uniformly is no longer applicable, so the agent is redesigned as a multi-agent system with one internal agent for each user category. Multiple agents can coordinate and cooperate with each other[24][25][26][27][28], and a multi-agent framework is better suited to a grid whose external environment is particularly complex. Under the original grid rules, contract prices can then be adjusted for each user category in a more targeted way to maximize profit.
However, although the multi-agent system appears externally as a single agent, internally there are four different agents, and how to ensure that these internal agents cooperate with each other deserves careful thought. To make the internal agents of the agent cooperate as much as possible and form a real group that competes with the other agents, we need to redesign the reward function of each internal agent, so that each agent's actions take into account not only the maximization of its own profit but also the benefits of the other internal agents. We redesign the reward function of each internal agent as:
Figure PCTCN2018093753-APPB-000027
where C denotes the category of the consumers and P denotes the category of the producers,
Figure PCTCN2018093753-APPB-000028
denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, and κ_{t,P} denotes the amount of power generated by a producer of a given category at time t, while
Figure PCTCN2018093753-APPB-000029
is the share of the imbalance charge attributed to the individual agent when computing its profit:
Figure PCTCN2018093753-APPB-000030
Further:
Figure PCTCN2018093753-APPB-000031
since a single agent can only buy or sell power from or to the producer, it is not good to measure its own profit directly, but we can consider its contribution to the total profit in reverse, i.e. the loss to the total profit without the agent buying or selling power is the agent's own profit. By considering from the overall relationship, we obtain the individual agents' own profit. Therefore, each intelligent agent can consider the whole benefits while considering the benefits of the intelligent agent when selecting the action through the newly designed reward function.
To verify the effectiveness of the agent framework in a complex environment, we simulate the multi-agent framework with real data: we introduce real 2013 household electricity consumption data from London[34] and select about 1000 users from it. First, publishing a single price to all consumers is not sufficient: although we only consider household users in the retail market, their power usage patterns differ because of different living habits and consumption attitudes. Using multiple internal agents to publish separate electricity prices for different groups of consumers can therefore better promote supply-demand balance. Here we group consumers according to their usage profiles. Since power consumption is time-series data, our agent clusters users with K-Means based on the Dynamic Time Warping (DTW) distance. After clustering, we obtain groups of users with similar electricity usage behavior. The agent architecture in the real-data simulation environment is shown in Fig. 3.
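A sketch of the clustering step, assuming the tslearn library (whose TimeSeriesKMeans supports a DTW metric); the data array here is a stand-in for the hourly London consumption profiles, and the cluster count follows the five classes used later in the experiments.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# stand-in data: one row per household, one column per hourly reading (here a daily profile)
consumption = np.random.rand(100, 24)

X = TimeSeriesScalerMeanVariance().fit_transform(consumption)   # normalize each series
km = TimeSeriesKMeans(n_clusters=5, metric="dtw", random_state=0)
labels = km.fit_predict(X)                                       # cluster label per household
```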
Second, since users' electricity consumption behavior changes from moment to moment, we use Long Short-Term Memory (LSTM)[35], a neural network unit structure with excellent performance on time series, to strengthen our network architecture and help the agent better extract temporal information from past market information so as to make effective decisions. The final neural network architecture used by our agent is shown in Fig. 4.
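A minimal sketch of an LSTM-based Q-network over a short window of past market observations; the layer sizes are illustrative and the exact architecture of Fig. 4 is not reproduced here.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Maps a sequence of past market observations to Q-values for the six price actions."""
    def __init__(self, n_features=24, hidden=64, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):                     # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])       # Q-values read from the last time step

net = RecurrentQNetwork()
window = torch.randn(1, 3, 24)                # e.g. a time-series state of length 3
q_values = net(window)                        # shape (1, 6)
```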
The experiment parameters include those in the method definition and those for running the experiments. There are five agents in total: the intelligent (reinforcement learning) agent, a balancing-strategy agent, a greedy-strategy agent, a fixed-price agent and a random-action agent. In the local grid market the number of consumers is set to 1000 and the number of producers to 100; each consumer consumes 10 basic power units per hour and each producer generates 100 basic power units per hour. The imbalance cost per unit of power is 0.1; note that the imbalance cost cannot be set too small, to prevent an agent from attracting consumer subscriptions at the lowest possible price without buying power from producers. Furthermore, considering that real users have a certain inertia in their subscriptions, we set the users' selection preference to {35, 30, 20, 10, 5}, meaning that 35% of the potential consumers select the subscription with the lowest contract price, 30% select the second lowest, and so on. Producers choose starting from the higher prices according to the same selection preference. We also set the initial prices per unit of electricity: the selling price of electricity is 0.13 and the purchase price is 0.1, and the subjective marginal benefit μ_L of the market is set to 0.02. The run length is set to 300 periods: the first 200 periods are learning periods, in which the agent is allowed to learn, and the last 100 periods are statistical periods, in which each agent's total profit is used as the final criterion of how competitive its algorithm is. Each period has 10 days with 24 hours per day, i.e., 240 basic time units per period. For Q-learning we use the ε-greedy strategy.
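For reference, the experimental settings above can be collected into a single configuration object; a sketch with field names of our choosing.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    n_consumers: int = 1000
    n_producers: int = 100
    consumer_kappa: float = 10.0       # power units consumed per hour
    producer_kappa: float = 100.0      # power units generated per hour
    imbalance_cost: float = 0.1
    selection_preference: tuple = (0.35, 0.30, 0.20, 0.10, 0.05)  # share choosing 1st..5th best price
    init_consumer_price: float = 0.13
    init_producer_price: float = 0.10
    mu_L: float = 0.02                 # subjective marginal benefit
    n_periods: int = 300               # 200 learning + 100 statistical periods
    hours_per_period: int = 240        # 10 days x 24 hours

config = ExperimentConfig()
```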
Experiments with the Q-network: for the design of the neural network, we set up a network with two hidden layers. The input layer receives the state; to make full use of the information given by the environment, the state features are designed as the contract prices of all agents at the previous time step and the number of users subscribed to the intelligent agent, plus the contract prices of all agents at the current time step and the number of users subscribed to the intelligent agent, giving 24 input units. The output layer has six output units, one for each of the six price actions, and the output value represents the expected cumulative reward of selecting that action in the input state and then following the policy. We use Xavier initialization and train the network parameters with the RMSProp gradient-descent algorithm. We average the total rewards over 20 repeated runs of the whole experiment to determine the final performance of the agent using the Q-network-based Q-learning algorithm, and to compare the advantages and disadvantages of storing action values with a Q-network against the previous agent that used a Q-table-based Q-learning algorithm. Note that the state of the agent using the Q-table follows the settings of previous work and is designed as the combination of the PRS and PS indicators at the previous and current time steps.
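A sketch of the described network — 24 inputs, two hidden layers, six outputs, Xavier initialization, trained with RMSProp; the hidden-layer widths are our assumption since the text does not give them.

```python
import torch
import torch.nn as nn

def build_q_network(n_inputs=24, n_actions=6, hidden=(64, 64)):
    layers, last = [], n_inputs
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ReLU()]
        last = h
    layers.append(nn.Linear(last, n_actions))
    net = nn.Sequential(*layers)
    for m in net:
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)   # Xavier initialization
            nn.init.zeros_(m.bias)
    return net

q_net = build_q_network()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
```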
TABLE 4-1 average rewards for each of 20 trials of all agencies
Figure PCTCN2018093753-APPB-000032
TABLE 4-2 average rewards for each of 20 experiments
Figure PCTCN2018093753-APPB-000033
From the above two tables we can see that the agents using Q-learning are clearly more competitive than the other strategies. The greedy strategy is the only agent other than the reinforcement learning agents with a positive total profit; the fixed-strategy agent has the lowest total profit and is easily beaten because its prices never change; and the balancing and greedy agents rank second and third in total profit, which shows the advantage of adaptive strategies, while the large lead of the reinforcement learning agents shows the advantage of learning from the past. Storing past experience with a Q-network works better than storing it with a Q-table, which shows the importance of a more accurate state representation. As can also be seen from the figure, the revenue of the agent using the Q-network is more stable, staying around 1,500,000, while the revenue of the agent using the Q-table fluctuates more and is relatively unstable, as shown in Fig. 5.
Experiments with multiple agents: we also perform multi-agent experiments, which first require modifying part of the grid environment configuration. Since we now have two groups of producers and two groups of consumers, to stay as close as possible to the original experimental parameters we set the number of daytime-only consumers to 500 and the number of all-day consumers to 500. The hourly consumption profile of a daytime-only consumer over a day is {0,0,0,0,0,0, 10,10,10,10,10,10,10,10,10,10,10,10, 0,0,0,0,0,0}, i.e., no power is used in the first six and last six hours of the day, while an all-day consumer uses 10 basic power units every hour. The number of wind producers is set to 50 and the number of solar producers to 50; wind generation runs all day at 100 basic power units per hour, while the daily generation profile of solar power is {0,0,0,0,0,0, 100,100,100,100,100,100,100,100,100,100,100,100, 0,0,0,0,0,0}. We also slightly adjust the selection preferences of the different kinds of producers and consumers. In this experiment, for a better comparison with previous work, we use a Q-table as the structure for storing action values, while the input state adds, compared with previous work, a feature indicating whether the current time is day or night.
Table 4-3 selection preferences of different kinds of users
Figure PCTCN2018093753-APPB-000034
Figure PCTCN2018093753-APPB-000035
In this experiment, because the external conditions have changed, in order to show that the "external competition, internal cooperation" multi-agent design outperforms the original single intelligent agent, we put the multi-agent and single-agent versions into the same experiment to compete. The experiment is run repeatedly to obtain the following results.
TABLE 4-4 average profits from 20 runs in various user environments
Figure PCTCN2018093753-APPB-000036
From the above table and the curves in Fig. 6, it can be seen that the multi-agent and single-agent versions of the intelligent agent dominate the competition absolutely, and the multi-agent version beats the single-agent version in every round, which shows that the multi-agent design is more attuned to the market and more competitive than the single-agent design. Meanwhile, the total revenue of every agent is greatly improved compared with the experiment of Section 4.2: with the fixed-price agent of the original experiment removed, and with the multi-agent intelligent agent added that adapts to the environment and adjusts prices to balance supply and demand, the whole market is dominated by agents capable of balancing supply and demand. Compared with the previous section, the supply-demand balance of the whole smart grid market is better maintained, the imbalance costs of the agents are greatly reduced, and the overall revenue level rises.
To verify that the "internal cooperation" design concept works, we run experiments in which the reward of each internal agent is set either to the overall reward of the whole agent
Figure PCTCN2018093753-APPB-000037
or to the reward function we designed
Figure PCTCN2018093753-APPB-000038
After 10 runs of each setting, the average total revenue obtained with the designed individual reward function is 23.37% higher than that obtained by directly using the overall reward function, which shows that the individual reward function works better than the overall one: each intelligent agent considers both its own interest and the overall interest, and maximizing its own interest on the basis of the overall interest is more flexible and more targeted than considering the overall interest alone.
[Amended 03.01.2019 under Rule 91] Experiments with the multi-agent design under real-data simulation: first, after data cleaning, the users are clustered and, guided by other empirical data, divided into 5 classes, with class sizes {215; 97; 317; 274; 79}. The resulting electricity usage curves are shown in Fig. 7 and Figs. 9-13.
It can be seen from the figures that the electricity usage curves of the different user classes differ greatly, which poses a great challenge to our agent. In addition, to model users more realistically, we model the users' selection behavior in the grid: in line with the general stickiness of grid users, each user is assigned a random psychological price within a certain range. When the contract price the user last signed is better than the user's psychological expectation, the user chooses to continue with it; otherwise, the user selects a power contract with a certain probability according to a re-ranking of the prices on offer.
TABLE 4-1 user Electricity price selection model
Figure PCTCN2018093753-APPB-000039
We select February 2013 from the London household electricity data as the consumers' consumption data, and to keep overall electricity usage balanced the two types of producers each take on half of the generation task. Although the overall supply and demand of the system are balanced, it is very difficult for the agent to keep its own supply and demand balanced at every moment, because each consumer's consumption behavior differs and each user's selection behavior also differs. The users' psychological prices are drawn randomly from the range [0.10, 0.15], the training period is 50, the evaluation period is 10, and the length of the time-series state is 3. The revenue over the final evaluation period is shown in Fig. 8.
We have discussed the pricing problem of retail agents in the smart grid retail market. We first applied DRL to retail agent design to address the problem of an infinite state space, and used LSTM and a DTW-based clustering mechanism to strengthen our agent so that it applies better in practical environments. By clustering customers, we designed a cooperative multi-agent deep reinforcement learning agent framework with a dedicated incentive function. Finally, by introducing household electricity consumption data from London, we verified the adaptability and strong competitiveness of the agent framework in a complex environment. As future work, we will explore applying more advanced DRL techniques (e.g., actor-critic algorithms) to our retail agent design to produce more efficient pricing strategies.
In addition, by considering real small-scale generation data and household power storage devices, the agent mechanism can be further generalized to a more realistic smart grid. Load forecasting for the power grid is also a popular research topic[29][30][31][32]; in future work we will start from this aspect, accurately classify and model the users, let the agent automatically identify a user's category, and then let the corresponding internal agent manage that user.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

  1. A multi-agent deep reinforcement learning agent method based on a smart grid is characterized by comprising the following steps:
    s1, calculating the corresponding action standard value under the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
    s2, establishing a multi-agent of 'external competition and internal cooperation' according to the types of the consumers and the producers;
    s3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents, the formula of which is:
    Figure PCTCN2018093753-APPB-100001
    wherein C represents the category of the consumers and P represents the category of the producers,
    Figure PCTCN2018093753-APPB-100002
    denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, κ_{t,P} denotes the amount of power generated by a producer of a given category at time t,
    Figure PCTCN2018093753-APPB-100003
    is the share of the imbalance charge attributed to the individual internal agent when computing its profit.
  2. The multi-agent deep reinforcement learning agent method as claimed in claim 1, wherein said step S1 further comprises the steps of:
    S11, initializing the parameters of the neural network;
    S12, initializing the state value at the beginning of each period of the operation cycle;
    S13, selecting an action at random with a certain probability, or otherwise selecting the action with the maximum value in the current state;
    S14, executing the selected action, obtaining the reward, and entering the next state;
    S15, calculating the standard (target) value y_t corresponding to the current state and updating the neural network parameters so that the stored Q(s_t, a_t) approaches y_t.
  3. The multi-agent deep reinforcement learning agent method as claimed in claim 2, wherein the action values are stored in the network parameters in step S15; each time a new state is entered, its feature values only need to be input into the neural network in sequence, and the action with the maximum Q(s, a) value can be selected from the output layer of the neural network as the next action to execute.
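A minimal sketch of the loop in steps S11-S15 together with the greedy selection of claim 3, written as a plain deep Q-learning update in PyTorch; the environment interface, the network size, and all hyper-parameters are assumptions made for illustration rather than the patented implementation.

```python
import random
import torch
import torch.nn as nn

n_features, n_actions = 3, 5            # e.g. a time-series state of length 3 and a discrete tariff menu
gamma, epsilon, lr = 0.95, 0.1, 1e-3    # assumed hyper-parameters

# S11: initialize the parameters of the neural network approximating Q(s, a).
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
loss_fn = nn.MSELoss()

def train_one_period(env, steps_per_period=24):
    """Run one operation period; `env` is a hypothetical market simulation with reset()/step()."""
    state = torch.tensor(env.reset(), dtype=torch.float32)          # S12: initial state of the period
    for _ in range(steps_per_period):
        if random.random() < epsilon:                                # S13: explore with some probability...
            action = random.randrange(n_actions)
        else:                                                        # ...otherwise take the max-value action
            action = int(q_net(state).argmax())
        next_obs, reward, done = env.step(action)                    # S14: execute the action, obtain the reward
        next_state = torch.tensor(next_obs, dtype=torch.float32)
        with torch.no_grad():                                        # S15: standard (target) value y_t
            y_t = reward if done else reward + gamma * q_net(next_state).max().item()
        q_sa = q_net(state)[action]                                  # stored Q(s_t, a_t)
        loss = loss_fn(q_sa, torch.tensor(float(y_t)))               # pull Q(s_t, a_t) toward y_t
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
        if done:
            break
```

Claim 3's point is that no explicit table of action values is kept: feeding the current state's features through the network yields all Q(s, a) values at once, and the argmax over the output layer is taken as the next action.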
  4. The multi-agent deep reinforcement learning agent method as claimed in claim 3, wherein said step S2 comprises the steps of:
    S21, classifying the consumers according to their differences in power consumption;
    and S22, classifying the producers according to the actual power generation situation.
  5. The multi-agent deep reinforcement learning agent method as claimed in claim 4, wherein, through the reward function in step S3, each agent considers both its own benefit and the overall benefit when selecting an action.
  6. The multi-agent deep reinforcement learning agent method as claimed in claim 5, wherein the consumers are divided into daytime-consumption users and all-day-consumption users according to their power consumption patterns.
  7. The multi-agent deep reinforcement learning agent method as claimed in claim 6, wherein the producers are divided into all-day generators and daytime generators according to the actual power generation situation.
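As an illustration of the split in claims 6 and 7, a consumer's hourly profile could be classified by its share of night-time consumption; the night-hour window and threshold below are assumptions, and producers could be split analogously by whether they generate around the clock or only during the day.

```python
import numpy as np

def classify_consumer(hourly_profile, night_hours=range(0, 6), night_share_threshold=0.15):
    """Illustrative rule: users with a non-negligible share of night-time consumption are
    treated as all-day consumers, otherwise as daytime consumers (threshold is an assumption)."""
    profile = np.asarray(hourly_profile, dtype=float)
    night_share = profile[list(night_hours)].sum() / max(profile.sum(), 1e-9)
    return "all_day" if night_share >= night_share_threshold else "daytime"
```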
CN201880000858.4A 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid Active CN110945542B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/093753 WO2020000399A1 (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning proxy method based on intelligent grid

Publications (2)

Publication Number Publication Date
CN110945542A true CN110945542A (en) 2020-03-31
CN110945542B CN110945542B (en) 2023-05-05

Family

ID=68984589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880000858.4A Active CN110945542B (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid

Country Status (2)

Country Link
CN (1) CN110945542B (en)
WO (1) WO2020000399A1 (en)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369108A (en) * 2020-02-20 2020-07-03 华中科技大学鄂州工业技术研究院 Power grid real-time pricing method and device
CN111709706B (en) * 2020-06-09 2023-08-04 国网安徽省电力有限公司安庆供电公司 Automatic generation method of new equipment starting scheme based on self-adaptive pattern recognition
CN111817349B (en) * 2020-07-31 2023-08-25 三峡大学 Multi-micro-grid passive off-grid switching control method based on deep Q learning
CN112446470B (en) * 2020-11-12 2024-05-28 北京工业大学 Reinforced learning method for coherent synthesis
CN112819144B (en) * 2021-02-20 2024-02-13 厦门吉比特网络技术股份有限公司 Method for improving convergence and training speed of neural network with multiple agents
US11817704B2 (en) * 2021-02-23 2023-11-14 Distro Energy B.V. Transparent customizable and transferrable intelligent trading agent
CN112884129B (en) * 2021-03-10 2023-07-18 中国人民解放军军事科学院国防科技创新研究院 Multi-step rule extraction method, device and storage medium based on teaching data
CN113469839A (en) * 2021-06-30 2021-10-01 国网上海市电力公司 Smart park optimization strategy based on deep reinforcement learning
CN113570039B (en) * 2021-07-22 2024-02-06 同济大学 Block chain system based on reinforcement learning optimization consensus
CN113555870B (en) * 2021-07-26 2023-10-13 国网江苏省电力有限公司南通供电分公司 Q-learning photovoltaic prediction-based power distribution network multi-time scale optimal scheduling method
CN113687960B (en) * 2021-08-12 2023-09-29 华东师范大学 Edge computing intelligent caching method based on deep reinforcement learning
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114329936B (en) * 2021-12-22 2024-03-29 太原理工大学 Virtual fully-mechanized production system deduction method based on multi-agent deep reinforcement learning
CN114362221B (en) * 2022-01-17 2023-10-13 河海大学 Regional intelligent power grid partition evaluation method based on deep reinforcement learning
CN114881688B (en) * 2022-04-25 2023-09-22 四川大学 Intelligent pricing method for power distribution network considering distributed resource interaction response
CN115065728B (en) * 2022-06-13 2023-12-08 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115310999B (en) * 2022-06-27 2024-02-02 国网江苏省电力有限公司苏州供电分公司 Enterprise electricity behavior analysis method and system based on multi-layer perceptron and sequencing network
US11956138B1 (en) 2023-04-26 2024-04-09 International Business Machines Corporation Automated detection of network anomalies and generation of optimized anomaly-alleviating incentives
CN116912356B (en) * 2023-09-13 2024-01-09 深圳大学 Hexagonal set visualization method and related device
CN117648123A (en) * 2024-01-30 2024-03-05 中国人民解放军国防科技大学 Micro-service rapid integration method, system, equipment and storage medium
CN117808174B (en) * 2024-03-01 2024-05-28 山东大学 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107067190A (en) * 2017-05-18 2017-08-18 厦门大学 The micro-capacitance sensor power trade method learnt based on deeply
CN107179077A (en) * 2017-05-15 2017-09-19 北京航空航天大学 A kind of self-adaptive visual air navigation aid based on ELM LRF
CN107623337A (en) * 2017-09-26 2018-01-23 武汉大学 A kind of energy management method for micro-grid

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332373A1 (en) * 2009-02-26 2010-12-30 Jason Crabtree System and method for participation in energy-related markets
CN102622269B (en) * 2012-03-15 2014-06-04 广西大学 Java agent development (JADE)-based intelligent power grid power generation dispatching multi-Agent system
CN105022021B (en) * 2015-07-08 2018-04-17 国家电网公司 A kind of state identification method of the Electric Energy Tariff Point Metering Device based on multiple agent
CN105550946A (en) * 2016-01-28 2016-05-04 东北电力大学 Multi-agent based electricity utilization strategy capable of enabling residential users to participate in automated demand response


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639756A (en) * 2020-06-12 2020-09-08 南京大学 Multi-agent reinforcement learning method based on game reduction
CN112215350A (en) * 2020-09-17 2021-01-12 天津(滨海)人工智能军民融合创新中心 Smart agent control method and device based on reinforcement learning
CN112215350B (en) * 2020-09-17 2023-11-03 天津(滨海)人工智能军民融合创新中心 Method and device for controlling agent based on reinforcement learning
CN111967199B (en) * 2020-09-23 2022-08-05 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN111967199A (en) * 2020-09-23 2020-11-20 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN112286203A (en) * 2020-11-11 2021-01-29 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112286203B (en) * 2020-11-11 2021-10-15 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN114619907A (en) * 2020-12-14 2022-06-14 中国科学技术大学 Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN114619907B (en) * 2020-12-14 2023-10-20 中国科学技术大学 Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN114123178A (en) * 2021-11-17 2022-03-01 哈尔滨工程大学 Intelligent power grid partition network reconstruction method based on multi-agent reinforcement learning
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
WO2023103763A1 (en) * 2021-12-09 2023-06-15 Huawei Technologies Co., Ltd. Methods, systems and computer program products for protecting a deep reinforcement learning agent
CN116599061A (en) * 2023-07-18 2023-08-15 国网浙江省电力有限公司宁波供电公司 Power grid operation control method based on reinforcement learning
CN116599061B (en) * 2023-07-18 2023-10-24 国网浙江省电力有限公司宁波供电公司 Power grid operation control method based on reinforcement learning

Also Published As

Publication number Publication date
CN110945542B (en) 2023-05-05
WO2020000399A1 (en) 2020-01-02

Similar Documents

Publication Publication Date Title
CN110945542B (en) Multi-agent deep reinforcement learning agent method based on smart grid
Al Mamun et al. A comprehensive review of the load forecasting techniques using single and hybrid predictive models
Antonopoulos et al. Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review
Chen et al. Trading strategy optimization for a prosumer in continuous double auction-based peer-to-peer market: A prediction-integration model
Peters et al. A reinforcement learning approach to autonomous decision-making in smart electricity markets
Yang et al. Recurrent deep multiagent q-learning for autonomous brokers in smart grid.
Gao et al. A multiagent competitive bidding strategy in a pool-based electricity market with price-maker participants of WPPs and EV aggregators
Chen et al. Customized rebate pricing mechanism for virtual power plants using a hierarchical game and reinforcement learning approach
Chuang et al. Deep reinforcement learning based pricing strategy of aggregators considering renewable energy
Ghosh et al. VidyutVanika: A reinforcement learning based broker agent for a power trading competition
Gao et al. Bounded rationality based multi-VPP trading in local energy markets: a dynamic game approach with different trading targets
Lincoln et al. Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade
Wang et al. A hybrid-learning based broker model for strategic power trading in smart grid markets
Ruan et al. Graph deep learning-based retail dynamic pricing for demand response
Qian et al. Automatically improved VCG mechanism for local energy markets via deep learning
Wang et al. Multi-agent simulation for strategic bidding in electricity markets using reinforcement learning
Okwuibe et al. Intelligent bidding strategies for prosumers in local energy markets based on reinforcement learning
Wu et al. Peer-to-peer energy trading optimization for community prosumers considering carbon cap-and-trade
Peters et al. Autonomous data-driven decision-making in smart electricity markets
Kell et al. Machine learning applications for electricity market agent-based models: A systematic literature review
Kell et al. A systematic literature review on machine learning for electricity market agent-based models
Umarov et al. The cognitive model and its implementation of the enterprise Uzmobile
Xu et al. Deep reinforcement learning for competitive DER pricing problem of virtual power plants
Wang Conjectural variation-based bidding strategies with Q-learning in electricity markets
Sunny et al. A Comprehensive Review of the Load Forecasting Techniques Using Single and Hybrid Predictive Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant