CN110945542A - Multi-agent deep reinforcement learning agent method based on smart power grid - Google Patents

Multi-agent deep reinforcement learning agent method based on smart power grid

Info

Publication number
CN110945542A
CN110945542A (application CN201880000858.4A); granted publication CN110945542B
Authority
CN
China
Prior art keywords
agent
reinforcement learning
power
action
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880000858.4A
Other languages
Chinese (zh)
Other versions
CN110945542B (en)
Inventor
侯韩旭 (Hou Hanxu)
郝建业 (Hao Jianye)
杨耀东 (Yang Yaodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan University of Technology filed Critical Dongguan University of Technology
Publication of CN110945542A publication Critical patent/CN110945542A/en
Application granted granted Critical
Publication of CN110945542B publication Critical patent/CN110945542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention is applicable to the technical field of power automation control and provides a multi-agent deep reinforcement learning agent method based on a smart grid, which comprises the following steps: S1, calculating the action standard value corresponding to the current state according to the reward obtained by the selected action, and updating the parameters of the neural network; S2, establishing an "external competition, internal cooperation" multi-agent system according to the types of the consumers and producers; S3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents. The input layer of the neural network can accept the feature values characterizing the state directly, whereas a Q-table requires the feature values to be discretized to reduce the state space.

Description

Multi-agent deep reinforcement learning agent method based on smart power grid Technical Field
The invention belongs to the technical field of power automation control, and particularly relates to a multi-agent deep reinforcement learning agent method based on a smart grid.
Background
A smart grid uses a series of digital communication technologies to modernize the power grid[1][2]. A country's economy, national defense and even the safety of its residents all depend on the reliability of the power grid. In actual operation, a smart grid not only allows users to select suitable power packages in real time, but also actively allocates power resources to achieve balanced supply. The grid can adjust to and feed back market fluctuations in real time, provides bidirectional information and communication services and comprehensive awareness of grid conditions, and is an important component of 21st-century modernization.
Previously, power grid technology was primarily designed to supply power unidirectionally from large centralized power plants to distributed consumers, such as homes and industrial facilities. More recently, a popular research topic in smart grids has been predicting users' power demand so that electricity prices and bidding strategies can be adjusted in advance to maximize the agent's profit[3]. The agent mechanism is another core element of smart grid design: through agents, the smart grid coordinates local producers, local consumers, large power plants and other participants, and applies market adjustment mechanisms to achieve a win-win outcome. One key issue is achieving bidirectional communication between the grid and local wind and solar producers; Reddy et al.[4] first proposed using a reinforcement learning framework to design agents for the local grid as a solution to this problem. A key element of the reinforcement learning framework is the state space, and in [4] strategies are learned from manually constructed features, which limits the number of economic signals an agent can accommodate and its ability to absorb new signals when the environment changes. Reinforcement learning has been applied in e-commerce to solve many practical problems, mainly by learning optimal strategies through the agent's interaction with the environment; for example, Pardoe et al.[5] proposed a data-driven approach to designing electronic auctions based on reinforcement learning. In the electric power field, reinforcement learning has been used to study wholesale market trading strategies[6] or to help build physical control systems. Examples in electricity wholesale include [7], which mainly studies bidding strategies in wholesale electricity auctions, and Ramavajjala et al.[8], who study Next State Policy Iteration (NSPI) as an extension of Least Squares Policy Iteration (LSPI)[9] and show its benefit on the pre-delivery commitment problem in wind power generation. Physical control applications of reinforcement learning include load and frequency control of the power grid and autonomous monitoring applications[10]. However, most previous work on power grid agents idealizes the grid environment: on one hand, highly simplified settings are used to simulate the complex operating mechanisms of a real grid; on the other hand, the information provided by the environment is highly abstracted when the algorithm is designed, so many important details are lost, leading to inaccurate decisions.
On the other hand, customers in the smart grid exhibit a variety of power consumption or production patterns, which indicates that different pricing strategies are needed for different types of customers. Following this idea, a retail agent can be regarded as a multi-agent system in which each internal agent is responsible for pricing one class of power consumers or producers. For example, Wang et al. assign an independent pricing agent to each customer in their agent framework[23]. However, the authors run independent reinforcement learning processes for the different customers and treat the profit of the whole agent as the immediate reward of each internal agent. This does not distinguish each internal agent's individual contribution to the agent's profit and therefore does not motivate the agents to learn the best strategies.
Unlike traditional machine learning, reinforcement learning gradually learns, through constant interaction with the environment, a strategy that maximizes the cumulative reward[14]. Reinforcement learning mimics the human cognitive process and is studied in many disciplines, such as game theory and cybernetics. Reinforcement learning lets an agent learn a strategy from the environment, which is typically modeled as a Markov Decision Process (MDP)[15], and many algorithms in this setting use dynamic programming techniques[16][17][18].
The basic reinforcement learning model includes:
a set of environment and agent states S = {s1, s2, …, sn};
a set of agent actions A = {a1, a2, …, an};
a transition function between states δ(s, a) → s′;
a reward function r(s, a).
In many works, if the agent is assumed to be able to observe the current environmental state, the setting is said to be fully observable; otherwise it is partially observable. A reinforcement learning agent communicates with the environment at discrete time steps. As shown in Fig. 1, at each time t the agent obtains an observation that generally includes the reward r_t of that time step, then selects an action a_t from the available actions; this action acts on the environment, which under the action reaches a new state s_{t+1}, the agent obtains the reward r_{t+1} of the new time step, and the process repeats. In interacting with the environment, the reinforcement learning agent gradually learns a policy π: S → A that maximizes the cumulative reward. To learn a near-optimal policy, the agent must adjust its strategy over a long period of time. The basic setting and learning process of reinforcement learning are thus well suited to the power grid domain.
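To make this interaction loop concrete, here is a minimal Python sketch of the agent-environment cycle; the environment, its states, actions and rewards are illustrative placeholders chosen by us, not the grid market defined later.

```python
import random

class ToyEnvironment:
    """Placeholder environment: states, transition delta(s, a) -> s' and reward r(s, a) are illustrative only."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # toy transition and reward
        self.state = (self.state + action) % 5
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward

def random_policy(state, actions):
    # stands in for the learned policy pi: S -> A
    return random.choice(actions)

env = ToyEnvironment()
actions = [0, 1, 2]
state = env.reset()
for t in range(10):                         # discrete time steps
    action = random_policy(state, actions)
    next_state, reward = env.step(action)   # environment reaches s_{t+1}, returns the reward
    state = next_state
```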
Regarding how the optimal policy is found, we introduce the value function method. The value function approach tries to find a policy that maximizes the return by maintaining estimates of the expected returns of some policies. To formally define optimality, we define the value of a policy π as:
V^π(s) = E[R | s, π]    (2-1)
where R denotes the random return obtained by following policy π from the initial state s. Define V*(s) as the maximum possible value of V^π(s):
V*(s) = max_π V^π(s)    (2-2)
A policy that achieves these optimal values in every state is called an optimal policy. Although the state values suffice to define optimality, it is also useful to define action values. Given a state s, an action a and a policy π, the action value of the pair (s, a) under policy π is defined as:
Q^π(s, a) = E[R | s, a, π]    (2-3)
where R denotes the cumulative reward obtained by first taking action a in state s and then following policy π. From MDP theory, given the Q values of the optimal policy, we can always act optimally by simply selecting the action with the highest value in each state. The action value function of the optimal policy is denoted Q*. Knowing the optimal action values is sufficient to know how to act optimally.
When the transition function and reward function of the environment are both unknown, we can use Q-learning to update the action value function:
Q_t(s, a) ← (1 − α_t) Q_{t−1}(s, a) + α_t [r_t + γ max_{a′} Q_{t−1}(s′, a′)]    (2-4)
where α_t is the learning rate, r_t is the reward at the current time, and γ is the discount factor. The current action value Q_t(s, a) is updated once after each interaction with the environment: part of the previous value of Q for that state and action is retained, Q(s, a) is re-estimated from the reward obtained at the current time and the new state reached, and the two parts are combined into the new action value for that time.
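A minimal tabular sketch of update (2-4); the learning-rate, discount and exploration values here are illustrative choices of ours, not the experimental settings.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                      # Q[(state, action)] -> estimated action value

def q_update(s, a, r, s_next, actions):
    """One application of Q_t(s,a) <- (1-alpha)*Q_{t-1}(s,a) + alpha*[r + gamma*max_a' Q_{t-1}(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def epsilon_greedy(s, actions):
    """Explore with probability epsilon, otherwise pick the action with the largest stored value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# example usage with symbolic states
q_update("s0", 1, 0.5, "s1", actions=[0, 1, 2])
```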
An artificial neural network is a computational model used in machine learning, computer science and other research fields[19][20]. Artificial neural networks are based on a large number of interconnected elementary units, artificial neurons.
Generally, the artificial neurons of adjacent layers are connected to each other; signals are fed in at the first (input) layer and read out at the last (output) layer. Current deep learning programs typically have thousands to millions of neural nodes and millions of connections. The goal of artificial neural networks is to solve problems in a human-like manner, although some kinds of neural networks are more abstract. The "network" in a neural network refers to the connections between the artificial neurons of different layers. A typical artificial neural network is defined by three types of parameters:
the connection mode of different layer neurons;
the weights on these connections, which can be updated in a later learning process;
the activation function that converts a neuron's weighted input into its output activation.
Mathematically, the function f(x) represented by a neural network is defined as a composition of other functions g_i(x), which is conveniently represented as a network structure with arrows depicting the dependencies between variables. One widely used form is the nonlinear weighted sum:
f(x) = K(∑_i w_i g_i(x))    (2-5)
where K is an activation function. The most important property of the activation function is that it provides a smooth transition as the input changes, i.e., a small change in input produces a small change in output. In this way the inputs are continually adjusted, according to the weights on the connections, until the final output is formed. Such an output is usually not yet the result we want, so the neural network also needs to learn; indeed, the most attractive property of neural networks is the possibility of learning. Given a specific task to be learned and a class F of candidate functions, learning means using a series of observations to find a function f* in F that solves the task. To this end we define a loss function C: F → ℝ such that, for the optimal function f*, no other solution has a smaller loss value than f*:
C(f*) ≤ C(f) for all f ∈ F    (2-6)
the loss function is an important concept of learning, which is a measure of the distance of a particular solution from the optimal solution. The learning process searches the solution space of the problem to find the function with the smallest loss function value. For application problems where the solution needs to be found in the data, the loss must be a function of these actually observed samples. The loss function is usually defined as a statistic, since generally only statistically observed samples can be evaluated. Therefore, for the problem of finding the model function f, it is the minimization of the loss function C ═ E [ (f (x) -y)2]Where the data pairs (x, y) come from certain distributions D. In practical applications, we usually have only N limited samples, so we can only minimize
Figure PCTCN2018093753-APPB-000003
Figure PCTCN2018093753-APPB-000004
Thus, the loss function is minimized over some samples of the data, rather than over the theoretical distribution of the entire data set. When the sample-based loss function values are minimized, we find the optimal parameters of the neural network over these samples.
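As a small illustration of minimizing the sample-based loss Ĉ rather than the theoretical expectation, the sketch below fits a linear model f(x) = w·x by gradient descent on N synthetic observations; the data and step size are placeholders of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=(N, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = x @ true_w + 0.1 * rng.normal(size=N)    # observed samples (x_i, y_i)

w = np.zeros(3)
lr = 0.05
for _ in range(500):
    pred = x @ w
    grad = 2.0 / N * x.T @ (pred - y)        # gradient of (1/N) * sum (f(x_i) - y_i)^2
    w -= lr * grad

empirical_loss = np.mean((x @ w - y) ** 2)   # the minimized sample-based loss C_hat
```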
Q-network: since a neural network can be used as a function approximator, the Q function in reinforcement learning can also be fitted with a neural network[21][22]. A Q-table stores the values of state-action pairs; with a Q-network there is no need to consider discretizing the state space, as the feature values representing a state are fed directly into the network and the parameters of the network fit the Q-value function, which naturally solves the problem of an infinite state space. However, unlike traditional neural network applications, reinforcement learning does not have samples available from the start: new rewards and observations are obtained through continual interaction with the environment, and reinforcement learning also has no sample labels to judge whether the model's outputs are accurate. If, however, we set aside the traditional application of neural networks and consider only the network's own function-fitting ability, the neural network can be treated, like a Q-table, as a tool for storing Q(s, a); each time the agent interacts with the environment, it can update the parameters of the neural network just as it would update a Q-table, so that the Q(s, a) the network outputs approaches the value currently considered correct.
We now consider how to design the input, output and loss function of a Q-network so that it is functionally identical to a Q-table. First, the input is still the state S, but instead of discretizing the state space from infinite to finite as in traditional reinforcement learning, each feature representing the state can be used directly as an input to the neural network. Similarly, just as a Q-table stores for each state a row of values representing the estimated cumulative reward of each action in that state, each node of the neural network's output layer represents an action, and the output value of each node is the estimated cumulative value Q(S, a_i) of that action in the input state S. By designing the input and output layers in this way, the neural network realizes the function of storing Q(s, a). Meanwhile, the parameters of the network must be updated, and by the definition of the loss function there is no existing label y_i for an input state; however, following the Q-learning update rule for action values, we can update the parameters of the network based on the Q(s, a) already stored in it and the reward r_t at the current time. For example, at time t the agent is in state s_t; after it selects action a_t according to the policy, it enters the next state s_{t+1} and obtains reward r_t. Now, when we update the parameters of the network, we want Q(s_t, a_t) to be updated towards the update target r_t + max_{a′} Q(s′, a′) used in Q-learning:
C = [Q_t(s_t, a_t) − (r_t + max_{a′} Q_{t−1}(s_{t+1}, a′))]²    (2-8)
so that the action value at the current time approaches this update target. Likewise, a learning rate is set for the update. Thus, the process of using a Q-network to store and update action values is the same as using a Q-table directly; the only difference is that the input layer of the neural network can accept the feature values characterizing the state directly, whereas a Q-table requires the feature values to be discretized to reduce the state space.
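A minimal PyTorch sketch of loss (2-8) for a single transition; the network width, optimizer and learning rate are illustrative choices of ours and are not specified in the text.

```python
import torch
import torch.nn as nn

n_features, n_actions = 24, 6               # e.g. state features and price actions
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
gamma = 0.95

def q_network_update(s_t, a_t, r_t, s_next):
    """Move the stored Q(s_t, a_t) towards the Q-learning target r_t + gamma * max_a' Q(s_{t+1}, a')."""
    with torch.no_grad():
        target = r_t + gamma * q_net(s_next).max()
    q_sa = q_net(s_t)[a_t]
    loss = (q_sa - target) ** 2              # squared error, as in loss (2-8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# example usage with random feature vectors standing in for states
s_t, s_next = torch.randn(n_features), torch.randn(n_features)
q_network_update(s_t, a_t=2, r_t=1.0, s_next=s_next)
```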
Reference to the literature
[1] M. Amin and B. Wollenberg. Toward a smart grid: Power delivery for the 21st century. IEEE Power and Energy Magazine, 3(5):34-41, 2005.
[2] C. Gellings, M. Samotyj, and B. Howe. The future's power delivery system. IEEE Power and Energy Magazine, 2(5):40-48, 2004.
[3]Wang X,Zhang M,Ren F.Load Forecasting in a Smart Grid through Customer Behaviour Learning Using L1-Regularized Continuous Conditional Random Fields[C].Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems.International Foundation for Autonomous Agents and Multiagent Systems,2016:817-826.
[4]Reddy P P,Veloso M M.Strategy learning for autonomous agents in smart grid markets[J].2011.
[5]Pardoe D,Stone P,Saar-Tsechansky M,et al.Adaptive Auction Mechanism Design and the Incorporation of Prior Knowledge[J].INFORMS Journal on Computing,2010,22(3):353-370.
[6]Babic J,Podobnik V.An analysis of power trading agent competition 2014[M].Agent-Mediated Electronic Commerce.Designing Trading Strategies and Mechanisms for Electronic Markets.Springer International Publishing,2014:1-15.
[7]Petrik M,Taylor G,Parr R,et al.Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes[J].Computer Science,2010.
[8]Ramavajjala V,Elkan C.Policy iteration based on a learned transition model[C].European Conference on Machine Learning and Knowledge Discovery in Databases.Springer-Verlag,2012:211-226.
[9]Lagoudakis M G,Parr R.Least-squares policy iteration[M].JMLR.org,2003.
[10]Venayagamoorthy G K.Potentials and promises of computational intelligence for smart grids[C].Power & Energy Society General Meeting,2009.PES'09.IEEE.IEEE,2009:1-6.
[11] Wikipedia [EB/OL]. https://en.wikipedia.org/wiki/Smart_grid
[12] EPRI [EB/OL]. https://www.epri.com/#/about/epri
[13]Kintner-Meyer M C,Chassin D P,Kannberg L D,et al.GridWise:The benefits of a transformed energy system[J].Pacific Northwest National Laboratory under contract with the United States Department of Energy,2008:25.
[14]Sutton R S,Barto A G.Reinforcement learning:An introduction[M].Cambridge:MIT press,1998.
[15]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[16]Lewis F L,Vrabie D.Reinforcement learning and adaptive dynamic programming for feedback control[J].IEEE circuits and systems magazine,2009,9(3).
[17]Busoniu L,Babuska R,De Schutter B,et al.Reinforcement learning and dynamic programming using function approximators[M].CRC press,2010.
[18]Szepesvári C,Kioloa M.Reinforcement learning:dynamic programming[J].University of Alberta,MLSS,2008,8.
[19] Wikipedia [EB/OL]. https://en.wikipedia.org/wiki/Artificial_neural_network
[20]Wang S C.Artificial neural network[M].Interdisciplinary computing in java programming.Springer US,2003:81-100.
[21]Mnih V,Kavukcuoglu K,Silver D,et al.Playing Atari with Deep Reinforcement Learning[J].Computer Science,2013.
[22]Huang B Q,Cao G Y,Guo M.Reinforcement learning neural network to the problem of autonomous mobile robot obstacle avoidance[C].Machine Learning and Cybernetics,2005.Proceedings of 2005 International Conference on.IEEE,2005,1:85-89.
[23] DoE [EB/OL]. http://www.eia.doe.gov, 2010.
[24]Olfati-Saber R,Fax J A,Murray R M.Consensus and cooperation in networked multi-agent systems[J].Proceedings of the IEEE,2007,95(1):215-233.
[25]Ferber J.Multi-agent systems:an introduction to distributed artificial intelligence[M].Reading:Addison-Wesley,1999.
[26]Littman M L.Markov games as a framework for multi-agent reinforcement learning[C].Proceedings of the eleventh international conference on machine learning.1994,157:157-163.
[27]Tan M.Multi-agent reinforcement learning:Independent vs.cooperative agents[C].Proceedings of the tenth international conference on machine learning.1993:330-337.
[28]Wiering M.Multi-agent reinforcement learning for traffic light control[C].ICML.2000:1151-1158.
[29]Hernández L,Baladron C,Aguiar J M,et al.A multi-agent system architecture for smart grid management and forecasting of energy demand in virtual power plants[J].IEEE Communications Magazine,2013,51(1):106-113.
[30]Niu D,Wang Y,Wu D D.Power load forecasting using support vector machine and ant colony optimization[J].Expert Systems with Applications,2010,37(3):2531-2539.
[31]Li H Z,Guo S,Li C J,et al.A hybrid annual power load forecasting model based on generalized regression neural network with fruit fly optimization algorithm[J].Knowledge-Based Systems,2013,37:378-387.
[32]Gong S,Li H.Dynamic spectrum allocation for power load prediction via wireless metering in smart grid[C].Information Sciences and Systems(CISS),2011 45th Annual Conference on.IEEE,2011:1-6
[33]Xishun Wang,Minjie Zhang,and Fenghui Ren.A hybrid-learning based broker model for strategic power trading in smart grid markets.Knowledge-Based Systems,119,2016.
[34]Electricity consumption in a sample of london households,2015.https://data.london.gov.uk/dataset/smartmeter-energyuse-data-in-london-households.
[35]S Hochreiter and J Schmidhuber.Long short-term memory.Neural Computation,9(8):1735–1780, 1997.
Disclosure of Invention
The invention aims to provide a multi-agent deep reinforcement learning agent method based on a smart grid, in order to solve the problem that the agent's state space is infinite.
The invention is realized in such a way that a multi-agent deep reinforcement learning agent method based on a smart grid comprises the following steps:
s1, calculating the corresponding action standard value under the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
s2, establishing a multi-agent of 'external competition and internal cooperation' according to the types of the consumers and the producers;
s3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents, the formula of which is:
Figure PCTCN2018093753-APPB-000005
Figure PCTCN2018093753-APPB-000006
wherein C represents the category of the consumers and P represents the category of the producers,
Figure PCTCN2018093753-APPB-000007
denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, κ_{t,P} denotes the amount of power generated by a producer of a given category at time t, and
Figure PCTCN2018093753-APPB-000008
is the share of the imbalance charge attributed to the individual internal agent when computing its profit.
The further technical scheme of the invention is as follows: the step S1 further includes the following steps:
s11, initializing parameters of the neural network;
s12, initializing the state value at the beginning of each period in the operation period;
s13, selecting the state value by using the probability or selecting the maximum action value in the current state;
s14, executing the selected action value and entering the next state after obtaining the reward;
s15, calculating the standard value corresponding to the current state and updating the neural network parameters so that the stored Q(s_t, a_t) approaches y_t.
The further technical scheme of the invention is as follows: in step S15, the action values are stored in the network parameters; each time a new state is entered, its feature values only need to be fed into the neural network in order, and the action with the largest Q(s, a) value can be selected from the output layer of the neural network as the next action to perform.
The further technical scheme of the invention is as follows: the step S2 includes the following steps:
s21, classifying the consumers according to the power consumption difference;
and S22, classifying the producers according to the actual power generation situation.
The further technical scheme of the invention is as follows: in step S3, through the reward function, each agent considers both its own benefit and the overall benefit when selecting an action.
The further technical scheme of the invention is as follows: the consumers are divided into daytime consumers and all-day consumers according to the situation of power consumption.
The further technical scheme of the invention is as follows: the producer is divided into a whole-day generator and a daytime generator according to the actual power generation condition.
The invention has the beneficial effects that: the input layer of the neural network may accept direct input of values of features that characterize the state, while the Q-table needs to discretize the feature values to reduce the state space.
Drawings
Fig. 1 is a schematic diagram of a classic scenario of reinforcement learning.
FIG. 2 is a schematic diagram, provided by an embodiment of the present invention, of a neural network with one hidden layer: neurons in the first layer pass data through synapses to neurons in the second layer, and neurons in the second layer pass data through synapses to neurons in the third layer; the synapses store parameters, called weights, that manipulate the data in the computation.
Fig. 3 is a schematic diagram of a proxy framework.
Fig. 4 is a schematic diagram of the recurrent (LSTM-based) DQN.
Fig. 5 is a graph of the revenue of each agent in each of the 20 experiments.
FIG. 6 is a schematic diagram of the revenue of each agent in each of the 20 experiments in the multi-type user environment.
[Amended 03.01.2019 under Rule 91] FIG. 7 is a first graph of electricity usage for different types of users.
FIG. 8 is a schematic diagram of the agents' revenue over the evaluation period.
[Amended 03.01.2019 under Rule 91] FIG. 9 is a second graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 10 is a third graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 11 is a fourth graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 12 is a fifth graph of electricity usage for different types of users.
[Amended 03.01.2019 under Rule 91] FIG. 13 is a sixth graph of electricity usage for different types of users.
Detailed Description
This work improves the agent's negotiation algorithm in two respects. First, it solves the problem that the agent's state space is infinite. Second, the local environment is modified slightly to make the scenario more realistic, and correspondingly an "external competition, internal cooperation" multi-agent design is proposed to make the agent more competitive. Finally, we introduce real electricity usage data and, with the help of some sequence-modeling techniques, help our agent framework learn effective pricing strategies in the more complex environment.
The smart grid is set up as a local market similar to [4]; the specific types of consumers and producers and the way electricity is produced and used are designed only in the second improvement. In the local market there are consumers who consume power and small producers who generate power, and several agents buy and sell power between the consumers and small producers. Agents are needed because it is inconvenient for small producers and consumers to coordinate directly; through the intermediate link of the agent, electricity users can buy and sell power conveniently, resources can be coordinated better, and the supply-demand balance of power resources is guaranteed. Concretely, every hour each agent issues a contract to the producers and consumers, all users choose among the contracts of the different agents, and at that moment each agent learns the contract prices of the other agents and how many producers and consumers have chosen its own contract. Based on the contract subscriptions and the other agents' contract prices, the agent then adjusts its contract price for the next time step to maximize its profit. Thus each hour serves as the basic unit of time: the agent interacts with the environment once and the users subscribe to contracts.
An imbalance in power supply and demand occurs when the producers and consumers subscribing to the agent's contracts require different amounts of power. In that case we do not cover the shortfall through the wholesale market; instead, a penalty fee is charged to the agent for causing the supply-demand imbalance. Next, we define this local market more precisely. First, for electricity prices, we set the price range to 0.01 to 0.20[23], with a minimum price step of 0.01. Each agent B_k (k = 1, 2, …, K) posts two bids at time t: a price offered to consumers, p_{t,C}^{Bk}, and a price offered to producers, p_{t,P}^{Bk}. In addition, at each time t agent B_k knows the number of consumers and producers subscribing to it, N_{t,C}^{Bk} and N_{t,P}^{Bk}. For convenience, we assume that each consumer consumes κ_{t,C} units of power at each time t and each producer generates κ_{t,P} units. Finally, we set the imbalance cost per unit of power at time t to φ_t. The reward of agent B_k at time t can then be computed as:
Figure PCTCN2018093753-APPB-000013
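The reward formula itself appears only as an image in the original document; the sketch below assumes the form implied by the definitions above — revenue from subscribed consumers, minus payments to subscribed producers, minus the imbalance penalty — and all function and variable names are ours.

```python
def broker_reward(p_consumer, p_producer, n_consumers, n_producers,
                  kappa_c, kappa_p, phi):
    """Reward of agent B_k at time t under the assumed form:
    consumer revenue - producer payments - imbalance penalty."""
    demand = n_consumers * kappa_c           # power sold to subscribed consumers
    supply = n_producers * kappa_p           # power bought from subscribed producers
    revenue = p_consumer * demand
    cost = p_producer * supply
    imbalance_penalty = phi * abs(demand - supply)
    return revenue - cost - imbalance_penalty

# example at the experimental scale: 1000 consumers x 10 units, 100 producers x 100 units
r = broker_reward(0.13, 0.10, 1000, 100, 10, 100, 0.1)
```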
thus, we have generally defined the basic operation of the local market. In the following, we first explain the definitions of the two status indicators. The first is the PriceRangeStatus (PRS) index to determine if the market is reasonable, and if the market is reasonable as seen by an agent, then it must satisfy:
Figure PCTCN2018093753-APPB-000014
wherein, muLIs a subjective value that indicates the agent's desire for marginal benefit in the market. At the same time, the user can select the desired position,
Figure PCTCN2018093753-APPB-000015
Figure PCTCN2018093753-APPB-000016
wherein, BLRepresenting the agent itself. The second indicator, PortfolioStatus (PS), indicates whether the agent itself has achieved supply-demand balance. Next, we set several actions operating on the price as a set of actions that all agents can choose to take.
A = {Maintain, Lower, Raise, Revert, Inline, MinMax}
Through these actions at time t, each agent sets its prices for the next time step, p_{t+1,C}^{Bk} and p_{t+1,P}^{Bk}:
● Maintain keeps the prices of the previous time step;
● Lower decreases both the producer and consumer prices at time t by 0.01;
● Raise increases both the producer and consumer prices at time t by 0.01;
● Revert moves each price by 0.01 towards the midpoint price,
Figure PCTCN2018093753-APPB-000019
● Inline sets the new producer and consumer prices to
Figure PCTCN2018093753-APPB-000020
Figure PCTCN2018093753-APPB-000021
● MinMax sets the new producer and consumer prices to
Figure PCTCN2018093753-APPB-000022
Figure PCTCN2018093753-APPB-000023
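A sketch of applying the price actions under the stated price range [0.01, 0.20] and step 0.01. Only the four fully specified actions are implemented; Inline and MinMax depend on competitor-price formulas that appear as images in the original, and taking Revert's midpoint as the midpoint of the agent's own two prices is our assumption.

```python
P_MIN, P_MAX, STEP = 0.01, 0.20, 0.01

def clamp(p):
    """Keep a price inside the allowed range and on the 0.01 grid."""
    return max(P_MIN, min(P_MAX, round(p, 2)))

def apply_action(action, p_c, p_p):
    """Return the next consumer/producer prices for the fully specified actions.
    Inline and MinMax need competitor prices (formulas are images in the original) and are omitted."""
    if action == "Maintain":
        return p_c, p_p
    if action == "Lower":
        return clamp(p_c - STEP), clamp(p_p - STEP)
    if action == "Raise":
        return clamp(p_c + STEP), clamp(p_p + STEP)
    if action == "Revert":
        mid = (p_c + p_p) / 2.0              # assumed midpoint of the agent's own two prices
        move = lambda p: clamp(p - STEP if p > mid else p + STEP if p < mid else p)
        return move(p_c), move(p_p)
    raise ValueError(f"unsupported or competitor-dependent action: {action}")

print(apply_action("Raise", 0.13, 0.10))     # -> (0.14, 0.11)
```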
To set up competitors for comparison and verification, we design several fixed-policy agents. The balancing strategy attempts to reduce the supply imbalance by adjusting the producer and consumer contract prices: it raises both contract prices when it sees excess demand and lowers both when it sees excess supply. The greedy strategy attempts to maximize profit by increasing its profit margin, i.e., widening the difference between the consumer and producer contract prices whenever the market is rational. Both strategies can be characterized as adaptive, in that they react to market and portfolio conditions, but they do not learn from the past. We also design two non-adaptive agents: a fixed-strategy agent that always maintains a certain price, and a random agent that randomly selects one of the six actions at each price adjustment.
TABLE 3-1 Balancing Algorithm
Figure PCTCN2018093753-APPB-000024
TABLE 3-2 greedy Algorithm
Figure PCTCN2018093753-APPB-000025
The first improvement changes the storage structure of Q-learning from a Q-table to a Q-network; the method is otherwise identical to Q-learning, i.e., the internal parameters of the storage structure are updated after each interaction with the environment. Later work also considers an experience replay mechanism.
TABLE 3-3Q-learning Algorithm Using Q-network
Figure PCTCN2018093753-APPB-000026
The first row of the algorithm initializes the parameters of the neural network. The second row indicates that the experiment runs for M episodes; the third row initializes the state at the beginning of each episode; the fifth and sixth rows select a random action with a certain probability and otherwise select the action with the largest action value in the current state. The seventh row executes the selected action, obtains the reward and enters the next state. The eighth row computes the standard value of the selected action in the current state, and the ninth row updates the parameters of the neural network according to the value computed in the eighth row, so that the stored Q(s_t, a_t) approaches y_t. Because the action values are stored in the network parameters, whenever a new state is entered we only need to feed its feature values into the neural network, and the action with the largest Q(s, a) value can be read off the output layer as the next action to execute.
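A compact sketch of the training loop of Table 3-3 as described above, with ε-greedy selection and a squared-error update towards y_t; the environment interface (reset/step returning a feature vector and a reward), the network width and the optimizer settings are placeholders of ours.

```python
import random
import torch
import torch.nn as nn

class DummyMarketEnv:
    """Stand-in environment: returns random 24-feature states and rewards."""
    def reset(self):
        return torch.randn(24).tolist()
    def step(self, action):
        return torch.randn(24).tolist(), random.random()

def train_q_network(env, n_features, n_actions, episodes=200, steps=240,
                    gamma=0.95, epsilon=0.1, lr=1e-3):
    q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))            # row 1: initialize parameters
    opt = torch.optim.RMSprop(q_net.parameters(), lr=lr)
    for episode in range(episodes):                            # row 2: M episodes
        s = torch.as_tensor(env.reset(), dtype=torch.float32)  # row 3: initial state
        for t in range(steps):
            if random.random() < epsilon:                      # rows 5-6: epsilon-greedy
                a = random.randrange(n_actions)
            else:
                a = int(q_net(s).argmax())
            s_next, r = env.step(a)                            # row 7: act, get reward, next state
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            with torch.no_grad():
                y_t = r + gamma * q_net(s_next).max()          # row 8: standard (target) value
            loss = (q_net(s)[a] - y_t) ** 2                    # row 9: move Q(s_t, a_t) towards y_t
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
    return q_net

q_net = train_q_network(DummyMarketEnv(), n_features=24, n_actions=6, episodes=2, steps=10)
```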
In addition to using a neural network to store the action values of reinforcement learning, the multi-agent-based intelligent negotiation algorithm also considers the more realistic situation in which there are multiple types of consumers, while small producers fall into the two cases of wind power and solar power. To study this general phenomenon, we modify the environment accordingly. First, consumers are divided into two types: ordinary users who consume no power at night, and users who consume power all day. Then, according to the actual situation, producers are divided into two types: wind generators, which can generate power all day, and solar generators, which generate power only in the daytime. There are thus four types of users in the grid environment, and the original approach in which the intelligent agent adjusts a single price uniformly is no longer applicable, so the agent is redesigned as a multi-agent system with one internal agent for each user category. Multiple agents can coordinate and cooperate with each other[24][25][26][27][28], and a multi-agent framework is better suited to a grid whose external environment is particularly complex. Under the original grid rules, contract prices can then be adjusted for each user category in a more targeted way to maximize profit.
However, although the multi-agent system appears externally as a single agent, internally there are four different agents, and how to ensure that these internal agents cooperate with each other deserves careful thought. To make the internal agents of the agent cooperate as much as possible and form a real group that competes with the other agents, we need to redesign the reward function of each internal agent, so that each agent's actions take into account not only the maximization of its own profit but also the benefits of the other internal agents. We redesign the reward function of each internal agent as:
Figure PCTCN2018093753-APPB-000027
where C denotes the category of the consumers and P denotes the category of the producers,
Figure PCTCN2018093753-APPB-000028
denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, and κ_{t,P} denotes the amount of power generated by a producer of a given category at time t, while
Figure PCTCN2018093753-APPB-000029
is the share of the imbalance charge attributed to the individual agent when computing its profit:
Figure PCTCN2018093753-APPB-000030
Further:
Figure PCTCN2018093753-APPB-000031
since a single agent can only buy or sell power from or to the producer, it is not good to measure its own profit directly, but we can consider its contribution to the total profit in reverse, i.e. the loss to the total profit without the agent buying or selling power is the agent's own profit. By considering from the overall relationship, we obtain the individual agents' own profit. Therefore, each intelligent agent can consider the whole benefits while considering the benefits of the intelligent agent when selecting the action through the newly designed reward function.
To verify the effectiveness of the agent framework in a complex environment, we simulate the multi-agent framework with real data: we introduce real 2013 household electricity consumption data from London[34] and select about 1000 users from it. First, publishing a single price to all consumers is not sufficient: although we only consider household users in the retail market, their power usage patterns differ because of different living habits and consumption attitudes. Using multiple internal agents to publish separate electricity prices for different groups of consumers can therefore better promote supply-demand balance. Here we group consumers according to their usage profiles. Since power consumption is time-series data, our agent clusters users with K-Means based on the Dynamic Time Warping (DTW) distance. After clustering, we obtain groups of users with similar electricity usage behavior. The agent architecture in the real-data simulation environment is shown in Fig. 3.
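A sketch of the clustering step, assuming the tslearn library (whose TimeSeriesKMeans supports a DTW metric); the data array here is a stand-in for the hourly London consumption profiles, and the cluster count follows the five classes used later in the experiments.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# stand-in data: one row per household, one column per hourly reading (here a daily profile)
consumption = np.random.rand(100, 24)

X = TimeSeriesScalerMeanVariance().fit_transform(consumption)   # normalize each series
km = TimeSeriesKMeans(n_clusters=5, metric="dtw", random_state=0)
labels = km.fit_predict(X)                                       # cluster label per household
```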
Second, since users' electricity consumption behavior changes from moment to moment, we use Long Short-Term Memory (LSTM)[35], a neural network unit structure with excellent performance on time series, to strengthen our network architecture and help the agent better extract temporal information from past market information so as to make effective decisions. The final neural network architecture used by our agent is shown in Fig. 4.
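A minimal sketch of an LSTM-based Q-network over a short window of past market observations; the layer sizes are illustrative and the exact architecture of Fig. 4 is not reproduced here.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Maps a sequence of past market observations to Q-values for the six price actions."""
    def __init__(self, n_features=24, hidden=64, n_actions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):                     # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])       # Q-values read from the last time step

net = RecurrentQNetwork()
window = torch.randn(1, 3, 24)                # e.g. a time-series state of length 3
q_values = net(window)                        # shape (1, 6)
```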
The experiment parameters include those in the method definition and those for running the experiments. There are five agents in total: the intelligent (reinforcement learning) agent, a balancing-strategy agent, a greedy-strategy agent, a fixed-price agent and a random-action agent. In the local grid market the number of consumers is set to 1000 and the number of producers to 100; each consumer consumes 10 basic power units per hour and each producer generates 100 basic power units per hour. The imbalance cost per unit of power is 0.1; note that the imbalance cost cannot be set too small, to prevent an agent from attracting consumer subscriptions at the lowest possible price without buying power from producers. Furthermore, considering that real users have a certain inertia in their subscriptions, we set the users' selection preference to {35, 30, 20, 10, 5}, meaning that 35% of the potential consumers select the subscription with the lowest contract price, 30% select the second lowest, and so on. Producers choose starting from the higher prices according to the same selection preference. We also set the initial prices per unit of electricity: the selling price of electricity is 0.13 and the purchase price is 0.1, and the subjective marginal benefit μ_L of the market is set to 0.02. The run length is set to 300 periods: the first 200 periods are learning periods, in which the agent is allowed to learn, and the last 100 periods are statistical periods, in which each agent's total profit is used as the final criterion of how competitive its algorithm is. Each period has 10 days with 24 hours per day, i.e., 240 basic time units per period. For Q-learning we use the ε-greedy strategy.
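For reference, the experimental settings above can be collected into a single configuration object; a sketch with field names of our choosing.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    n_consumers: int = 1000
    n_producers: int = 100
    consumer_kappa: float = 10.0       # power units consumed per hour
    producer_kappa: float = 100.0      # power units generated per hour
    imbalance_cost: float = 0.1
    selection_preference: tuple = (0.35, 0.30, 0.20, 0.10, 0.05)  # share choosing 1st..5th best price
    init_consumer_price: float = 0.13
    init_producer_price: float = 0.10
    mu_L: float = 0.02                 # subjective marginal benefit
    n_periods: int = 300               # 200 learning + 100 statistical periods
    hours_per_period: int = 240        # 10 days x 24 hours

config = ExperimentConfig()
```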
Experiments with the Q-network: for the design of the neural network, we set up a network with two hidden layers. The input layer receives the state; to make full use of the information given by the environment, the state features are designed as the contract prices of all agents at the previous time step and the number of users subscribed to the intelligent agent, plus the contract prices of all agents at the current time step and the number of users subscribed to the intelligent agent, giving 24 input units. The output layer has six output units, one for each of the six price actions, and the output value represents the expected cumulative reward of selecting that action in the input state and then following the policy. We use Xavier initialization and train the network parameters with the RMSProp gradient-descent algorithm. We average the total rewards over 20 repeated runs of the whole experiment to determine the final performance of the agent using the Q-network-based Q-learning algorithm, and to compare the advantages and disadvantages of storing action values with a Q-network against the previous agent that used a Q-table-based Q-learning algorithm. Note that the state of the agent using the Q-table follows the settings of previous work and is designed as the combination of the PRS and PS indicators at the previous and current time steps.
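A sketch of the described network — 24 inputs, two hidden layers, six outputs, Xavier initialization, trained with RMSProp; the hidden-layer widths are our assumption since the text does not give them.

```python
import torch
import torch.nn as nn

def build_q_network(n_inputs=24, n_actions=6, hidden=(64, 64)):
    layers, last = [], n_inputs
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ReLU()]
        last = h
    layers.append(nn.Linear(last, n_actions))
    net = nn.Sequential(*layers)
    for m in net:
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)   # Xavier initialization
            nn.init.zeros_(m.bias)
    return net

q_net = build_q_network()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
```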
TABLE 4-1 average rewards for each of 20 trials of all agencies
Figure PCTCN2018093753-APPB-000032
TABLE 4-2 average rewards for each of 20 experiments
Figure PCTCN2018093753-APPB-000033
From the above two tables we can see that the agents using Q-learning are clearly more competitive than the other strategies. The greedy strategy is the only agent other than the reinforcement learning agents with a positive total profit; the fixed-strategy agent has the lowest total profit and is easily beaten because its prices never change; and the balancing and greedy agents rank second and third in total profit, which shows the advantage of adaptive strategies, while the large lead of the reinforcement learning agents shows the advantage of learning from the past. Storing past experience with a Q-network works better than storing it with a Q-table, which shows the importance of a more accurate state representation. As can also be seen from the figure, the revenue of the agent using the Q-network is more stable, staying around 1,500,000, while the revenue of the agent using the Q-table fluctuates more and is relatively unstable, as shown in Fig. 5.
Experiments with multiple agents: we also perform multi-agent experiments, which first require modifying part of the grid environment configuration. Since we now have two groups of producers and two groups of consumers, to stay as close as possible to the original experimental parameters we set the number of daytime-only consumers to 500 and the number of all-day consumers to 500. The hourly consumption profile of a daytime-only consumer over a day is {0,0,0,0,0,0, 10,10,10,10,10,10,10,10,10,10,10,10, 0,0,0,0,0,0}, i.e., no power is used in the first six and last six hours of the day, while an all-day consumer uses 10 basic power units every hour. The number of wind producers is set to 50 and the number of solar producers to 50; wind generation runs all day at 100 basic power units per hour, while the daily generation profile of solar power is {0,0,0,0,0,0, 100,100,100,100,100,100,100,100,100,100,100,100, 0,0,0,0,0,0}. We also slightly adjust the selection preferences of the different kinds of producers and consumers. In this experiment, for a better comparison with previous work, we use a Q-table as the structure for storing action values, while the input state adds, compared with previous work, a feature indicating whether the current time is day or night.
Table 4-3 selection preferences of different kinds of users
Figure PCTCN2018093753-APPB-000034
Figure PCTCN2018093753-APPB-000035
In this experiment, because the external conditions have changed, in order to show that the "external competition, internal cooperation" multi-agent design outperforms the original single intelligent agent, we put the multi-agent and single-agent versions into the same experiment to compete. The experiment is run repeatedly to obtain the following results.
TABLE 4-4 average profits from 20 runs in various user environments
Figure PCTCN2018093753-APPB-000036
From the above table and the curves in Fig. 6, it can be seen that the multi-agent and single-agent versions of the intelligent agent dominate the competition absolutely, and the multi-agent version beats the single-agent version in every round, which shows that the multi-agent design is more attuned to the market and more competitive than the single-agent design. Meanwhile, the total revenue of every agent is greatly improved compared with the experiment of Section 4.2: with the fixed-price agent of the original experiment removed, and with the multi-agent intelligent agent added that adapts to the environment and adjusts prices to balance supply and demand, the whole market is dominated by agents capable of balancing supply and demand. Compared with the previous section, the supply-demand balance of the whole smart grid market is better maintained, the imbalance costs of the agents are greatly reduced, and the overall revenue level rises.
To verify that the "internal cooperation" design concept works, we run experiments in which the reward of each internal agent is set either to the overall reward of the whole agent
Figure PCTCN2018093753-APPB-000037
or to the reward function we designed
Figure PCTCN2018093753-APPB-000038
After 10 runs of each setting, the average total revenue obtained with the designed individual reward function is 23.37% higher than that obtained by directly using the overall reward function, which shows that the individual reward function works better than the overall one: each intelligent agent considers both its own interest and the overall interest, and maximizing its own interest on the basis of the overall interest is more flexible and more targeted than considering the overall interest alone.
[Amended 03.01.2019 under Rule 91] Experiments with the multi-agent design under real-data simulation: first, after data cleaning, the users are clustered and, guided by other empirical data, divided into 5 classes, with class sizes {215; 97; 317; 274; 79}. The resulting electricity usage curves are shown in Fig. 7 and Figs. 9-13.
It can be seen from the figures that the electricity usage curves of the different user classes differ greatly, which poses a great challenge to our agent. In addition, to model users more realistically, we model the users' selection behavior in the grid: in line with the general stickiness of grid users, each user is assigned a random psychological price within a certain range. When the contract price the user last signed is better than the user's psychological expectation, the user chooses to continue with it; otherwise, the user selects a power contract with a certain probability according to a re-ranking of the prices on offer.
TABLE 4-1 user Electricity price selection model
Figure PCTCN2018093753-APPB-000039
We select February 2013 from the London household electricity data as the consumers' consumption data, and to keep overall electricity usage balanced the two types of producers each take on half of the generation task. Although the overall supply and demand of the system are balanced, it is very difficult for the agent to keep its own supply and demand balanced at every moment, because each consumer's consumption behavior differs and each user's selection behavior also differs. The users' psychological prices are drawn randomly from the range [0.10, 0.15], the training period is 50, the evaluation period is 10, and the length of the time-series state is 3. The revenue over the final evaluation period is shown in Fig. 8.
We have discussed the pricing problem of retail agents in the smart grid retail market. We first applied DRL to retail agent design to address the problem of an infinite state space, and used LSTM and a DTW-based clustering mechanism to strengthen our agent so that it applies better in practical environments. By clustering customers, we designed a cooperative multi-agent deep reinforcement learning agent framework with a dedicated incentive function. Finally, by introducing household electricity consumption data from London, we verified the adaptability and strong competitiveness of the agent framework in a complex environment. As future work, we will explore applying more advanced DRL techniques (e.g., actor-critic algorithms) to our retail agent design to produce more efficient pricing strategies.
In addition, by considering real small-scale generation data and household power storage devices, the agent mechanism can be further generalized to a more realistic smart grid. Load forecasting for the power grid is also a popular research topic[29][30][31][32]; in future work we will start from this aspect, accurately classify and model the users, let the agent automatically identify a user's category, and then let the corresponding internal agent manage that user.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

  1. A multi-agent deep reinforcement learning agent method based on a smart grid is characterized by comprising the following steps:
    s1, calculating the corresponding action standard value under the current state according to the reward obtained by the selected action, and updating the parameters of the neural network;
    s2, establishing a multi-agent of 'external competition and internal cooperation' according to the types of the consumers and the producers;
    s3, setting the reward function of each internal agent according to the profit maximization of the agent's own actions and the benefits of the other internal agents, the formula of which is:
    Figure PCTCN2018093753-APPB-100001
    wherein C represents the category of the consumers and P represents the category of the producers,
    Figure PCTCN2018093753-APPB-100002
    denotes an internal agent of agent B_k, i ∈ {C1, C2, P1, P2}, κ_{t,C} denotes the amount of power consumed by a consumer of a given category at time t, κ_{t,P} denotes the amount of power generated by a producer of a given category at time t,
    Figure PCTCN2018093753-APPB-100003
    is the share of the imbalance charge attributed to the individual internal agent when computing its profit.
  2. The multi-agent deep reinforcement learning agent method as claimed in claim 1, wherein said step S1 further comprises the steps of:
    S11, initializing the parameters of the neural network;
    S12, initializing the state value at the beginning of each period of the operation cycle;
    S13, selecting an action at random with a certain probability, or otherwise selecting the action with the maximum value in the current state;
    S14, executing the selected action, obtaining the reward, and entering the next state;
    S15, calculating the standard (target) value y_t corresponding to the current state and updating the neural network parameters so that the stored Q(s_t, a_t) approaches y_t.
  3. The multi-agent deep reinforcement learning agent method as claimed in claim 2, wherein the action values are stored in the network parameters in step S15; each time a new state is entered, its feature values only need to be input into the neural network in sequence, and the action with the maximum Q(s, a) value can be selected from the output layer of the neural network as the next action to execute.
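A minimal sketch of the loop in steps S11-S15 together with the greedy selection of claim 3, written as a plain deep Q-learning update in PyTorch; the environment interface, the network size, and all hyper-parameters are assumptions made for illustration rather than the patented implementation.

```python
import random
import torch
import torch.nn as nn

n_features, n_actions = 3, 5            # e.g. a time-series state of length 3 and a discrete tariff menu
gamma, epsilon, lr = 0.95, 0.1, 1e-3    # assumed hyper-parameters

# S11: initialize the parameters of the neural network approximating Q(s, a).
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
loss_fn = nn.MSELoss()

def train_one_period(env, steps_per_period=24):
    """Run one operation period; `env` is a hypothetical market simulation with reset()/step()."""
    state = torch.tensor(env.reset(), dtype=torch.float32)          # S12: initial state of the period
    for _ in range(steps_per_period):
        if random.random() < epsilon:                                # S13: explore with some probability...
            action = random.randrange(n_actions)
        else:                                                        # ...otherwise take the max-value action
            action = int(q_net(state).argmax())
        next_obs, reward, done = env.step(action)                    # S14: execute the action, obtain the reward
        next_state = torch.tensor(next_obs, dtype=torch.float32)
        with torch.no_grad():                                        # S15: standard (target) value y_t
            y_t = reward if done else reward + gamma * q_net(next_state).max().item()
        q_sa = q_net(state)[action]                                  # stored Q(s_t, a_t)
        loss = loss_fn(q_sa, torch.tensor(float(y_t)))               # pull Q(s_t, a_t) toward y_t
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
        if done:
            break
```

Claim 3's point is that no explicit table of action values is kept: feeding the current state's features through the network yields all Q(s, a) values at once, and the argmax over the output layer is taken as the next action.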
  4. The multi-agent deep reinforcement learning agent method as claimed in claim 3, wherein said step S2 comprises the steps of:
    S21, classifying the consumers according to their differences in power consumption;
    and S22, classifying the producers according to the actual power generation situation.
  5. The multi-agent deep reinforcement learning agent method as claimed in claim 4, wherein, through the reward function in step S3, each agent considers both its own benefit and the overall benefit when selecting an action.
  6. The multi-agent deep reinforcement learning agent method as claimed in claim 5, wherein the consumers are divided into daytime-consumption users and all-day-consumption users according to their power consumption patterns.
  7. The multi-agent deep reinforcement learning agent method as claimed in claim 6, wherein the producers are divided into all-day generators and daytime generators according to the actual power generation situation.
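As an illustration of the split in claims 6 and 7, a consumer's hourly profile could be classified by its share of night-time consumption; the night-hour window and threshold below are assumptions, and producers could be split analogously by whether they generate around the clock or only during the day.

```python
import numpy as np

def classify_consumer(hourly_profile, night_hours=range(0, 6), night_share_threshold=0.15):
    """Illustrative rule: users with a non-negligible share of night-time consumption are
    treated as all-day consumers, otherwise as daytime consumers (threshold is an assumption)."""
    profile = np.asarray(hourly_profile, dtype=float)
    night_share = profile[list(night_hours)].sum() / max(profile.sum(), 1e-9)
    return "all_day" if night_share >= night_share_threshold else "daytime"
```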
CN201880000858.4A 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid Active CN110945542B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/093753 WO2020000399A1 (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning proxy method based on intelligent grid

Publications (2)

Publication Number Publication Date
CN110945542A true CN110945542A (en) 2020-03-31
CN110945542B CN110945542B (en) 2023-05-05

Family

ID=68984589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880000858.4A Active CN110945542B (en) 2018-06-29 2018-06-29 Multi-agent deep reinforcement learning agent method based on smart grid

Country Status (2)

Country Link
CN (1) CN110945542B (en)
WO (1) WO2020000399A1 (en)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369108A (en) * 2020-02-20 2020-07-03 华中科技大学鄂州工业技术研究院 Power grid real-time pricing method and device
CN111709706B (en) * 2020-06-09 2023-08-04 国网安徽省电力有限公司安庆供电公司 Automatic generation method of new equipment starting scheme based on self-adaptive pattern recognition
CN111817349B (en) * 2020-07-31 2023-08-25 三峡大学 Multi-micro-grid passive off-grid switching control method based on deep Q learning
CN112446470B (en) * 2020-11-12 2024-05-28 北京工业大学 Reinforced learning method for coherent synthesis
CN112819144B (en) * 2021-02-20 2024-02-13 厦门吉比特网络技术股份有限公司 Method for improving convergence and training speed of neural network with multiple agents
US11817704B2 (en) * 2021-02-23 2023-11-14 Distro Energy B.V. Transparent customizable and transferrable intelligent trading agent
CN112884129B (en) * 2021-03-10 2023-07-18 中国人民解放军军事科学院国防科技创新研究院 Multi-step rule extraction method, device and storage medium based on teaching data
CN113469839A (en) * 2021-06-30 2021-10-01 国网上海市电力公司 Smart park optimization strategy based on deep reinforcement learning
CN113570039B (en) * 2021-07-22 2024-02-06 同济大学 Block chain system based on reinforcement learning optimization consensus
CN113555870B (en) * 2021-07-26 2023-10-13 国网江苏省电力有限公司南通供电分公司 Q-learning photovoltaic prediction-based power distribution network multi-time scale optimal scheduling method
CN113687960B (en) * 2021-08-12 2023-09-29 华东师范大学 Edge computing intelligent caching method based on deep reinforcement learning
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114329936B (en) * 2021-12-22 2024-03-29 太原理工大学 Virtual fully-mechanized production system deduction method based on multi-agent deep reinforcement learning
CN114362221B (en) * 2022-01-17 2023-10-13 河海大学 Regional intelligent power grid partition evaluation method based on deep reinforcement learning
CN114881688B (en) * 2022-04-25 2023-09-22 四川大学 Intelligent pricing method for power distribution network considering distributed resource interaction response
CN115065728B (en) * 2022-06-13 2023-12-08 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115310999B (en) * 2022-06-27 2024-02-02 国网江苏省电力有限公司苏州供电分公司 Enterprise electricity behavior analysis method and system based on multi-layer perceptron and sequencing network
US11956138B1 (en) 2023-04-26 2024-04-09 International Business Machines Corporation Automated detection of network anomalies and generation of optimized anomaly-alleviating incentives
CN116912356B (en) * 2023-09-13 2024-01-09 深圳大学 Hexagonal set visualization method and related device
CN117648123A (en) * 2024-01-30 2024-03-05 中国人民解放军国防科技大学 Micro-service rapid integration method, system, equipment and storage medium
CN117808174B (en) * 2024-03-01 2024-05-28 山东大学 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107067190A (en) * 2017-05-18 2017-08-18 厦门大学 The micro-capacitance sensor power trade method learnt based on deeply
CN107179077A (en) * 2017-05-15 2017-09-19 北京航空航天大学 A kind of self-adaptive visual air navigation aid based on ELM LRF
CN107623337A (en) * 2017-09-26 2018-01-23 武汉大学 A kind of energy management method for micro-grid

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332373A1 (en) * 2009-02-26 2010-12-30 Jason Crabtree System and method for participation in energy-related markets
CN102622269B (en) * 2012-03-15 2014-06-04 广西大学 Java agent development (JADE)-based intelligent power grid power generation dispatching multi-Agent system
CN105022021B (en) * 2015-07-08 2018-04-17 国家电网公司 A kind of state identification method of the Electric Energy Tariff Point Metering Device based on multiple agent
CN105550946A (en) * 2016-01-28 2016-05-04 东北电力大学 Multi-agent based electricity utilization strategy capable of enabling residential users to participate in automated demand response


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639756A (en) * 2020-06-12 2020-09-08 南京大学 Multi-agent reinforcement learning method based on game reduction
CN112215350A (en) * 2020-09-17 2021-01-12 天津(滨海)人工智能军民融合创新中心 Smart agent control method and device based on reinforcement learning
CN112215350B (en) * 2020-09-17 2023-11-03 天津(滨海)人工智能军民融合创新中心 Method and device for controlling agent based on reinforcement learning
CN111967199B (en) * 2020-09-23 2022-08-05 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN111967199A (en) * 2020-09-23 2020-11-20 浙江大学 Agent contribution distribution method under reinforcement learning multi-agent cooperation task
CN112286203A (en) * 2020-11-11 2021-01-29 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112286203B (en) * 2020-11-11 2021-10-15 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN114619907A (en) * 2020-12-14 2022-06-14 中国科学技术大学 Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN114619907B (en) * 2020-12-14 2023-10-20 中国科学技术大学 Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN114123178A (en) * 2021-11-17 2022-03-01 哈尔滨工程大学 Intelligent power grid partition network reconstruction method based on multi-agent reinforcement learning
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
WO2023103763A1 (en) * 2021-12-09 2023-06-15 Huawei Technologies Co., Ltd. Methods, systems and computer program products for protecting a deep reinforcement learning agent
CN116599061A (en) * 2023-07-18 2023-08-15 国网浙江省电力有限公司宁波供电公司 Power grid operation control method based on reinforcement learning
CN116599061B (en) * 2023-07-18 2023-10-24 国网浙江省电力有限公司宁波供电公司 Power grid operation control method based on reinforcement learning

Also Published As

Publication number Publication date
CN110945542B (en) 2023-05-05
WO2020000399A1 (en) 2020-01-02

Similar Documents

Publication Publication Date Title
CN110945542B (en) Multi-agent deep reinforcement learning agent method based on smart grid
Al Mamun et al. A comprehensive review of the load forecasting techniques using single and hybrid predictive models
Antonopoulos et al. Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review
Chen et al. Trading strategy optimization for a prosumer in continuous double auction-based peer-to-peer market: A prediction-integration model
Peters et al. A reinforcement learning approach to autonomous decision-making in smart electricity markets
Yang et al. Recurrent deep multiagent q-learning for autonomous brokers in smart grid.
Gao et al. A multiagent competitive bidding strategy in a pool-based electricity market with price-maker participants of WPPs and EV aggregators
Chen et al. Customized rebate pricing mechanism for virtual power plants using a hierarchical game and reinforcement learning approach
Chuang et al. Deep reinforcement learning based pricing strategy of aggregators considering renewable energy
Ghosh et al. VidyutVanika: A reinforcement learning based broker agent for a power trading competition
Gao et al. Bounded rationality based multi-VPP trading in local energy markets: a dynamic game approach with different trading targets
Lincoln et al. Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade
Wang et al. A hybrid-learning based broker model for strategic power trading in smart grid markets
Ruan et al. Graph deep learning-based retail dynamic pricing for demand response
Qian et al. Automatically improved VCG mechanism for local energy markets via deep learning
Wang et al. Multi-agent simulation for strategic bidding in electricity markets using reinforcement learning
Okwuibe et al. Intelligent bidding strategies for prosumers in local energy markets based on reinforcement learning
Wu et al. Peer-to-peer energy trading optimization for community prosumers considering carbon cap-and-trade
Peters et al. Autonomous data-driven decision-making in smart electricity markets
Kell et al. Machine learning applications for electricity market agent-based models: A systematic literature review
Kell et al. A systematic literature review on machine learning for electricity market agent-based models
Umarov et al. The cognitive model and its implementation of the enterprise Uzmobile
Xu et al. Deep reinforcement learning for competitive DER pricing problem of virtual power plants
Wang Conjectural variation-based bidding strategies with Q-learning in electricity markets
Sunny et al. A Comprehensive Review of the Load Forecasting Techniques Using Single and Hybrid Predictive Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant