CN115438873A - Power dispatching method based on block chain and deep reinforcement learning - Google Patents

Power dispatching method based on block chain and deep reinforcement learning

Info

Publication number
CN115438873A
CN115438873A (application CN202211167510.9A)
Authority
CN
China
Prior art keywords
power
user
block chain
scheduling
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211167510.9A
Other languages
Chinese (zh)
Inventor
李先贤
周梁昊杰
刘鹏
李东城
陈柠天
霍浩
王博仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN202211167510.9A (priority date 2022-09-23)
Publication of CN115438873A (2022-12-06)
Withdrawn legal status (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0206 Price or cost determination based on market factors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J3/28 Arrangements for balancing of the load in a network by storage of energy
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38 Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46 Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/466 Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand

Abstract

The invention discloses a power dispatching method based on blockchain and deep reinforcement learning, which comprises the following steps: step one, user registration; step two, collecting data, encrypting it and uploading it to the blockchain; step three, setting the DRL state space and action space; step four, setting the DRL reward function R(t) and a user constraint and penalty mechanism; step five, obtaining a prediction result through DRL training based on an improved DQN and declaring power to the grid department; step six, updating the reputation value; step seven, the grid department encrypts the scheduling information with the applying user's public key and uploads it to the blockchain for storage, with the user reputation value likewise stored on the chain; step eight, completing power dispatching based on the reputation value. The method performs fusion management of big data collected on the basis of blockchain technology, realizes aggregation and sharing of data from different sources, balances power supply and demand, and improves the energy management system.

Description

Power dispatching method based on block chain and deep reinforcement learning
Technical Field
The invention relates to the technical field of deep reinforcement learning and blockchain, and in particular to a power dispatching method based on blockchain and deep reinforcement learning.
Background
The smart grid is the intelligent form of the power grid, also called "Grid 2.0". It is built on an integrated, high-speed, two-way communication network, and through the application of advanced sensing and measurement technology, advanced equipment technology, advanced control methods and advanced decision-support systems it achieves reliable, safe, economical, efficient and environmentally friendly grid operation and safe electricity use. Its main characteristics include self-healing, incentivizing and protecting users, resisting attacks, providing electric energy of a quality that meets user requirements, permitting the access of various forms of generation, starting up the power market, and optimized, efficient operation of assets.
With the development of the smart grid concept, integrating green and renewable energy has become a new vision for the traditional power grid. As the traditional grid transitions from centralized to distributed operation, the excellent properties of blockchain, such as decentralization, trustworthiness and traceability, fit the construction of the smart grid very well, enabling mutual trust between different user entities in a distributed environment and, on that basis, operations such as energy trading.
In the narrow sense, a blockchain is a decentralized shared ledger that combines data blocks into a specific chained data structure in chronological order and is cryptographically guaranteed to be tamper-proof and unforgeable; it can securely store simple, sequential, verifiable data in a system. In the broad sense, blockchain technology is a new decentralized infrastructure and distributed computing paradigm that uses a cryptographic chained block structure to verify and store data, a distributed-node consensus algorithm to generate and update data, and automated script code (smart contracts) to program and manipulate data. The core advantage of blockchain technology is decentralization: by means of data encryption, timestamps, distributed consensus and economic incentives, it enables point-to-point transactions, coordination and cooperation based on decentralized credit in distributed systems whose nodes need not trust one another, providing a solution to the high cost, low efficiency and insecure data storage that commonly afflict centralized institutions.
However, P2P energy trading also faces challenges: most distributed generation equipment relies on wind power, solar energy and the like, is easily affected by weather, and suffers from randomness and uncertainty in generated power. In short, the precondition for completing a power transaction is that users know their own power load and whether they have surplus power to trade, and the conventional approach is to predict this with deep learning. Existing deep learning networks, however, suffer from weak feature-extraction capability, low accuracy, and easy loss of long-term dependency information. The idea of reinforcement learning is therefore introduced, which can dynamically learn the periodic characteristics of the load and capture long-term dependencies more effectively.
The technical background mainly stems from the following two parts:
1. Blockchain (Block Chain):
A blockchain is a chain formed of blocks. Each block stores certain information, and the blocks are linked into a chain in the order in which they were generated. The servers, referred to as nodes in the blockchain system, provide storage space and computational support for the whole system. To modify information in the blockchain, more than half of the nodes must agree and the information in all nodes must be modified; since the nodes are usually held by different parties, information in the blockchain is extremely difficult to tamper with. Compared with traditional networks, the blockchain has two core characteristics: first, data is difficult to tamper with; second, it is decentralized. Based on these two characteristics, the information recorded on a blockchain is more authentic and reliable, which can solve the problem of mutual distrust.
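As an illustrative, non-limiting sketch of the hash-linking that makes this tampering so difficult, the following Python fragment (the field names and toy records are hypothetical, not taken from the patent) chains two blocks by embedding each block's hash in its successor:

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    """Hash the block's contents deterministically (sorted keys)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def make_block(prev_hash: str, records: list) -> dict:
    """Create a block that embeds the previous block's hash, so that
    altering any earlier block invalidates every later one."""
    block = {"timestamp": time.time(), "prev_hash": prev_hash, "records": records}
    block["hash"] = block_hash(block)
    return block

# A two-block toy chain: rewriting the first block would change its hash
# and break the prev_hash link stored in the second.
genesis = make_block(prev_hash="0" * 64, records=["user registration"])
following = make_block(prev_hash=genesis["hash"], records=["meter reading"])
assert following["prev_hash"] == genesis["hash"]
```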
2. Deep Reinforcement Learning (DRL): Deep Learning (DL) has strong perception capability but limited decision-making capability, while Reinforcement Learning (RL) has decision-making capability but is weak at perception. Combining the two makes their advantages complementary and provides a solution to the perception-decision problem of complex systems. Depending on the optimization objective, model-free RL can be roughly divided into two categories: value-based RL and policy-based RL; representative algorithms include temporal-difference learning, Q-learning, SARSA and others.
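As a minimal illustration of the value-based family just named, the following tabular Q-learning sketch shows the temporal-difference update that DQN later approximates with a neural network; the states, actions and hyperparameter values here are placeholders rather than the patent's market model:

```python
import random
from collections import defaultdict

# Q(s, a) += lr * (r + gamma * max_a' Q(s', a') - Q(s, a))
Q = defaultdict(float)
ACTIONS = ["charge", "discharge", "hold"]   # placeholder action set
lr, gamma, epsilon = 0.1, 0.99, 0.1         # illustrative hyperparameters

def select_action(state) -> str:
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state) -> None:
    """One temporal-difference step toward the bootstrapped target."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    Q[(state, action)] += lr * (td_target - Q[(state, action)])
```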
Disclosure of Invention
The invention aims to solve the problems of excessive load peaks and troughs and local power imbalance caused by the instability of green-energy generation, and provides a power dispatching method based on blockchain and deep reinforcement learning. The method performs fusion management of big data collected on the basis of blockchain technology, realizes aggregation and sharing of data from different sources, balances power supply and demand, and improves the energy management system.
The technical scheme for realizing the purpose of the invention is as follows:
a power dispatching method based on a block chain and deep reinforcement learning comprises the following steps:
step one, a user registers an account and adds the equipment to be managed, obtaining a user id and equipment ids; the grid department, acting as a trusted and secure certificate authority, distributes public/private key pairs to users;
step two, the smart meter submits the power parameters of the registered equipment, encrypts them with the user's public key and uploads them to the chain in the following data format:
<Uid, ID, Pc, P, Se, C, V>,
where Uid is the user id, ID is the equipment id, Pc is the user public key, P is the equipment's generated power, Se is the equipment state, C is the equipment current and V is the equipment voltage;
data analysis is performed on the power data, processing the user's data for each stage of the power market; the on-chain data, after filtering, de-duplication, error-correction and discretization preprocessing, serves as training samples for the subsequent network;
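A minimal sketch of assembling such an uplink record follows; the field names mirror the tuple above, while the JSON serialization and the encrypt_with_public_key stub are assumptions (a real deployment would substitute an actual asymmetric encryption call such as RSA-OAEP or ECIES):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MeterRecord:
    Uid: str   # user id
    ID: str    # equipment id
    Pc: str    # user public key
    P: float   # generated power
    Se: str    # equipment state
    C: float   # equipment current
    V: float   # equipment voltage

def encrypt_with_public_key(plaintext: bytes, public_key: str) -> bytes:
    # Stand-in only: tags the payload instead of encrypting it; a real
    # deployment would call an asymmetric encryption scheme here.
    return b"ENC[" + public_key.encode("utf-8") + b"]" + plaintext

def build_uplink_payload(rec: MeterRecord) -> bytes:
    """Serialize the <Uid, ID, Pc, P, Se, C, V> tuple and encrypt it
    under the user's public key before uploading to the chain."""
    plaintext = json.dumps(asdict(rec)).encode("utf-8")
    return encrypt_with_public_key(plaintext, rec.Pc)
```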
step three, setting the DRL state space and action space:
the state space S consists of the real-time load power P_load(t) at time t, the time-of-use electricity price TOU(t) at time t, and the state of charge SOC(t) of the energy storage device at time t;
where SOC(t) is defined as:
SOC(t) = SOC(t-1) + P_load(t)·Δt / E_b,
where E_b is the maximum capacity of the energy storage device;
the state space S is defined as:
S = {P_load(t), TOU(t), SOC(t)};
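A brief sketch of this state construction follows, using the SOC recurrence above; the clipping of SOC to [0, 1] and all numeric values are our assumptions, not stated in the patent:

```python
from dataclasses import dataclass

@dataclass
class GridState:
    """DRL state S = {P_load(t), TOU(t), SOC(t)} as defined above."""
    p_load: float  # real-time load power at t
    tou: float     # time-of-use electricity price at t
    soc: float     # state of charge of the storage device at t

def next_soc(soc_prev: float, p_load: float, dt: float, e_b: float) -> float:
    """SOC(t) = SOC(t-1) + P_load(t) * dt / E_b, clipped to [0, 1]
    (the clipping is an assumption)."""
    return min(1.0, max(0.0, soc_prev + p_load * dt / e_b))

state = GridState(p_load=3.2, tou=0.52, soc=next_soc(0.40, 3.2, 0.25, 10.0))
```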
the action space Act is defined by a formula presented as an image in the original publication; the action is the load grade predicted for the next time period, obtained by discretizing the continuous load through a procedure whose formulas are likewise presented as images in the original publication. In those formulas, P_min denotes the minimum permissible load prediction, P_max the maximum permissible load prediction, and a further symbol denotes the mean value; the load grade is selected by the agent from among A-Z;
B(t) > 0 indicates that the user has surplus electric energy and the energy storage device is charged with quantity B(t); B(t) < 0 indicates that the user has a load shortfall and must apply to the grid for power dispatching, the quantity to be dispatched being B(t);
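Since the exact discretization formulas are published only as images, the following sketch assumes uniform-width bins over [P_min, P_max] mapped onto the 26 grades A-Z; it illustrates the grading step rather than reproducing the patent's formula:

```python
import string

def load_grade(p_pred: float, p_min: float, p_max: float) -> str:
    """Map a continuous load prediction onto one of the grades A-Z,
    assuming 26 uniform-width bins over [P_min, P_max]."""
    p_clamped = min(max(p_pred, p_min), p_max)
    width = (p_max - p_min) / 26
    idx = min(int((p_clamped - p_min) / width), 25)
    return string.ascii_uppercase[idx]

def grade_midpoint(grade: str, p_min: float, p_max: float) -> float:
    """Inverse map: a grade letter back to a representative load value."""
    idx = string.ascii_uppercase.index(grade)
    width = (p_max - p_min) / 26
    return p_min + (idx + 0.5) * width
```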
step four, setting the DRL reward function R(t) and a user constraint and penalty mechanism:
in the learning stage, the deep reinforcement learning algorithm must determine the update direction and magnitude of the controller parameters according to the reward value returned by the external environment; in the market environment, the goal of the optimized control is to minimize the long-term electricity-purchase cost of electricity-buying users and to reduce the operating cost of existing equipment:
(I) P is the penalty cost, mainly reflecting the degree to which a user violates the operation constraints; to prevent malicious users from submitting extreme load prediction values in pursuit of profit, the operation constraints are:
(1) load capacity limit constraint (formula presented as an image in the original publication);
(2) deviation electric quantity constraint (formula presented as an image in the original publication);
where P_allow denotes the maximum permitted deviation;
in the optimization process, if a user violates a constraint condition, the user pays a penalty according to the degree of the violation and the reward is reduced; the penalty cost is calculated as:
P = P_1 + P_2 + P_3,
where the expressions for P_1, P_2 and P_3 are presented as images in the original publication and ρ and δ are the corresponding penalty coefficients;
(II) operating cost:
f_g(t) = TOU(t)·Δt·(P_load - B_store),
where B_store denotes the energy stored in the energy storage device; if the user has no energy storage device, B_store is 0;
(III) reward function: with the user owning N devices, the reward R(t) is given by a formula presented as an image in the original publication;
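A hedged sketch of this cost-and-penalty structure follows; because the P_1, P_2, P_3 and R(t) expressions are published as images, the linear penalty and the negated cost-plus-penalty reward below are assumptions consistent only with the stated cost-minimization objective:

```python
def operating_cost(tou: float, dt: float, p_load: float, b_store: float) -> float:
    """f_g(t) = TOU(t) * dt * (P_load - B_store); B_store is 0 for a
    user without an energy storage device."""
    return tou * dt * (p_load - b_store)

def penalty(violation: float, rho: float = 1.0) -> float:
    """Penalty grows with the degree of constraint violation; a linear
    form with coefficient rho is assumed."""
    return rho * max(0.0, violation)

def reward(costs: list[float], penalties: list[float]) -> float:
    """R(t) aggregates over the user's N devices; the negated sum of
    costs and penalties is our assumed form."""
    return -(sum(costs) + sum(penalties))
```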
step five, DRL training based on the improved DQN:
(1) experience accumulation: in the power-market environment, the evaluation network outputs the Q values of all actions in the action space Act according to the user state S_t; an action a_t is selected according to a greedy strategy, the market feeds back the reward r_t, and the next state S_{t+1} is obtained, yielding a complete Markov tuple (S_t, a_t, r_t, S_{t+1}); the tuple is placed in the experience pool as a sample, and this is repeated until the number of samples reaches the configured size of the experience pool;
(2) updating the parameters of the Q function: samples are drawn from the experience pool by a prioritized sampling algorithm based on the TD-error of each sample, so that samples with larger TD-error are selected with higher probability for training;
(3) training the neural network: a loss function L of the neural network is constructed for training; after every N complete training rounds of the evaluation network, its parameters are copied in full to the target network;
(4) optimizing parameters: if the benefit obtained by the controller is no longer increasing and remains stable over a long period, the evaluation network parameters have converged; otherwise steps (1) to (4) are repeated;
(5) the DRL outputs a power operation execution command in the following format:
<Uid, Pk, ID, Op, Qty>,
where Uid is the user id, Pk is the user public key, ID is the equipment id, Op is the operation on the equipment (applying for dispatching or storing electric quantity), and Qty is the predicted electric quantity;
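The following PyTorch sketch ties sub-steps (1) to (4) together under stated assumptions: a small MLP stands in for the unspecified network architecture, a TD-error-weighted draw over the whole pool replaces a production sum-tree, and every hyperparameter value is illustrative:

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 3, 26             # {P_load, TOU, SOC}; grades A-Z
GAMMA, EPS, SYNC_EVERY = 0.99, 0.1, 200  # assumed hyperparameters

def make_net() -> nn.Module:
    # Small MLP Q-network; the patent does not fix an architecture.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

eval_net, target_net = make_net(), make_net()
target_net.load_state_dict(eval_net.state_dict())
optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-3)
pool: deque = deque(maxlen=10_000)       # experience pool of (s, a, r, s')

def act(state: torch.Tensor) -> int:
    """Greedy on the evaluation net's Q values, with epsilon exploration."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(eval_net(state).argmax())

def sample_prioritized(batch_size: int) -> list:
    """Draw transitions with probability proportional to TD-error, so
    large-error samples are replayed with higher probability."""
    with torch.no_grad():
        errors = [abs(float(r + GAMMA * target_net(s2).max()
                            - eval_net(s)[a])) + 1e-3
                  for s, a, r, s2 in pool]
    total = sum(errors)
    return random.choices(list(pool),
                          weights=[e / total for e in errors], k=batch_size)

def train_step(step: int, batch_size: int = 32) -> None:
    batch = sample_prioritized(batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s2 = torch.stack([b[3] for b in batch])
    with torch.no_grad():                 # bootstrapped TD target
        target = r + GAMMA * target_net(s2).max(dim=1).values
    q = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)   # the loss function L
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % SYNC_EVERY == 0:            # copy evaluation net -> target net
        target_net.load_state_dict(eval_net.state_dict())
```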
step six, to prevent the resource waste caused by some malicious users applying, when the time-of-use electricity price is low, for power dispatching requests that differ excessively from the power they actually need, a scheduling reputation value is introduced, stored on the blockchain and managed automatically by a smart contract; when applying for power dispatching, a user must upload past actual power data to the blockchain, together with the required load demand predicted by the smart meter and the DRL controller, for reputation-value calculation; the reputation value is defined as follows:
if users need power dispatching, a transaction request is sent to the blockchain; the grid company obtains from the blockchain the transaction information and the scheduling reputation values TR of the users applying for dispatching, and the grid department performs power dispatching in descending order of all users' scheduling reputation values;
the scheduling reputation value is updated as follows (the update formulas are presented as images in the original publication):
in those formulas, Q_value denotes the prediction evaluation value at the i-th scheduling, one symbol denotes the dispatching power quota requested by the controller at time t and another the quantity the user actually requires; α is a prediction deviation factor: because a prediction cannot be guaranteed to match the actual situation exactly, a deviation tolerance within a certain range is granted; α is related to the latest scheduling reputation value, and the higher the reputation value, the larger the deviation factor; N denotes the user's accumulated number of scheduling applications;
the reputation value is updated automatically by a smart contract and stored on the blockchain, supervised by all users participating in the power dispatching platform;
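Because the update formulas themselves are published as images, the following sketch is an assumed recency-weighted form: requests within the α tolerance of actual demand score full marks and larger deviations are penalized; the coefficients and the running-average shape are illustrative only:

```python
def updated_reputation(tr_prev: float, requested: float, actual: float,
                       alpha: float, n: int) -> float:
    """Assumed update: score the latest application by its relative
    deviation from actual demand, then fold it into a running average
    over the user's n accumulated scheduling applications."""
    deviation = abs(requested - actual) / max(actual, 1e-9)
    score = 1.0 if deviation <= alpha else max(0.0, 1.0 - deviation)
    return (tr_prev * n + score) / (n + 1)

def deviation_tolerance(tr_latest: float, base: float = 0.05,
                        gain: float = 0.10) -> float:
    """alpha grows with the latest reputation value (higher reputation,
    larger permitted deviation); both coefficients are illustrative."""
    return base + gain * tr_latest
```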
step seven, the user encrypts the scheduling information and power information with the user's public key and uploads them to the blockchain for storage, and the user reputation value is likewise stored on the chain (the on-chain record layout is presented as an image in the original publication);
so that updates of the user reputation value can be supervised by all users, the reputation value is stored in plaintext on the blockchain, while the power parameters and scheduling requests of the user's equipment are stored encrypted under the user's public key; only the power department and the user can decrypt the data with the private key, and by accessing the blockchain they can trace past changes to power dispatching requests and power equipment parameters;
step eight, through step five, the agent trained on the data predicts the user's power load for the next time period; if there is surplus electric quantity, the user need not make a power declaration and the surplus power is stored for the next period; if the agent predicts that the user's power load in the next period will be too large to be self-sufficient, it declares power to the grid department according to the predicted power quota, and dispatching is completed after the power department's review; for the multiple dispatching applications received, the power dispatching center must dispatch in an orderly manner to ensure safe, stable grid operation and reliable power supply; the dispatching information is stored on the chain, the dispatching center sorts the applications in descending order of the user reputation values stored in the blockchain, and users with high reputation values are reviewed and dispatched first; to ensure that reputation values are updated fairly and impartially, they are managed by the reputation-update smart contract of step six and updated in the next time period from actual power load data according to step six.
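A short sketch of this reputation-ordered review follows; the capacity cut-off used to decide which applications are granted is our assumption about the review, not a rule stated in the patent:

```python
from dataclasses import dataclass

@dataclass
class DispatchApplication:
    uid: str           # applying user
    qty: float         # predicted electric quantity requested
    reputation: float  # scheduling reputation value TR read from the chain

def dispatch_order(apps: list[DispatchApplication],
                   capacity: float) -> list[DispatchApplication]:
    """Review applications in descending reputation order and grant them
    while dispatchable capacity remains."""
    granted, remaining = [], capacity
    for app in sorted(apps, key=lambda a: a.reputation, reverse=True):
        if app.qty <= remaining:
            granted.append(app)
            remaining -= app.qty
    return granted
```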
The technical scheme focuses on load prediction and on optimizing the user's electricity-purchase scheme using blockchain technology and deep reinforcement learning in a distributed environment. Its technical advantages are as follows:
1. Blockchain technology is used to solve data sharing in a distributed environment. As distributed generation equipment becomes increasingly common, the previously centralized approach makes data sharing in a distributed environment very inefficient. Generating the power-microgrid blockchain data involves data acquisition, fusion-sharing management, contract transactions, access control and data consensus. Data from physical equipment, including structured, unstructured and semi-structured data, is acquired by sensors and other acquisition devices and stored distributedly at different locations using blockchain technology. The collected big data is then fusion-managed on the basis of the blockchain, achieving aggregation and sharing of data from different sources.
2. Deep reinforcement learning is used to predict the user load and optimize power dispatching. Traditional deep learning and reinforcement learning methods lack scalability, make decisions inefficiently in high-dimensional spaces, and struggle to handle continuous input variables directly. Deep reinforcement learning, which combines deep learning's strong perception with reinforcement learning's strong decision-making, is therefore well suited to the complex power-market environment, where the state and action spaces are high-dimensional.
The method performs fusion management of big data collected on the basis of blockchain technology, realizes aggregation and sharing of data from different sources, balances power supply and demand, and improves the energy management system.
Drawings
FIG. 1 is a schematic flow chart of an embodiment;
FIG. 2 is a flow diagram of a DRL in an embodiment;
FIG. 3 is a diagram of the block structure in the power blockchain in the embodiment;
fig. 4 is a schematic diagram of a discretization process of a continuous load in the embodiment.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Example:
With the continuous opening of the power market, a large number of electricity-selling enterprises are gradually participating in power-market transactions. Current research constructs load-optimization decision models for electricity sellers based on methods such as stochastic programming, robust optimization and artificial intelligence, in order to make full use of the available uncertainty information and improve the effectiveness of decision results. An optimization method based on reinforcement learning is well suited to user load prediction: a data-driven artificial-intelligence algorithm can fully mine the uncertainty patterns of factors such as electricity price and real-time load in the power spot market, take multiple decision variables into account to the greatest extent, and improve the decision effect. Meanwhile, to adapt to the complex power-market environment, a deep reinforcement learning algorithm, which builds on reinforcement learning with the idea of neural networks, is adopted to optimize the electricity seller's action strategy.
In the embodiment, the DRL agent is trained with historical data; through the agent, a user can know the load usage of the next time period in advance and make a power-dispatching declaration or store surplus power ahead of time, thereby improving the energy management system.
Referring to fig. 1, a power dispatching method based on blockchain and deep reinforcement learning includes the following steps:
step one, a user registers an account and adds the equipment to be managed, obtaining a user id and equipment ids; the grid department, acting as a trusted and secure certificate authority, distributes public/private key pairs to users;
step two, the smart meter submits the power parameters of the registered equipment, encrypts them with the user's public key and uploads them to the chain in the following data format:
<Uid, ID, Pc, P, Se, C, V>,
where Uid is the user id, ID is the equipment id, Pc is the user public key, P is the equipment's generated power, Se is the equipment state, C is the equipment current and V is the equipment voltage;
data analysis is performed on the power data, processing the user's data for each stage of the power market; the on-chain data, after filtering, de-duplication, error-correction and discretization preprocessing, serves as training samples for the subsequent network;
step three, setting the DRL state space and action space:
the state space S consists of the real-time load power P_load(t) at time t, the time-of-use electricity price TOU(t) at time t, and the state of charge SOC(t) of the energy storage device at time t;
where SOC(t) is defined as:
SOC(t) = SOC(t-1) + P_load(t)·Δt / E_b,
where E_b is the maximum capacity of the energy storage device;
the state space S is defined as:
S = {P_load(t), TOU(t), SOC(t)};
the action space Act is defined by a formula presented as an image in the original publication; the action is the load grade predicted for the next time period, obtained by discretizing the continuous load as shown in fig. 4, through a procedure whose formulas are likewise presented as images in the original publication. In those formulas, P_min denotes the minimum permissible load prediction, P_max the maximum permissible load prediction, and a further symbol denotes the mean value; the load grade is selected by the agent from among A-Z;
B(t) > 0 indicates that the user has surplus electric energy and the energy storage device is charged with quantity B(t); B(t) < 0 indicates that the user has a load shortfall and must apply to the grid for power dispatching, the quantity to be dispatched being B(t);
step four, setting the DRL reward function R(t) and a user constraint and penalty mechanism:
in the learning stage, the deep reinforcement learning algorithm must determine the update direction and magnitude of the controller parameters according to the reward value returned by the external environment; in the market environment, the goal of the optimized control is to minimize the long-term electricity-purchase cost of electricity-buying users and to reduce the operating cost of existing equipment:
(I) P is the penalty cost, mainly reflecting the degree to which a user violates the operation constraints; to prevent malicious users from submitting extreme load prediction values in pursuit of profit, the operation constraints are:
(1) load capacity limit constraint (formula presented as an image in the original publication);
(2) deviation electric quantity constraint (formula presented as an image in the original publication);
where P_allow denotes the maximum permitted deviation;
in the optimization process, if a user violates a constraint condition, the user pays a penalty according to the degree of the violation and the reward is reduced; the penalty cost is calculated as:
P = P_1 + P_2 + P_3,
where the expressions for P_1, P_2 and P_3 are presented as images in the original publication and ρ and δ are the corresponding penalty coefficients;
(II) operating cost:
f_g(t) = TOU(t)·Δt·(P_load - B_store),
where B_store denotes the energy stored in the energy storage device; if the user has no energy storage device, B_store is 0;
(III) reward function: with the user owning N devices, the reward R(t) is given by a formula presented as an image in the original publication;
step five, DRL training based on the improved DQN, as shown in fig. 2:
(1) experience accumulation: in the power-market environment, the evaluation network outputs the Q values of all actions in the action space Act according to the user state S_t; an action a_t is selected according to a greedy strategy, the market feeds back the reward r_t, and the next state S_{t+1} is obtained, yielding a complete Markov tuple (S_t, a_t, r_t, S_{t+1}); the tuple is placed in the experience pool as a sample, and this is repeated until the number of samples reaches the configured size of the experience pool;
(2) updating the parameters of the Q function: samples are drawn from the experience pool by a prioritized sampling algorithm based on the TD-error of each sample, so that samples with larger TD-error are selected with higher probability for training;
(3) training the neural network: a loss function L of the neural network is constructed for training; after every N complete training rounds of the evaluation network, its parameters are copied in full to the target network;
(4) optimizing parameters: if the benefit obtained by the controller is no longer increasing and remains stable over a long period, the evaluation network parameters have converged; otherwise steps (1) to (4) are repeated;
(5) the DRL outputs a power operation execution command in the following format:
<Uid, Pk, ID, Op, Qty>,
where Uid is the user id, Pk is the user public key, ID is the equipment id, Op is the operation on the equipment (applying for dispatching or storing electric quantity), and Qty is the predicted electric quantity;
step six, to prevent the resource waste caused by some malicious users applying, when the time-of-use electricity price is low, for power dispatching requests that differ excessively from the power they actually need, a scheduling reputation value is introduced, stored on the blockchain and managed automatically by a smart contract; when applying for power dispatching, a user must upload past actual power data to the blockchain, together with the required load demand predicted by the smart meter and the DRL controller, for reputation-value calculation; the reputation value is defined as follows:
if users need power dispatching, a transaction request is sent to the blockchain; the grid company obtains from the blockchain the transaction information and the scheduling reputation values TR of the users applying for dispatching, and the grid department performs power dispatching in descending order of all users' scheduling reputation values;
the scheduling reputation value is updated as follows (the update formulas are presented as images in the original publication):
in those formulas, Q_value denotes the prediction evaluation value at the i-th scheduling, one symbol denotes the dispatching power quota requested by the controller at time t and another the quantity the user actually requires; α is a prediction deviation factor: because a prediction cannot be guaranteed to match the actual situation exactly, a deviation tolerance within a certain range is granted; α is related to the latest scheduling reputation value, and the higher the reputation value, the larger the deviation factor; N denotes the user's accumulated number of scheduling applications;
the reputation value is updated automatically by a smart contract and stored on the blockchain, supervised by all users participating in the power dispatching platform;
step seven, the user encrypts the scheduling information and power information with the user's public key and uploads them to the blockchain for storage, and the user reputation value is likewise stored on the chain; the block structure in the power blockchain is shown in fig. 3 (the on-chain record layout is presented as an image in the original publication);
so that updates of the user reputation value can be supervised by all users, the reputation value is stored in plaintext on the blockchain, while the power parameters and scheduling requests of the user's equipment are stored encrypted under the user's public key; only the power department and the user can decrypt the data with the private key, and by accessing the blockchain they can trace past changes to power dispatching requests and power equipment parameters;
step eight, through step five, the agent trained on the data predicts the user's power load for the next time period; if there is surplus electric quantity, the user need not make a power declaration and the surplus power is stored for the next period; if the agent predicts that the user's power load in the next period will be too large to be self-sufficient, it declares power to the grid department according to the predicted power quota, and dispatching is completed after the power department's review; for the multiple dispatching applications received, the power dispatching center must dispatch in an orderly manner to ensure safe, stable grid operation and reliable power supply; the dispatching information is stored on the chain, the dispatching center sorts the applications in descending order of the user reputation values stored in the blockchain, and users with high reputation values are reviewed and dispatched first; to ensure that reputation values are updated fairly and impartially, they are managed by the reputation-update smart contract of step six and updated in the next time period from actual power load data according to step six.

Claims (1)

1. A power dispatching method based on blockchain and deep reinforcement learning, characterized by comprising the following steps:
step one, a user registers an account and adds the equipment to be managed, obtaining a user id and equipment ids; the grid department, acting as a trusted and secure certificate authority, distributes public/private key pairs to users;
step two, the smart meter submits the power parameters of the registered equipment, encrypts them with the user's public key and uploads them to the chain in the following data format:
<Uid, ID, Pc, P, Se, C, V>,
where Uid is the user id, ID is the equipment id, Pc is the user public key, P is the equipment's generated power, Se is the equipment state, C is the equipment current and V is the equipment voltage;
data analysis is performed on the power data, processing the user's data for each stage of the power market; the on-chain data, after preprocessing operations such as filtering, de-duplication, error correction and discretization, serves as training samples for the subsequent network;
step three, setting the DRL state space and action space:
the state space S consists of the real-time load power P_load(t) at time t, the time-of-use electricity price TOU(t) at time t, and the state of charge SOC(t) of the energy storage device at time t;
where SOC(t) is defined as:
SOC(t) = SOC(t-1) + P_load(t)·Δt / E_b,
where E_b is the maximum capacity of the energy storage device;
the state space S is defined as:
S = {P_load(t), TOU(t), SOC(t)};
the action space Act is defined by a formula presented as an image in the original publication; the action is the load prediction value for the next time period, obtained by discretizing the continuous load through a procedure whose formulas are likewise presented as images in the original publication. In those formulas, P_min denotes the minimum permissible load prediction, P_max the maximum permissible load prediction, and a further symbol denotes the mean value; the load grade is selected by the agent from among A-Z;
B(t) > 0 indicates that the user has surplus electric energy and the energy storage device is charged with quantity B(t); B(t) < 0 indicates that the user has a load shortfall and must apply to the grid for power dispatching, the quantity to be dispatched being B(t);
step four, setting the DRL reward function R(t) and a user constraint and penalty mechanism:
in the learning stage, the deep reinforcement learning algorithm must determine the update direction and magnitude of the controller parameters according to the reward value returned by the external environment; in the market environment, the goal of the optimized control is to minimize the long-term electricity-purchase cost of electricity-buying users and to reduce the operating cost of existing equipment:
(I) P is the penalty cost, mainly reflecting the degree to which a user violates the operation constraints; to prevent malicious users from submitting extreme load prediction values in pursuit of profit, the operation constraints are:
(1) load capacity limit constraint (formula presented as an image in the original publication);
(2) deviation electric quantity constraint (formula presented as an image in the original publication);
where P_allow denotes the maximum permitted deviation;
in the optimization process, if a user violates a constraint condition, the user pays a penalty according to the degree of the violation and the reward is reduced; the penalty cost is calculated as:
P = P_1 + P_2 + P_3,
where the expressions for P_1, P_2 and P_3 are presented as images in the original publication and ρ and δ are the corresponding penalty coefficients;
(II) operating cost:
f_g(t) = TOU(t)·Δt·(P_load - B_store),
where B_store denotes the energy stored in the energy storage device; if the user has no energy storage device, B_store is 0;
(III) reward function: with the user owning N devices, the reward R(t) is given by a formula presented as an image in the original publication;
step five, DRL training based on the improved DQN:
(1) experience accumulation: in the power-market environment, the evaluation network outputs the Q values of all actions in the action space Act according to the user state S_t; an action a_t is selected according to a greedy strategy, the market feeds back the reward r_t, and the next state S_{t+1} is obtained, yielding a complete Markov tuple (S_t, a_t, r_t, S_{t+1}); the tuple is placed in the experience pool as a sample, and this is repeated until the number of samples reaches the configured size of the experience pool;
(2) updating the parameters of the Q function: samples are drawn from the experience pool by a prioritized sampling algorithm based on the TD-error of each sample, so that samples with larger TD-error are selected with higher probability for training;
(3) training the neural network: a loss function L of the neural network is constructed for training; after every N complete training rounds of the evaluation network, its parameters are copied in full to the target network;
(4) optimizing parameters: if the benefit obtained by the controller is no longer increasing and remains stable over a long period, the evaluation network parameters have converged; otherwise steps (1) to (4) are repeated;
(5) the DRL outputs a power operation execution command in the following format:
<Uid, Pk, ID, Op, Qty>,
where Uid is the user id, Pk is the user public key, ID is the equipment id, Op is the operation on the equipment (applying for dispatching or storing electric quantity), and Qty is the predicted electric quantity;
step six, to prevent the resource waste caused by some malicious users applying, when the time-of-use electricity price is low, for power dispatching requests that differ excessively from the power they actually need, a scheduling reputation value is introduced, stored on the blockchain and managed automatically by a smart contract; when applying for power dispatching, a user must upload past actual power data to the blockchain, together with the required load demand predicted by the smart meter and the DRL controller, for reputation-value calculation; the reputation value is defined as follows:
if users need power dispatching, a transaction request is sent to the blockchain; the grid company obtains from the blockchain the transaction information and the scheduling reputation values TR of the users applying for dispatching, and the grid department performs power dispatching in descending order of all users' scheduling reputation values;
the scheduling reputation value is updated as follows (the update formulas are presented as images in the original publication):
in those formulas, Q_value denotes the prediction evaluation value at the i-th scheduling, one symbol denotes the dispatching power quota requested by the controller at time t and another the quantity the user actually requires; α is a prediction deviation factor related to the latest scheduling reputation value, and the higher the reputation value, the larger the deviation factor; N denotes the user's accumulated number of scheduling applications;
the reputation value is updated automatically by a smart contract and stored on the blockchain, supervised by all users participating in the power dispatching platform;
step seven, the user encrypts the scheduling information and power information with the user's public key and uploads them to the blockchain for storage, and the user reputation value is likewise stored on the chain (the on-chain record layout is presented as an image in the original publication);
so that updates of the user reputation value can be supervised by all users, the reputation value is stored in plaintext on the blockchain, while the power parameters and scheduling requests of the user's equipment are stored encrypted under the users' public keys; only the power department and the user can decrypt the data with the private key, and by accessing the blockchain they can trace past changes to power dispatching requests and power equipment parameters;
step eight, through step five, the agent trained on the data predicts the user's power load for the next time period; if there is surplus electric quantity, the user need not make a power declaration and the surplus power is stored for the next period; if the agent predicts that the user's power load in the next period will be too large to be self-sufficient, it declares power to the grid department according to the predicted power quota, and dispatching is completed after the power department's review; for the multiple dispatching applications received by the grid department, the grid dispatching center must dispatch in an orderly manner to ensure safe, stable grid operation and reliable power supply; the dispatching information is stored on the chain, the grid department sorts the applications in descending order of the user reputation values stored in the blockchain, and users with high reputation values are reviewed and dispatched first; to ensure that reputation values are updated fairly and impartially, they are managed by the reputation-update smart contract of step six and updated in the next time period from actual power load data according to step six.
CN202211167510.9A, priority date 2022-09-23, filing date 2022-09-23: Power dispatching method based on block chain and deep reinforcement learning; status Withdrawn; published as CN115438873A (en)

Priority Applications (1)

Application Number: CN202211167510.9A; Priority Date: 2022-09-23; Filing Date: 2022-09-23; Title: Power dispatching method based on block chain and deep reinforcement learning (en)

Applications Claiming Priority (1)

Application Number: CN202211167510.9A; Priority Date: 2022-09-23; Filing Date: 2022-09-23; Title: Power dispatching method based on block chain and deep reinforcement learning (en)

Publications (1)

Publication Number Publication Date
CN115438873A 2022-12-06

Family

ID=84249735

Family Applications (1)

Application Number: CN202211167510.9A; Title: Power dispatching method based on block chain and deep reinforcement learning (en); Priority Date: 2022-09-23; Filing Date: 2022-09-23; Status: Withdrawn

Country Status (1)

Country Link
CN (1) CN115438873A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116307606A (en) * 2023-03-24 2023-06-23 华北电力大学 Shared energy storage flexible operation scheduling method based on block chain
CN116307606B (en) * 2023-03-24 2023-09-12 华北电力大学 Shared energy storage flexible operation scheduling method based on block chain
CN116233132A (en) * 2023-05-08 2023-06-06 成都理工大学 Energy block chain link point consensus method based on improved Raft consensus mechanism
CN116703009A (en) * 2023-08-08 2023-09-05 深圳航天科创泛在电气有限公司 Operation reference information generation method of photovoltaic power generation energy storage system
CN116703009B (en) * 2023-08-08 2024-01-09 深圳航天科创泛在电气有限公司 Operation reference information generation method of photovoltaic power generation energy storage system
CN117478306A (en) * 2023-12-28 2024-01-30 湖南天河国云科技有限公司 Block chain-based energy control method, storage medium and terminal equipment
CN117478306B (en) * 2023-12-28 2024-03-22 湖南天河国云科技有限公司 Block chain-based energy control method, storage medium and terminal equipment

Similar Documents

Publication Publication Date Title
Kirli et al. Smart contracts in energy systems: A systematic review of fundamental approaches and implementations
Zhang et al. Multi-agent safe policy learning for power management of networked microgrids
Yang et al. Automated demand response framework in ELNs: Decentralized scheduling and smart contract
Di Silvestre et al. Ancillary services in the energy blockchain for microgrids
CN115438873A (en) Power dispatching method based on block chain and deep reinforcement learning
Helseth et al. Detailed long‐term hydro‐thermal scheduling for expansion planning in the Nordic power system
Soares et al. Multi-dimensional signaling method for population-based metaheuristics: Solving the large-scale scheduling problem in smart grids
Dalal et al. Chance-constrained outage scheduling using a machine learning proxy
US20220179378A1 (en) Blockchain-Based Transactive Energy Systems
Ziras et al. A mid-term DSO market for capacity limits: How to estimate opportunity costs of aggregators?
Liu et al. A blockchain-based trustworthy collaborative power trading scheme for 5G-enabled social internet of vehicles
Zahid et al. Balancing electricity demand and supply in smart grids using blockchain
Kumar et al. Blockchain based optimized energy trading for e-mobility using quantum reinforcement learning
Veerasamy et al. Blockchain-based decentralized frequency control of microgrids using federated learning fractional-order recurrent neural network
Hou et al. A study on decentralized autonomous organizations based intelligent transportation system enabled by blockchain and smart contract
Medved et al. The use of intelligent aggregator agents for advanced control of demand response
CN102801524A (en) Trust-theory-based trusted service system based on trusted authentication system
El-adaway et al. Preliminary attempt toward better understanding the impact of distributed energy generation: An agent-based computational economics approach
Ngwira et al. Towards context-aware smart contracts for Blockchain IoT systems
Kell et al. A systematic literature review on machine learning for electricity market agent-based models
Robert et al. Economic emission dispatch of hydro‐thermal‐wind using CMQLSPSN technique
CN114977160A (en) Micro-grid group optimization operation strategy generation method, system, equipment and storage medium
Chatzidimitriou et al. Enhancing agent intelligence through evolving reservoir networks for predictions in power stock markets
Shang et al. An Information Security Solution for Vehicle-to-grid Scheduling by Distributed Edge Computing and Federated Deep Learning
An Game-theoretic methods for cost allocation and security in Smart Grid

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20221206)