CN112258039B

CN112258039B - Intelligent scheduling method for defective materials of power system based on reinforcement learning

Info

Publication number: CN112258039B
Application number: CN202011144804.0A
Authority: CN
Inventors: 俞虹; 唐诚旋; 蒋群群; 陈珏伊; 张秀; 程文美; 代洲; 徐一蝶
Original assignee: Guizhou Power Grid Co Ltd
Current assignee: Guizhou Power Grid Co Ltd
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2022-07-22
Anticipated expiration: 2040-10-23
Also published as: CN112258039A

Abstract

The invention discloses a reinforcement-learning-based intelligent scheduling method for defective materials of a power system. The method comprises: defining the state, decision, transition equation, reward function, demand and objective of the dynamic material-warehousing scheduling problem in reinforcement learning; solving the dynamic material-warehousing scheduling problem using a Markov decision process; formulating the Bellman equation for power grid defective materials and selecting a solution strategy; and modifying the Bellman equation into a data-driven online update form and determining the scheduling action based on an ε-greedy strategy. Based on a Markov stochastic process and reinforcement learning, the invention solves the joint control and scheduling problem for emergency materials of a power system; the end-to-end algorithm does not forecast demand but makes inventory control and scheduling decisions directly. The method has also been verified on a real data set, where it shows good convergence and gain, demonstrating its usability and practical value.

Description

Intelligent scheduling method for defective materials of power system based on reinforcement learning

Technical Field

The invention relates to the technical field of power grid and artificial intelligence scheduling, in particular to an intelligent scheduling method for defective materials of a power system based on reinforcement learning.

Background

Statistical optimization method: the distributions of the various emergency demands are modeled according to statistical regularities, and a warehousing allocation that is optimal on statistical average is computed through centralized mathematical modeling.

Data prediction method: based on data analysis and mining for each region, a sequence-to-sequence model is built for the different demands of each region using artificial intelligence and machine learning methods in order to forecast the time series; the warehousing system and scheduling are then laid out and optimized centrally on the basis of the forecast.

The statistical optimization method requires complete statistics of all demand distributions in the region, and the optimal allocation must be recomputed every time a state transition or an emergency occurs; it therefore consumes substantial computing resources, responds slowly, and has certain limitations. In the data prediction method, traditional feature selection is usually based on feature ranking: the importance and relevance of each feature are computed, and the top k features are taken as the input for demand forecasting. The greatest drawback of this approach is that the features with the highest importance and relevance do not necessarily represent the global information of the system well, so the forecasting system is not given the richest possible information. Moreover, the forecast is not the final result: a second computation is needed on top of the forecast to obtain the scheduling and control scheme, and the multi-step framework accumulates errors that bias the final result.

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made in view of the above-mentioned conventional problems.

Therefore, the invention provides a reinforcement-learning-based intelligent scheduling method for defective materials of a power system, which can solve the joint control and scheduling problem for emergency materials of the power system.

To solve the above technical problems, the invention provides the following technical scheme: defining the state, decision, transition equation, reward function, demand and objective of the dynamic material-warehousing scheduling problem in reinforcement learning; solving the dynamic material-warehousing scheduling problem using a Markov decision process; formulating the Bellman equation for power grid defective materials and selecting a solution strategy; and modifying the Bellman equation into a data-driven online update form and determining the scheduling action based on an ε-greedy strategy.

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: the state at the current time is defined as the materials stored in each warehouse, S_t = Z ∈ R^{n×m}, where Z_{i,j} denotes the quantity of material j in warehouse i at the current time.

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: according to the current state S_t ∈ R^{n×m} and the demand Q ∈ R^{n×m}, the warehousing system determines the scheduling scheme X and the purchasing scheme B at this time, where X_{i,j} and B_{i,j} denote, respectively, the outbound quantity and the purchase quantity of material j in warehouse i at the current time.

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: after the warehousing system decides on a scheduling and purchasing scheme, the warehousing state undergoes a random state transition at the next time, and the transition equation is expressed as:

S_{t+1} = Z - X + B

where, because stored materials cannot physically be negative and storage space is always limited, a valid decision (X, B) must satisfy the following inequality:

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: the main objective of the warehousing system is to meet emergency material demand within and between regions, and the reward function is the lost income at the current time minus the cost of purchasing materials, as follows:

where the symbol (x)^- is defined as:

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention, the method comprises:

where γ ∈ [0,1) is the attenuation factor (discount factor).

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: the Bellman equation is changed into a data-driven online update form, as follows:

V(S_t) ← (1 - α_t) V(S_t) + α_t [r_t + γ V(S_{t+1})]

where α_t is the learning rate at time t.

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention: the action is determined using the ε-greedy strategy, and with probability 1-ε the warehouse takes the currently best action so as to maximize V(S_t).

In a preferred scheme of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the invention, the method further comprises randomly selecting an action with probability ε, as follows:

where the random actions provide exploration: exploring produces a variety of good and bad data from which knowledge is learned, thereby improving the current strategy.

The beneficial effects of the invention are as follows: based on a Markov stochastic process and reinforcement learning, the invention solves the joint control and scheduling problem for emergency materials of a power system; the end-to-end algorithm does not forecast demand but makes inventory control and scheduling decisions directly. The proposed algorithm is an "online" algorithm, i.e. inventory control and scheduling decisions rely only on observations of past events; it is also a "model-free" algorithm, independent of any assumed stochastic model of uncertain events. The method has also been verified on a real data set, where it shows good convergence and gain, demonstrating its usability and practical value.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. In the drawings:

Fig. 1 is a schematic flowchart of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the first embodiment of the present invention;

Fig. 2 is a schematic diagram of defective material scheduling in the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the first embodiment of the present invention;

Fig. 3 is a schematic diagram of reinforcement learning scheduling of power system defective materials in the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the first embodiment of the present invention;

Fig. 4 is a schematic diagram comparing the profit of the warehousing system under different warehousing capacities for the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the second embodiment of the present invention;

Fig. 5 is a schematic diagram comparing the profit of the warehousing system under different warehousing capacities (C) and attenuation coefficients (γ) for the reinforcement-learning-based intelligent scheduling method for defective materials of a power system according to the second embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention and that the present invention is not limited by the specific embodiments disclosed below.

Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not necessarily enlarged to scale, and are merely exemplary, which should not limit the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.

Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.

Example 1

Referring to Figs. 1, 2 and 3, a first embodiment of the present invention provides a reinforcement-learning-based intelligent scheduling method for defective materials of a power system, including:

S1: Define the state, decision, transition equation, reward function, demand and objective of the dynamic material-warehousing scheduling problem in reinforcement learning. It should be noted that:

the state, decision, transition equation, reward function, demand and objective of the dynamic material-warehousing scheduling problem are defined within the reinforcement learning algorithm for power system defective material scheduling;

the state is the warehousing state and the material defect state at time t, the decision is the scheduling mode and purchasing mode adopted at that time, and the transition equation describes how the state changes from before the decision to after it;

the state at the current time is defined as the materials stored in each warehouse, S_t = Z ∈ R^{n×m};

where Z_{i,j} denotes the quantity of material j in warehouse i at the current time;

according to the current state S_t ∈ R^{n×m} and the demand Q ∈ R^{n×m}, the warehousing system determines the scheduling scheme X and the purchasing scheme B at this time, where X_{i,j} and B_{i,j} denote, respectively, the outbound quantity and the purchase quantity of material j in warehouse i at the current time.
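The quantities defined in S1 can be pictured as n×m arrays. The following is a minimal sketch only: the sizes, the numbers, and the rule used to fill X and B are illustrative assumptions, not values or rules taken from the patent.

```python
import numpy as np

n_warehouses, n_materials = 3, 2   # hypothetical sizes n and m

# State S_t = Z: Z[i, j] is the stock of material j in warehouse i.
Z = np.array([[5, 2], [0, 4], [3, 3]])
# Demand Q has the same n x m shape as the state.
Q = np.array([[1, 0], [2, 1], [0, 0]])

# Decision (X, B): X[i, j] units leave warehouse i, B[i, j] units are purchased.
# Illustrative rule (an assumption): ship what is demanded and in stock, buy nothing.
X = np.minimum(Z, Q)
B = np.zeros_like(Z)
```

Here every array shares the shape (n, m), which is what lets the state, demand and decision be combined element-wise in the later steps.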

S2: Solve the dynamic material-warehousing scheduling problem using a Markov decision process. It should be noted that in this step:

after the warehousing system decides on the scheduling and purchasing scheme, the warehousing state undergoes a random state transition at the next time, and the transition equation is expressed as:

S_{t+1} = Z - X + B
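The transition equation can be sketched in a few lines of Python. The feasibility test below is an assumption: it encodes the stated requirements that stock cannot be negative and storage space is limited, using a hypothetical per-cell `capacity` bound, and is not the patent's own inequality.

```python
import numpy as np

def transition(Z, X, B):
    """Apply the transition equation S_{t+1} = Z - X + B."""
    return Z - X + B

def is_feasible(Z, X, B, capacity):
    """Assumed check: decisions are non-negative and stock stays in [0, capacity]."""
    nxt = transition(Z, X, B)
    return bool(np.all(X >= 0) and np.all(B >= 0)
                and np.all(nxt >= 0) and np.all(nxt <= capacity))

Z = np.array([[5, 2], [0, 4]])   # current stock
X = np.array([[1, 0], [0, 3]])   # outbound quantities
B = np.array([[0, 2], [4, 0]])   # purchased quantities
S_next = transition(Z, X, B)
```

With these example numbers the next state is [[4, 4], [4, 1]], which passes the assumed check for a capacity of 6 and fails it for a capacity of 3.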

s3: listing Bellman equation aiming at power grid defect materials and selecting a solving strategy. Among them, it is also to be noted that:

the main objective of the warehousing system is to meet the problem of emergency material demand in regions and between regions, and then the reward function is to subtract the cost of purchasing materials from the lost income at the current moment as follows:

wherein, the symbol (x)^-Comprises the following steps:

solving the MDP problem, then:

wherein γ ∈ [0,1) is an attenuation Factor (Discount Factor).
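For orientation, the standard Bellman optimality equation of a discounted Markov decision process, written with the symbols defined above, has the form below. This is the textbook form and is offered only as a sketch; that the patent's equation matches it term for term is an assumption.

```latex
V(S_t) = \max_{(X,\,B)} \; \mathbb{E}\left[\, r_t + \gamma \, V(S_{t+1}) \,\right],
\qquad S_{t+1} = Z - X + B, \qquad \gamma \in [0, 1)
```

The maximization runs over the valid decisions (X, B), r_t is the reward at time t, and the discount γ weights the value of the next state.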

S4: Modify the Bellman equation into a data-driven online update form, and determine the scheduling action based on an ε-greedy strategy. It should be further noted that in this step:

the Bellman equation is modified into a data-driven online update form as follows:

V(S_t) ← (1 - α_t) V(S_t) + α_t [r_t + γ V(S_{t+1})]

where α_t is the learning rate at time t;

the action is determined using the ε-greedy strategy, and with probability 1-ε the warehouse takes the currently best action so as to maximize V(S_t);

with probability ε, an action is selected at random, as follows:

where the random actions provide exploration: exploring produces a variety of good and bad data from which knowledge is learned, thereby improving the current strategy.
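The ε-greedy rule described in S4 can be sketched as follows. The function name `epsilon_greedy`, the candidate actions and their estimated values are hypothetical and serve only to illustrate the exploit/explore split.

```python
import random

def epsilon_greedy(actions, value_of, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(actions)        # explore
    return max(actions, key=value_of)     # exploit the current estimate

# Hypothetical candidate decisions (ship, buy) with assumed value estimates.
actions = [(0, 0), (1, 0), (1, 2)]
estimated = {(0, 0): 0.5, (1, 0): 2.0, (1, 2): 1.2}

greedy_pick = epsilon_greedy(actions, estimated.get, epsilon=0.0)
random_pick = epsilon_greedy(actions, estimated.get, epsilon=1.0,
                             rng=random.Random(0))
```

Setting ε to 0 always exploits and setting it to 1 always explores; intermediate values trade the two off.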

Preferably, this embodiment formulates the Bellman equation for power grid defective materials and selects a solution strategy, where the Bellman equation is the mathematical form of the scheduling problem and the solution strategy is selected so that the optimal scheduling result is obtained more quickly. Modifying the Bellman equation into a data-driven online update form means that the required data, such as material demand data and online warehousing data, can be fed in in real time, so that the Bellman equation adapts to the updated state and better fits the dynamic warehousing and scheduling problem of power grid defective materials that the invention addresses; the scheduling action is then determined based on the ε-greedy strategy.
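The online update V(S_t) ← (1 - α_t) V(S_t) + α_t [r_t + γ V(S_{t+1})] can be applied one observation at a time. The sketch below assumes a tabular value function keyed by hashable states (a stock matrix could be turned into a key with `tuple(map(tuple, Z))`); the transitions, rewards, learning rate and discount are made-up illustrative values.

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha, gamma):
    """V(s) <- (1 - alpha) * V(s) + alpha * (r + gamma * V(s_next))."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
    return V[s]

V = defaultdict(float)   # value table; unseen states start at 0.0

# Hypothetical stream of observed transitions (state, reward, next state).
stream = [
    ((2, 1), 1.0, (1, 1)),
    ((1, 1), 0.0, (2, 1)),
    ((2, 1), 1.0, (1, 1)),
]
for s, r, s_next in stream:
    td_update(V, s, r, s_next, alpha=0.5, gamma=0.8)
```

Each update uses only the transition just observed, so no demand forecast or model of the uncertain events is needed, in line with the "online" and "model-free" properties described for the method.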

Example 2

Referring to Figs. 4 and 5, a second embodiment of the present invention, which differs from the first embodiment, provides a verification of the reinforcement-learning-based intelligent scheduling method for defective materials of a power system, including:

To better verify and explain the technical effect of the method, this embodiment selects a traditional intelligent scheduling method based on a greedy algorithm for a comparison test against the present method, and compares the test results by scientific demonstration to verify the real effect of the method.

The traditional greedy-algorithm-based intelligent scheduling method has low convergence and gain. To verify that the present method has higher gain and better convergence than the traditional method, the traditional method and the present method are measured and compared in real time in this embodiment.

Test conditions: (1) the monthly emergency material demand, i.e. the monthly defective material demand, was collected for 15 areas within or around the jurisdiction of Guiyang City, Guizhou Province: Baiyun, Chengbei, Huaxi, Huishui, Jinyang, Kaiyang, Longli, Nanming, Qingzhen, Shuanglong, Wudang, Xifeng, Xiaohe, Xiuwen and Yunyan;

(2) the state transition equation of the invention is adapted to the specific problem so as to suit the application scenario described in this embodiment;

(3) the automatic test equipment is started, the simulation is run in MATLAB, and the curve diagrams are output.

Referring to Fig. 4, the solid line is the curve output by the method of the present invention and the dotted line is the curve output by the traditional method. As the warehouse storage capacity increases, the average profit of both curves increases, but Fig. 4 shows that the solid line rises more markedly than the dotted line and always stays above it, which indicates that the method of the present invention achieves higher gain than the traditional method.

Referring to Fig. 5, it can be seen that the profit of the method of the present invention keeps an increasing trend as the warehouse storage capacity increases under the different attenuation coefficients γ; the profit is lowest when γ is 0.95 and highest when γ is 0.8. On this basis, the superiority of the method of the present invention is verified.

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A reinforcement-learning-based intelligent scheduling method for defective materials of a power system, characterized by comprising:

defining the state, decision, transition equation, reward function, demand and objective of the dynamic material-warehousing scheduling problem in reinforcement learning;

solving the dynamic material-warehousing scheduling problem using a Markov decision process;

formulating the Bellman equation for power grid defective materials and selecting a solution strategy;

modifying the Bellman equation into a data-driven online update form, and determining the scheduling action based on an ε-greedy strategy;

comprising,

where γ ∈ [0,1) is the attenuation factor;

the objective of the warehousing system is to meet emergency material demand within and between regions, and the reward function is the lost income at the current time minus the cost of purchasing materials, as follows:

where the symbol (x)^- is defined as:

where the state at the current time is defined, S_t denotes the materials stored in each warehouse, Z_{i,j} denotes the quantity of material j in warehouse i at the current time, X_{i,j} and B_{i,j} denote, respectively, the outbound quantity and the purchase quantity of material j in warehouse i at the current time, S_{t+1} denotes the transition equation, X and B denote the scheduling scheme and the purchasing scheme determined by the warehousing system at that time, the current state S_t ∈ R^{n×m}, and the demand Q ∈ R^{n×m}.

2. The reinforcement-learning-based intelligent scheduling method for defective materials of a power system as claimed in claim 1, characterized by comprising:

S_{t+1} = Z - X + B.

3. The reinforcement-learning-based intelligent scheduling method for defective materials of a power system as claimed in claim 2, characterized by comprising:

changing the Bellman equation into a data-driven online update form as follows:

V(S_t) ← (1 - α_t) V(S_t) + α_t [r_t + γ V(S_{t+1})]

where α_t is the learning rate at time t.

4. The reinforcement-learning-based intelligent scheduling method for defective materials of a power system as claimed in claim 3, characterized in that: the action is determined using the ε-greedy strategy, and with probability 1-ε the warehouse takes the currently best action so as to maximize V(S_t).

5. The reinforcement-learning-based intelligent scheduling method for defective materials of a power system as claimed in claim 4, characterized by further comprising:

with probability ε, selecting an action at random, as follows:

where the randomly selected actions provide exploration: exploring produces a variety of good and bad data from which knowledge is learned, thereby improving the current strategy.