CN112381359B - Multi-critic reinforcement learning power economy scheduling method based on data mining - Google Patents

Multi-critic reinforcement learning power economy scheduling method based on data mining

Info

Publication number
CN112381359B
CN112381359B
Authority
CN
China
Prior art keywords
critic
network
sample
reinforcement learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011165889.0A
Other languages
Chinese (zh)
Other versions
CN112381359A (en)
Inventor
郑旭彬
刘林鹏
刘少伟
朱建全
冯健
王斌
丁照洋
郭志龙
钟伟津
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Energy Storage Power Generating Co ltd
Original Assignee
Huizhou Energy Storage Power Generating Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Energy Storage Power Generating Co ltd filed Critical Huizhou Energy Storage Power Generating Co ltd
Priority to CN202011165889.0A priority Critical patent/CN112381359B/en
Publication of CN112381359A publication Critical patent/CN112381359A/en
Application granted granted Critical
Publication of CN112381359B publication Critical patent/CN112381359B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/219Managing data history or versioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Fuzzy Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which comprises the following steps: S1: converting the multi-period economic dispatching problem of the power system into a Markov decision process; S2: acquiring historical data of the power system, and constructing a multi-critic architecture deep reinforcement learning network according to the Markov decision process; S3: selecting samples from the historical data by a data mining method; S4: updating the parameters of the multi-critic architecture deep reinforcement learning network with the samples to obtain an optimized economic dispatching strategy of the power system; S5: judging whether the iteration end condition is reached; if so, ending the iteration to obtain the optimal economic dispatching strategy of the power system; if not, returning to step S3 for the next iteration. The method overcomes the large errors of existing methods for solving the power system economic dispatching problem.

Description

Multi-critic reinforcement learning power economy scheduling method based on data mining
Technical Field
The invention relates to the technical field of power economic dispatching, and in particular to a multi-critic reinforcement learning power economic dispatching method based on data mining.
Background
Efficient economic dispatch management of the power system is of great significance to its economic and safe operation. Existing power system economic dispatching methods can be divided into two categories: classical mathematical methods and artificial intelligence methods. Classical mathematical methods depend heavily on a mathematical model of power system economic dispatching; because that model is a non-convex stochastic optimization problem that is difficult to solve directly for an optimal solution, these methods require certain assumptions on the model and require converting the original stochastic problem into a deterministic one, and these assumptions can cause large modeling errors. In addition, such methods rely on forecasts of uncertain factors such as renewable energy output, electricity price and load, which are generally difficult to predict accurately, and this also introduces errors into the calculation results. Artificial intelligence methods mainly include heuristic algorithms and reinforcement learning algorithms. Heuristic algorithms are generally slow and cannot guarantee convergence. The existing reinforcement learning algorithms for solving the power system economic dispatching problem are mostly value-based algorithms, such as Q-learning, which cannot handle optimization problems with continuous decision variables; the decision variables of the original problem therefore have to be discretized. If the number of discrete segments is too small, the result deviates greatly from the optimal solution; if it is large, the solving time increases greatly. Therefore, the existing methods for solving the power system economic dispatching problem have large errors.
In the prior art, for example, the Chinese patent published on 25 September 2020 with publication number CN111709672A discloses a virtual power plant economic scheduling method based on scenarios and deep reinforcement learning, in which a deep deterministic policy gradient algorithm determines the economic scheduling strategy of a virtual power plant (VPP) containing energy storage and a distribution network so that it operates stably under uncertainty; however, it does not adopt multiple critics to improve algorithm performance, and its error is large.
Disclosure of Invention
The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, aiming at overcoming the technical defect that the existing methods for solving the electric power system economic dispatching problems have large errors.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a power economy scheduling method based on multi-critic reinforcement learning of data mining comprises the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In this scheme, a data mining method enhances the utilization efficiency of historical data, multiple critics improve the deep reinforcement learning network, and the overestimation bias produced by function approximation during learning is reduced, so that an optimal decision is made for the power system economic scheduling problem.
Preferably, in step S1,
the Markov decision process objective function is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat}.
Preferably, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network. The critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network; Q equals the expected value of the future total return given state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t.
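As a purely illustrative aid (not part of the claimed method), the state, decision and return of one time period could be packed into arrays along the lines of the following Python sketch; the array ordering and the helper names are assumptions introduced here.

```python
import numpy as np

def build_state(P_g_prev, O_g_prev, soc, t):
    """State S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t): previous generator powers,
    previous start-stop states, current stored-energy level, and the period index."""
    return np.concatenate([np.asarray(P_g_prev), np.asarray(O_g_prev), [soc, t]])

def build_decision(O_t, P_bat_t):
    """Decision A_t = (O_t, P_{bat,t}): generator start-stop states and
    energy-storage charging/discharging power."""
    return np.concatenate([np.asarray(O_t), [P_bat_t]])

def period_return(C_t):
    """Return function r_t = -C_t: the negative of the cost of time period t."""
    return -C_t
```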
Preferably, the actor network is approximated using a four-layer deep neural network.
Preferably, the critic network is approximated using a three-layer deep neural network.
Preferably, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition.
Preferably, in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
Preferably, in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i.
Preferably, in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable.
Preferably, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration end condition is reached; otherwise, the iteration end condition is not reached.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which is characterized in that the utilization efficiency of historical data is enhanced by using the data mining method, a deep reinforcement learning network is improved by using the multi-critic, and overestimation deviation generated by an approximate function in the learning process is reduced, so that an optimal decision is made on the electric power system economic dispatching problem.
Drawings
FIG. 1 is a flow chart of the implementation steps of the technical scheme of the invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a power economy scheduling method based on multi-critic reinforcement learning of data mining includes the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In the specific implementation process, historical data are obtained by continuously interacting with the power system environment, a data mining method enhances the utilization efficiency of the historical data, multiple critics improve the deep reinforcement learning network, and the overestimation bias produced by function approximation during learning is reduced, so that an optimal decision is made for the power system economic dispatching problem.
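For orientation only, the interaction of steps S1 to S5 could be arranged in code roughly as in the sketch below; every object and method name here (env, agent, pool, reached_end_condition and so on) is a hypothetical placeholder rather than anything defined by the invention.

```python
def train(env, agent, pool, max_iterations=1000, M=64):
    """Skeleton of the iterative procedure: S1 is embodied by the environment `env`,
    S2 by the multi-critic agent `agent`; S3-S5 form the loop body."""
    for iteration in range(max_iterations):
        # collect fresh historical data by interacting with the power-system environment
        S = env.reset()
        done = False
        while not done:
            A = agent.act(S)                  # decision from the actor network
            S_next, r, done = env.step(A)     # transition and return value
            pool.store(S, A, r, S_next)       # keep (S, A, r, S') in the replay pool
            S = S_next

        # S3: select valuable samples from the historical data (data-mining step)
        batch = pool.sample_prioritized(M)

        # S4: update the critic and actor parameters with the selected samples
        agent.update_critics(batch)
        agent.update_actor(batch)

        # S5: stop once the scheduling strategy satisfies the end condition
        if agent.reached_end_condition():
            break
    return agent
```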
More specifically, in step S1,
the Markov decision process objective function is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat}.
In a specific implementation, the objective function is to minimize the expectation of the total cost over all time periods (i.e., to maximize the expectation of the total return) by selecting a suitable set of decision variables in each time period. Because the probability distributions of quantities such as renewable energy generation, electricity price and load in the power system are unknown, the state transition probability of the Markov decision process (MDP) is also unknown, and the problem therefore cannot be solved directly.
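A minimal sketch of how the objective value could be evaluated, assuming the generator costs are held in a T-by-G array and the storage costs in a length-T array; averaging over sampled scenarios to approximate the expectation is an illustrative assumption of this sketch, since the underlying probability distributions are unknown.

```python
import numpy as np

def scenario_cost(C_g, C_ess):
    """Total cost of one scenario: sum_t ( sum_g C_{g,t} + C_{ESS,t} ).
    C_g has shape (T, G); C_ess has shape (T,)."""
    return C_g.sum() + C_ess.sum()

def estimated_objective(scenarios):
    """The expectation in the objective, approximated by averaging the total cost
    over sampled scenarios; minimizing this equals maximizing the expected return."""
    return float(np.mean([scenario_cost(C_g, C_ess) for C_g, C_ess in scenarios]))
```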
More specifically, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network. The critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network; Q equals the expected value of the future total return given state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t.
In a specific implementation, the larger the number of critic networks, the longer the training time. The multi-critic architecture deep reinforcement learning network performs optimization management in the case where the power system economic dispatching model is unknown or cannot be solved directly, and yields the optimal economic dispatching strategy of the power system.
More specifically, the actor network is approximated using a four-layer deep neural network.
More specifically, the critic network is approximated using a three-layer deep neural network.
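By way of illustration, the four-layer actor and three-layer critic described above could be written in PyTorch roughly as follows; the layer widths, the ReLU/tanh activations, the state and decision dimensions, and the number of critics K are assumptions of this sketch, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Four-layer actor network mu(S | theta_mu): maps a state S to a decision A."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # layer 2
            nn.Linear(hidden, hidden), nn.ReLU(),      # layer 3
            nn.Linear(hidden, action_dim), nn.Tanh(),  # layer 4, outputs scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Three-layer critic network Q(S, A | theta_Q): maps (state, decision) to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),  # layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),                  # layer 2
            nn.Linear(hidden, 1),                                  # layer 3, scalar Q value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# A multi-critic agent keeps one actor and K >= 2 critics (dimensions are placeholders).
K = 2
actor = Actor(state_dim=8, action_dim=4)
critics = [Critic(state_dim=8, action_dim=4) for _ in range(K)]
```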
More specifically, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition.
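A minimal sketch of such an experience replay pool storing (S, A, r, S′) tuples; the capacity and the plain uniform `sample` method are assumptions, and the value-weighted selection of step S3 is sketched separately below.

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool holding historical transitions (S, A, r, S')."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def store(self, S, A, r, S_next):
        self.buffer.append((S, A, r, S_next))

    def sample(self, M):
        # plain uniform random sampling of M stored transitions
        return random.sample(list(self.buffer), M)

    def __len__(self):
        return len(self.buffer)
```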
More specifically, in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
In a specific implementation, samples are drawn from the experience replay pool by random sampling; at the same time, because different samples have different values, samples of higher value are selected to update the weight parameters of the actor network and the critic networks, so that the historical data are fully utilized and the algorithm is accelerated. A larger temporal difference error for a sample means that the gap between the current Q value and the target Q value is still large, so samples with larger temporal difference errors should be used more intensively to update the weight parameters of the critic networks. The formula p_i = |σ_i| / Σ_j |σ_j| makes samples with larger temporal difference errors more likely to be selected.
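A sketch, under the stated assumption that the selection probability is proportional to the magnitude of the temporal difference error, of how the sample indices could be drawn; the small additive constant that keeps every sample selectable is an assumption of this sketch.

```python
import numpy as np

def prioritized_indices(td_errors, M, rng=None):
    """Draw M indices from the replay pool with probability
    p_i = |sigma_i| / sum_j |sigma_j|, so samples with larger
    temporal difference errors are selected more often."""
    rng = rng or np.random.default_rng()
    sigma = np.abs(np.asarray(td_errors, dtype=float)) + 1e-6  # keep every sample selectable
    p = sigma / sigma.sum()
    return rng.choice(len(sigma), size=M, p=p, replace=False)
```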
More specifically, in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i.
In a specific implementation, the weight parameter θ^Q of the critic network is updated by gradient descent to minimize the temporal difference error. In practice, using a single critic network causes the approximate Q value to be persistently larger than the true Q value; taking the minimum Q value over the k critic networks for the update therefore markedly reduces the overestimation error in the Q value approximation and reduces the error introduced by the Q value function when deep reinforcement learning is used to solve the power system economic dispatching problem. The number of critic networks is at least two.
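A minimal PyTorch sketch of the critic update just described, in which the target y_i uses the minimum Q value over the critic networks; the discount value, optimizer objects, and tensor shapes are assumptions of the sketch.

```python
import torch

def update_critics(critics, critic_optims, actor, batch, gamma=0.99):
    """Gradient-descent update of each critic network on the mean squared
    temporal difference error, with y_i = r_i + gamma * min_k Q_k(S'_i, mu(S'_i))."""
    S, A, r, S_next = batch   # tensors of shape (M, ...); r has shape (M, 1)

    with torch.no_grad():
        A_next = actor(S_next)                                     # mu(S'_i | theta_mu)
        q_next = torch.min(
            torch.stack([c(S_next, A_next) for c in critics]), dim=0
        ).values                                                   # min over the k critic networks
        y = r + gamma * q_next                                     # target value y_i

    for critic, opt in zip(critics, critic_optims):
        loss = torch.mean((y - critic(S, A)) ** 2)                 # (1/M) * sum_i (y_i - Q(S_i, A_i))^2
        opt.zero_grad()
        loss.backward()
        opt.step()
```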
More specifically, in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable.
In a specific implementation, the weight parameters of the actor network and the critic networks are updated continuously as the agent interacts with the environment, and the optimal actor network, i.e., the optimal economic dispatching strategy of the power system, is finally obtained.
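A matching sketch of the actor update: gradient ascent on J, approximated by the sample mean of Q(S_i, μ(S_i)); using the first critic for this gradient is an assumption of the sketch, not something fixed by the invention.

```python
def update_actor(actor, actor_optim, critics, S):
    """Gradient-ascent step on J(theta_mu) ~= (1/M) sum_i Q(S_i, mu(S_i | theta_mu)),
    implemented as gradient descent on -J."""
    A = actor(S)                     # decisions proposed by the current actor
    J = critics[0](S, A).mean()      # sample-mean approximation of the expected total return
    loss = -J
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
```

Because the critic's parameters also sit in this computation graph, they receive gradients from `loss.backward()`, but only the actor optimizer applies a step here; the critic gradients are cleared again by `zero_grad()` at the start of the next critic update.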
More specifically, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
In a specific implementation, the reward value is the negative of the total cost of 24 time periods.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A power economy scheduling method based on data mining and multi-critic reinforcement learning, characterized by comprising the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
in step S1, the objective function of the Markov decision process is:
min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]
the constraint conditions to be met include: AC power flow constraints, generator ramp-up and ramp-down constraints, safe voltage constraints, and energy storage charging and discharging constraints;
wherein C_{g,t} represents the cost of generator g during time period t; C_{g,t} is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy storage charging and discharging power P_{bat};
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks; the actor network represents the mapping from the state variable S to the decision variable A and is denoted μ(S | θ^μ), where θ^μ is the weight parameter of the actor network; the critic network represents the mapping from the state and decision variables (S, A) to the state-decision value function Q and is denoted Q(S, A | θ^Q), where θ^Q is the weight parameter of the critic network, and Q equals the expected value of the future total return given state S and decision A; the state variable of time period t is S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t is A_t = (O_t, P_{bat,t}), and the return function of time period t is r_t = -C_t, where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the remaining energy storage capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy storage charging and discharging power in time period t, and C_t is the cost of time period t;
the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S′), wherein S′ represents the post-transition state obtained after decision A is made in state S, and r represents the return value obtained in the transition;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
in step S4, the update formula of the weight parameter θ^Q of the critic network is:
θ^Q ← arg min_{θ^Q} (1/M) Σ_{i=1}^{M} [ y_i − Q(S_i, A_i | θ^Q) ]²
where M represents the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value with the temporal difference error, obtained from the return under the current decision and the Q value of the next state: y_i = r_i + γ min_k Q_k[ S′_i, μ(S′_i | θ^μ) ], where r_i represents the return of sample i, γ is a discount coefficient between 0 and 1 used to adjust how strongly the algorithm weighs future returns, Q_k denotes the Q value of the k-th critic network, S_i represents the state variable of sample i, μ(S′_i | θ^μ) represents the decision given by the actor at the post-transition state of sample i, S′_i represents the post-transition state variable of sample i, and A_i represents the decision variable of sample i;
in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent to maximize the expected total return J, and the update formula is:
∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_{θ^μ} Q( S_i, μ(S_i | θ^μ) | θ^Q )
where the expected total return J is approximated by the mean over the randomly sampled samples, and μ(S_i | θ^μ) represents the mapping from the state variable of sample i to its decision variable;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
2. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, characterized in that the actor network is approximated by a four-layer deep neural network.
3. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, characterized in that the critic network is approximated by a three-layer deep neural network.
4. The power economy scheduling method based on data mining and multi-critic reinforcement learning according to claim 1, wherein in step S3,
the value of a sample is measured by its temporal difference error σ:
σ = r + Q(S′, A | θ^Q) − Q(S, A | θ^Q)
a sample is selected from the historical data by the data mining method according to its value, and the probability p_i of selecting sample i is:
p_i = |σ_i| / Σ_j |σ_j|
wherein σ_i is the temporal difference error of sample i.
5. The method according to claim 1, wherein in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
CN202011165889.0A 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining Active CN112381359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011165889.0A CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011165889.0A CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Publications (2)

Publication Number Publication Date
CN112381359A CN112381359A (en) 2021-02-19
CN112381359B (en) 2021-10-26

Family

ID=74577371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011165889.0A Active CN112381359B (en) 2020-10-27 2020-10-27 Multi-critic reinforcement learning power economy scheduling method based on data mining

Country Status (1)

Country Link
CN (1) CN112381359B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118477B (en) * 2022-06-22 2024-05-24 四川数字经济产业发展研究院 Smart grid state recovery method and system based on deep reinforcement learning
CN115775081B (en) * 2022-12-16 2023-10-03 华南理工大学 Random economic scheduling method, device and medium for electric power system
CN117200184B (en) * 2023-08-10 2024-04-09 国网浙江省电力有限公司金华供电公司 Virtual power plant load side resource multi-period regulation potential evaluation prediction method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784440A (en) * 2017-10-23 2018-03-09 国网辽宁省电力有限公司 A kind of power information system resource allocation system and method
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN110929948B (en) * 2019-11-29 2022-12-16 上海电力大学 Fully distributed intelligent power grid economic dispatching method based on deep reinforcement learning
CN111242443B (en) * 2020-01-06 2023-04-18 国网黑龙江省电力有限公司 Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
CN111725836B (en) * 2020-06-18 2024-05-17 上海电器科学研究所(集团)有限公司 Demand response control method based on deep reinforcement learning
CN111709672B (en) * 2020-07-20 2023-04-18 国网黑龙江省电力有限公司 Virtual power plant economic dispatching method based on scene and deep reinforcement learning

Also Published As

Publication number Publication date
CN112381359A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112381359B (en) Multi-critic reinforcement learning power economy scheduling method based on data mining
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN114725936B (en) Power distribution network optimization method based on multi-agent deep reinforcement learning
CN112186743B (en) Dynamic power system economic dispatching method based on deep reinforcement learning
Sun et al. A customized voltage control strategy for electric vehicles in distribution networks with reinforcement learning method
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
Yu et al. Unit commitment using Lagrangian relaxation and particle swarm optimization
US8996185B2 (en) Method for scheduling power generators based on optimal configurations and approximate dynamic programming
CN113511082A (en) Hybrid electric vehicle energy management method based on rule and double-depth Q network
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN112003269A (en) Intelligent on-line control method of grid-connected shared energy storage system
CN115423207A (en) Wind storage virtual power plant online scheduling method and device
CN116468159A (en) Reactive power optimization method based on dual-delay depth deterministic strategy gradient
CN116599151A (en) Source network storage safety management method based on multi-source data
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN104915788B (en) A method of considering the Electrical Power System Dynamic economic load dispatching of windy field correlation
Zhang et al. A cooperative EV charging scheduling strategy based on double deep Q-network and Prioritized experience replay
CN116523327A (en) Method and equipment for intelligently generating operation strategy of power distribution network based on reinforcement learning
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
CN113972645A (en) Power distribution network optimization method based on multi-agent depth determination strategy gradient algorithm
CN117565727A (en) Wireless charging automatic control method and system based on artificial intelligence
CN116914755B (en) Light-storage joint planning method and system considering battery cycle life
CN117060386A (en) Micro-grid energy storage scheduling optimization method based on value distribution depth Q network
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
Chen et al. A Deep Reinforcement Learning-Based Charging Scheduling Approach with Augmented Lagrangian for Electric Vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant