CN112381359A - Multi-critic reinforcement learning power economy scheduling method based on data mining - Google Patents
Multi-critic reinforcement learning power economy scheduling method based on data mining
- Publication number
- CN112381359A (application number CN202011165889.0A)
- Authority
- CN
- China
- Prior art keywords
- critic
- reinforcement learning
- sample
- network
- data mining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/219—Managing data history or versioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which comprises the following steps: S1: converting the multi-period economic scheduling problem of the power system into a Markov decision process; S2: acquiring historical data of the power system, and constructing a multi-critic architecture deep reinforcement learning network according to the Markov decision process; S3: selecting samples from the historical data by using a data mining method; S4: updating the parameters of the multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system; S5: judging whether an iteration end condition is reached; if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system; if not, returning to step S3 to perform the next iteration. The method solves the problem that existing approaches to the power system economic dispatching problem have large errors.
Description
Technical Field
The invention relates to the technical field of power economy scheduling, in particular to a power economy scheduling method based on multi-critic reinforcement learning of data mining.
Background
Efficient economic dispatch management is of great significance for the economic and safe operation of the power system. Existing power system economic dispatching methods can be divided into two categories: classical mathematical methods and artificial intelligence methods. Classical mathematical methods depend heavily on a mathematical model of power system economic dispatching. Because this model is a non-convex stochastic optimization problem that is difficult to solve directly for an optimal solution, classical mathematical methods require certain assumptions to be made on the model and the original stochastic problem to be converted into a deterministic one, and these assumptions introduce large modeling errors. In addition, such methods rely on forecasts of uncertain factors such as renewable energy output, electricity price and load, which are generally difficult to predict, and this also introduces errors into the calculation results. Artificial intelligence methods generally include heuristic algorithms and reinforcement learning algorithms. Heuristic algorithms are generally slow and cannot guarantee convergence. The existing reinforcement learning algorithms used to solve the power system economic dispatching problem are generally value-based, such as the Q-learning algorithm; such methods cannot solve optimization problems containing continuous decision variables, so the decision variables in the original problem must be discretized. If the number of discretization segments is too small, the obtained result deviates greatly from the optimal solution; if the number is large, the solving time increases greatly.
Therefore, the existing methods for solving the economic dispatching problem of the power system have large errors.
In the prior art, for example, the Chinese patent application published on 25 September 2020 under publication number CN111709672A discloses a virtual power plant economic scheduling method based on scenarios and deep reinforcement learning, which uses a deep deterministic policy gradient algorithm to determine the economic scheduling strategy of a virtual power plant (VPP) containing energy storage and a distribution network, so that the VPP operates stably under uncertainty; however, it does not adopt multiple critics to improve algorithm performance, and its error is large.
Disclosure of Invention
The invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, aiming at overcoming the technical defect that the existing methods for solving the electric power system economic dispatching problems have large errors.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a power economy scheduling method based on multi-critic reinforcement learning of data mining comprises the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In the scheme, the utilization efficiency of historical data is enhanced by using a data mining method, the deep reinforcement learning network is improved by using multi-critic, and overestimation deviation generated by an approximate function in the learning process is reduced, so that an optimal decision is made on the economic scheduling problem of the power system.
Preferably, in step S1,
the Markov decision process objective function is:

min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]

the constraint conditions to be met include: AC power flow constraints, unit ramp-up/ramp-down constraints, safe voltage constraints, and energy-storage charging/discharging constraints;

where C_{g,t} represents the cost of generator g during time period t and is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy-storage charging/discharging power P_{bat}.
Preferably, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping between the state variable S and the decision variable A and is denoted μ(S|θ^μ), where θ^μ is the weight parameter of the actor network. Each critic network represents the mapping between the state-decision pair (S, A) and the state-decision value function Q and is denoted Q(S, A|θ^Q), where θ^Q is the weight parameter of the critic network; Q is equal to the expected value of the future total return under the conditions of state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t; where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the energy-storage residual capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy-storage charging/discharging power in time period t, and C_t is the cost of time period t.
Preferably, the actor network is approximated using a four-layer deep neural network.
Preferably, the critic network is approximated using a three-layer deep neural network.
Preferably, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S'), wherein S' represents the post-transition state obtained after making decision A in state S, and r represents the return value obtained in the transition process.
Preferably, in step S3,
the value of each sample is measured using the temporal-difference error σ:

σ = r + Q(S′, A|θ^Q) − Q(S, A|θ^Q)

samples are selected from the historical data by the data mining method according to their value; the probability p_i of selecting sample i is:

p_i = |σ_i| / Σ_j |σ_j|

where σ_i is the temporal-difference error of sample i.
Preferably, in step S4, the update formula for the weight parameter θ^Q of the critic network is:

θ^Q ← θ^Q − α ∇_{θ^Q} (1/M) Σ_{i=1}^{M} ( y_i − min_k Q_k(S_i, A_i|θ^Q) )²

where α is the learning rate; M represents the number of samples drawn from the experience replay pool; y_i represents the target value required for updating the Q value using the temporal-difference error and is obtained from the return under the current decision and the Q value of the next state, y_i = r_i + γ Q[S′_i, μ(S′_i|θ^μ)]; r_i represents the return of sample i; γ is a discount coefficient between 0 and 1 used to adjust the far-sightedness of the algorithm; Q_k denotes the Q value of the k-th critic network; S_i represents the state variable of sample i; μ(S′_i|θ^μ) represents the decision given by the actor network in the post-transition state of sample i; S′_i represents the post-transition state variable of sample i; and A_i represents the decision variable of sample i.
Preferably, in step S4, the weight parameter θ^μ of the actor network is updated by the gradient-ascent method so as to maximize the expected total return J; the update formula is:

∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_A Q(S_i, A|θ^Q)|_{A=μ(S_i|θ^μ)} · ∇_{θ^μ} μ(S_i|θ^μ)

where the expected value of the total return J is approximately represented by the mean over the randomly sampled samples, and μ(S_i|θ^μ) represents the mapping from the state variable of sample i to its decision variable.
Preferably, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration end condition is reached; otherwise, the iteration end condition is not reached.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a multi-critic reinforcement learning electric power economic dispatching method based on data mining, which is characterized in that the utilization efficiency of historical data is enhanced by using the data mining method, a deep reinforcement learning network is improved by using the multi-critic, and overestimation deviation generated by an approximate function in the learning process is reduced, so that an optimal decision is made on the electric power system economic dispatching problem.
Drawings
FIG. 1 is a flow chart of the implementation steps of the technical scheme of the invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a power economy scheduling method based on multi-critic reinforcement learning of data mining includes the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
In the specific implementation process, historical data are obtained by continuously interacting with the power system environment, the utilization efficiency of the historical data is enhanced by using a data mining method, a deep reinforcement learning network is improved by adopting multi-critic, and overestimation deviation generated by an approximation function in the learning process is reduced, so that the optimal decision is made on the economic dispatching problem of the power system.
More specifically, in step S1,
the Markov decision process objective function is:

min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]

the constraint conditions to be met include: AC power flow constraints, unit ramp-up/ramp-down constraints, safe voltage constraints, and energy-storage charging/discharging constraints;

where C_{g,t} represents the cost of generator g during time period t and is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy-storage charging/discharging power P_{bat}.
In a specific implementation, the objective function is to minimize the expectation of the total cost of all time periods (i.e. maximize the expectation of the total return) by selecting a suitable set of decision variables at each time period. Since the probability distribution of the variables such as renewable energy generation, electricity price, and load in the power system is unknown, the state transition probability of the Markov Decision Process (MDP) is also unknown, and thus the problem cannot be directly solved.
More specifically, in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping between the state variable S and the decision variable A and is denoted μ(S|θ^μ), where θ^μ is the weight parameter of the actor network. Each critic network represents the mapping between the state-decision pair (S, A) and the state-decision value function Q and is denoted Q(S, A|θ^Q), where θ^Q is the weight parameter of the critic network; Q is equal to the expected value of the future total return under the conditions of state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t; where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the energy-storage residual capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy-storage charging/discharging power in time period t, and C_t is the cost of time period t.
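As an illustration of the definitions above, the following is a minimal Python sketch of how the state, decision and return variables could be assembled; the function names and the toy G = 2 system are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def make_state(P_prev, O_prev, soc, t):
    """State S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), flattened into one vector."""
    return np.concatenate([P_prev, O_prev, [soc, t]])

def reward(cost_t):
    """Return function r_t = -C_t: lower operating cost gives a higher return."""
    return -cost_t

# toy example with G = 2 generators: powers, on/off states, SOC, time index
s = make_state(np.array([50.0, 80.0]), np.array([1.0, 1.0]), 0.6, 3.0)
# the state vector has G + G + 2 = 6 entries for G = 2
```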
In a specific implementation process, the larger the number of critic networks, the longer the training time. And carrying out optimization management by using a multi-critic architecture deep reinforcement learning network under the condition that an economic dispatching model of the power system is unknown or cannot be directly solved to obtain an optimal economic dispatching strategy of the power system.
More specifically, the actor network is approximated using a four-layer deep neural network.
More specifically, the critic network is approximated using a three-layer deep neural network.
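A minimal numpy sketch of the two approximators described above — a four-layer actor network μ(S|θ^μ) and a three-layer critic network Q(S, A|θ^Q). The hidden-layer widths, tanh activations and dimensions are illustrative assumptions, not the patent's actual architecture.

```python
import numpy as np

def mlp(sizes, rng):
    """Build weight/bias pairs for a fully connected network with the given layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: tanh on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

rng = np.random.default_rng(0)
state_dim, action_dim = 6, 3

# four layers of units (input, two hidden, output) for the actor
actor = mlp([state_dim, 64, 64, action_dim], rng)
# three layers of units (input, one hidden, output) for each critic
critic = mlp([state_dim + action_dim, 64, 1], rng)

a = forward(actor, np.zeros(state_dim))                 # decision A = mu(S)
q = forward(critic, np.zeros(state_dim + action_dim))   # value Q(S, A)
```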
More specifically, the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S'), wherein S' represents the post-transition state obtained after making decision A in state S, and r represents the return value obtained in the transition process.
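The experience replay pool described above can be sketched as a simple bounded buffer of (S, A, r, S') transitions; the capacity, class name and API are assumptions for illustration.

```python
import collections

class ReplayPool:
    """Bounded pool of (S, A, r, S') transitions; oldest entries are evicted when full."""

    def __init__(self, capacity=10000):
        self.buffer = collections.deque(maxlen=capacity)

    def store(self, S, A, r, S_next):
        self.buffer.append((S, A, r, S_next))

    def __len__(self):
        return len(self.buffer)

pool = ReplayPool(capacity=3)
for k in range(5):
    pool.store(k, 0, -1.0, k + 1)   # transitions 0 and 1 are evicted
```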
More specifically, in step S3,
the value of each sample is measured using the temporal-difference error σ:

σ = r + Q(S′, A|θ^Q) − Q(S, A|θ^Q)

samples are selected from the historical data by the data mining method according to their value; the probability p_i of selecting sample i is:

p_i = |σ_i| / Σ_j |σ_j|

where σ_i is the temporal-difference error of sample i.
In the specific implementation process, samples are drawn from the experience replay pool by random sampling. Meanwhile, different samples have different values, so samples of higher value should be selected with higher probability when updating the weight parameters of the actor network and the critic networks; this makes full use of the historical data and speeds up the algorithm. If the temporal-difference error of a sample is large, the difference between the current Q value and the target Q value is still large, so samples with larger temporal-difference errors should be used preferentially to update the weight parameters of the critic networks. The probability formula makes samples with larger temporal-difference errors more likely to be selected.
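The selection rule described above — samples with larger temporal-difference error chosen with proportionally higher probability — can be sketched as follows; the small eps term is an added assumption that keeps every sample selectable, and the exact weighting in the patent may differ.

```python
import numpy as np

def selection_probs(td_errors, eps=1e-6):
    """Probability of drawing each sample, proportional to |sigma_i|."""
    prio = np.abs(td_errors) + eps   # eps avoids zero-probability samples
    return prio / prio.sum()

sigma = np.array([0.1, 0.4, 0.5])    # temporal-difference errors of three samples
p = selection_probs(sigma)

# draw a mini-batch of 2 indices according to these probabilities
idx = np.random.default_rng(0).choice(len(sigma), size=2, p=p)
```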
More specifically, in step S4, the update formula for the weight parameter θ^Q of the critic network is:

θ^Q ← θ^Q − α ∇_{θ^Q} (1/M) Σ_{i=1}^{M} ( y_i − min_k Q_k(S_i, A_i|θ^Q) )²

where α is the learning rate; M represents the number of samples drawn from the experience replay pool; y_i represents the target value required for updating the Q value using the temporal-difference error and is obtained from the return under the current decision and the Q value of the next state, y_i = r_i + γ Q[S′_i, μ(S′_i|θ^μ)]; r_i represents the return of sample i; γ is a discount coefficient between 0 and 1 used to adjust the far-sightedness of the algorithm; Q_k denotes the Q value of the k-th critic network; S_i represents the state variable of sample i; μ(S′_i|θ^μ) represents the decision given by the actor network in the post-transition state of sample i; S′_i represents the post-transition state variable of sample i; and A_i represents the decision variable of sample i.
In the implementation, the weight parameter θ^Q of the critic network is updated by minimizing the temporal-difference error with the gradient-descent method. In practical applications, however, the Q value approximated by a single critic network is often larger than the true Q value, so the minimum Q value over the k critic networks is used for the update. This significantly reduces the overestimation error in the Q-value approximation, and hence the error introduced by the Q-value function when deep reinforcement learning is used to solve the power system economic dispatching problem. The number of critic networks is at least two.
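The minimum-over-critics target described above can be sketched as follows; the discount factor, return and dummy critic outputs are illustrative values, not from the patent.

```python
def target_value(r, gamma, next_qs):
    """Multi-critic target y = r + gamma * min_k Q_k(S', mu(S')).

    Taking the minimum over the critics' next-state estimates counteracts the
    overestimation bias that a single critic tends to exhibit.
    """
    return r + gamma * min(next_qs)

# three critics give different estimates of the next-state value;
# the smallest one (8.5) enters the target, not the average
y = target_value(r=-2.0, gamma=0.95, next_qs=[10.0, 8.5, 9.2])
```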
More specifically, in step S4, the weight parameter θ^μ of the actor network is updated by the gradient-ascent method so as to maximize the expected total return J; the update formula is:

∇_{θ^μ} J ≈ (1/M) Σ_{i=1}^{M} ∇_A Q(S_i, A|θ^Q)|_{A=μ(S_i|θ^μ)} · ∇_{θ^μ} μ(S_i|θ^μ)

where the expected value of the total return J is approximately represented by the mean over the randomly sampled samples, and μ(S_i|θ^μ) represents the mapping from the state variable of sample i to its decision variable.
In the specific implementation process, the weight parameters of the operator network and the criticc network are continuously updated in the process that the agent and the environment continuously interact, and finally the optimal operator network, namely the optimal economic dispatching strategy of the power system is obtained.
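As a toy illustration of the gradient-ascent update on θ^μ, the following sketch maximizes a stand-in scalar objective J(θ); the quadratic J, its gradient and the learning rate are assumptions, not the patent's actual return function.

```python
def ascend(theta, grad_J, lr=0.1):
    """One gradient-ASCENT step: theta <- theta + lr * grad J(theta)."""
    return theta + lr * grad_J(theta)

# stand-in objective, maximized at theta = 3
J = lambda th: -(th - 3.0) ** 2
grad_J = lambda th: -2.0 * (th - 3.0)

theta = 0.0
for _ in range(100):
    theta = ascend(theta, grad_J)
# theta climbs toward the maximizer of J
```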
More specifically, in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
In a specific implementation, the reward value is the negative of the total cost of 24 time periods.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
Claims (10)
1. A power economy scheduling method based on multi-critic reinforcement learning of data mining is characterized by comprising the following steps:
s1: converting the multi-period economic scheduling problem of the power system into a Markov decision process;
s2: acquiring historical data of the power system, and constructing a multi-critic framework deep reinforcement learning network according to a Markov decision process;
s3: selecting a sample from the historical data by using a data mining method;
s4: updating parameters of a multi-critic architecture deep reinforcement learning network by using the samples to obtain an optimized economic dispatching strategy of the power system;
s5: judging whether an iteration end condition is reached;
if so, ending the iteration to obtain an optimal economic dispatching strategy of the power system;
if not, the process returns to step S3 to perform the next iteration.
2. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 1, wherein in step S1,
the Markov decision process objective function is:

min E[ Σ_{t=1}^{T} ( Σ_{g=1}^{G} C_{g,t} + C_{ESS,t} ) ]

the constraint conditions to be met include: AC power flow constraints, unit ramp-up/ramp-down constraints, safe voltage constraints, and energy-storage charging/discharging constraints;

where C_{g,t} represents the cost of generator g during time period t and is related to the generator power P_g and the generator start-stop state O_g; T is the total number of time periods; G is the total number of generators; C_{ESS,t} represents the cost of charging and discharging the stored energy during time period t and is related to the energy-storage charging/discharging power P_{bat}.
3. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 2, wherein in step S2, the agent of the multi-critic architecture deep reinforcement learning network consists of an actor network and a plurality of critic networks. The actor network represents the mapping between the state variable S and the decision variable A and is denoted μ(S|θ^μ), where θ^μ is the weight parameter of the actor network. Each critic network represents the mapping between the state-decision pair (S, A) and the state-decision value function Q and is denoted Q(S, A|θ^Q), where θ^Q is the weight parameter of the critic network; Q is equal to the expected value of the future total return under the conditions of state S and decision A. Let the state variable of time period t be S_t = (P_{g,t-1}, O_{g,t-1}, SOC_t, t), the decision variable of time period t be A_t = (O_t, P_{bat,t}), and the return function of time period t be r_t = -C_t; where P_{g,t-1} is the generator power in time period t-1, O_{g,t-1} is the generator start-stop state in time period t-1, SOC_t is the energy-storage residual capacity in time period t, O_t is the generator start-stop state in time period t, P_{bat,t} is the energy-storage charging/discharging power in time period t, and C_t is the cost of time period t.
4. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 3, wherein the actor network is approximated using a four-layer deep neural network.
5. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 3, wherein the critic network is approximated using a three-layer deep neural network.
6. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 3, wherein the multi-critic architecture deep reinforcement learning network further comprises an experience replay pool for storing historical data (S, A, r, S '), wherein S' represents a post-transition state obtained after making decision A in state S, and r represents a return value obtained during the transition.
7. The power economy dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 6, wherein in step S3, the value of a sample is measured by its temporal-difference (TD) error σ:

σ = r + Q(S′, A|θ^Q) − Q(S, A|θ^Q)

A sample is then selected from the historical data by a data mining method according to its value: the selection probability p_i of sample i is taken proportional to the magnitude of its TD error, wherein σ_i is the TD error of sample i.
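The TD-error-weighted selection in claim 7 can be sketched as follows. The normalized formula p_i = |σ_i| / Σ_j |σ_j| is the standard prioritized-sampling choice implied by the claim; the patent's exact expression is not reproduced in the extracted text, so treat this as an assumption.

```python
import random

def sample_probabilities(td_errors):
    """Selection probability p_i proportional to |sigma_i|.
    Normalization over the pool is an assumed (standard) choice."""
    mags = [abs(s) for s in td_errors]
    total = sum(mags)
    return [m / total for m in mags]

def draw_minibatch(pool, td_errors, m, rng=random):
    """Draw m samples from the experience replay pool, weighted by
    the magnitude of each sample's TD error."""
    probs = sample_probabilities(td_errors)
    return rng.choices(pool, weights=probs, k=m)
```

Samples whose current Q estimate is furthest from the bootstrapped target are thus replayed more often, which is the data-mining step's intent.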
8. The method for dispatching power economy based on multi-critic reinforcement learning of data mining as claimed in claim 6, wherein in step S4, the weight parameter θ^Q of each critic network is updated over a minibatch, where M is the number of samples drawn from the experience replay pool and y_i is the target value required for updating the Q value by the TD error, obtained from the reward of the current decision and the Q value of the next state: y_i = r_i + γQ[S′_i, μ(S′_i|θ^μ)]. Here r_i is the return of sample i; γ is a discount coefficient between 0 and 1, used to adjust the far-sightedness of the algorithm; Q_k is the Q value of the k-th critic network; S_i is the state variable of sample i; μ(S′_i|θ^μ) is the mapping from the post-transition state of sample i to its decision; S′_i is the post-transition state variable of sample i; and A_i is the decision variable of sample i.
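The target and minibatch objective in claim 8 can be sketched as below. The mean-squared TD error is the standard critic objective consistent with the quantities the claim defines; the patent's exact update formula is not reproduced in the extracted text, and how the multiple critics' Q_k values are combined is not specified here, so both are assumptions.

```python
def td_target(r_i, gamma, q_next):
    """Target y_i = r_i + gamma * Q(S'_i, mu(S'_i|theta^mu)).
    gamma in (0, 1) trades off immediate against future return."""
    return r_i + gamma * q_next

def critic_loss(targets, q_values):
    """Mean-squared TD error over a minibatch of M samples (assumed
    standard critic objective; one such loss per critic network)."""
    m = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / m
```

A gradient step on this loss with respect to θ^Q (e.g. via an autodiff framework) then realizes the critic update.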
9. The power economic dispatching method based on multi-critic reinforcement learning of data mining as claimed in claim 6, wherein in step S4, the weight parameter θ^μ of the actor network is updated by gradient ascent so as to maximize the expected total return J, where the expected value of the total return J is approximately represented by the mean over the randomly sampled minibatch, and μ(S_i|θ^μ) represents the mapping from the state variable of sample i to its decision variable.
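The actor objective in claim 9 can be sketched as below: J(θ^μ) is estimated as the minibatch mean of Q(S_i, μ(S_i|θ^μ)), and the actor weights are moved up the gradient of J. The gradient itself would come from automatic differentiation in practice; here only the estimate and the ascent step are shown, with hypothetical names.

```python
def estimate_J(q_of, actor, states):
    """Approximate J(theta^mu) as the minibatch mean of Q(S_i, mu(S_i)).
    q_of(s, a) and actor(s) stand in for the critic and actor networks."""
    return sum(q_of(s, actor(s)) for s in states) / len(states)

def ascend(theta, grad_J, lr=1e-3):
    """One gradient-ASCENT step on the actor weights (maximizing J,
    hence '+' rather than the '-' of gradient descent)."""
    return [w + lr * g for w, g in zip(theta, grad_J)]
```

For instance, with a toy critic `q_of = lambda s, a: s + a` and actor `lambda s: 2 * s`, `estimate_J` over states `[1.0, 3.0]` averages 3.0 and 9.0 to give 6.0.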
10. The method according to claim 3, wherein in step S5, when the return value of the economic dispatching strategy of the power system exceeds the Q value, an iteration ending condition is reached; otherwise, the iteration end condition is not reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011165889.0A CN112381359B (en) | 2020-10-27 | 2020-10-27 | Multi-critic reinforcement learning power economy scheduling method based on data mining |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112381359A true CN112381359A (en) | 2021-02-19 |
CN112381359B CN112381359B (en) | 2021-10-26 |
Family
ID=74577371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011165889.0A Active CN112381359B (en) | 2020-10-27 | 2020-10-27 | Multi-critic reinforcement learning power economy scheduling method based on data mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112381359B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784440A (en) * | 2017-10-23 | 2018-03-09 | 国网辽宁省电力有限公司 | A kind of power information system resource allocation system and method |
CN109934332A (en) * | 2018-12-31 | 2019-06-25 | 中国科学院软件研究所 | The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends |
CN110929948A (en) * | 2019-11-29 | 2020-03-27 | 上海电力大学 | Fully distributed intelligent power grid economic dispatching method based on deep reinforcement learning |
CN111242443A (en) * | 2020-01-06 | 2020-06-05 | 国网黑龙江省电力有限公司 | Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet |
CN111709672A (en) * | 2020-07-20 | 2020-09-25 | 国网黑龙江省电力有限公司 | Virtual power plant economic dispatching method based on scene and deep reinforcement learning |
CN111725836A (en) * | 2020-06-18 | 2020-09-29 | 上海电器科学研究所(集团)有限公司 | Demand response control method based on deep reinforcement learning |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115118477A (en) * | 2022-06-22 | 2022-09-27 | 四川数字经济产业发展研究院 | Smart grid state recovery method and system based on deep reinforcement learning |
CN115118477B (en) * | 2022-06-22 | 2024-05-24 | 四川数字经济产业发展研究院 | Smart grid state recovery method and system based on deep reinforcement learning |
CN115775081A (en) * | 2022-12-16 | 2023-03-10 | 华南理工大学 | Random economic dispatching method, device and medium for power system |
CN115775081B (en) * | 2022-12-16 | 2023-10-03 | 华南理工大学 | Random economic scheduling method, device and medium for electric power system |
CN117200184A (en) * | 2023-08-10 | 2023-12-08 | 国网浙江省电力有限公司金华供电公司 | Virtual power plant load side resource multi-period regulation potential evaluation prediction method |
CN117200184B (en) * | 2023-08-10 | 2024-04-09 | 国网浙江省电力有限公司金华供电公司 | Virtual power plant load side resource multi-period regulation potential evaluation prediction method |
Also Published As
Publication number | Publication date |
---|---|
CN112381359B (en) | 2021-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111884213B (en) | Power distribution network voltage adjusting method based on deep reinforcement learning algorithm | |
CN112381359B (en) | Multi-critic reinforcement learning power economy scheduling method based on data mining | |
CN112186743B (en) | Dynamic power system economic dispatching method based on deep reinforcement learning | |
Sun et al. | A customized voltage control strategy for electric vehicles in distribution networks with reinforcement learning method | |
CN112117760A (en) | Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning | |
Yu et al. | Unit commitment using Lagrangian relaxation and particle swarm optimization | |
CN112131733B (en) | Distributed power supply planning method considering influence of charging load of electric automobile | |
CN113511082A (en) | Hybrid electric vehicle energy management method based on rule and double-depth Q network | |
CN110070292B (en) | Micro-grid economic dispatching method based on cross variation whale optimization algorithm | |
CN112491094B (en) | Hybrid-driven micro-grid energy management method, system and device | |
CN113872213B (en) | Autonomous optimization control method and device for power distribution network voltage | |
CN116523327A (en) | Method and equipment for intelligently generating operation strategy of power distribution network based on reinforcement learning | |
CN114784823A (en) | Micro-grid frequency control method and system based on depth certainty strategy gradient | |
CN116468159A (en) | Reactive power optimization method based on dual-delay depth deterministic strategy gradient | |
CN104915788B (en) | A method of considering the Electrical Power System Dynamic economic load dispatching of windy field correlation | |
CN113972645A (en) | Power distribution network optimization method based on multi-agent depth determination strategy gradient algorithm | |
CN115345380A (en) | New energy consumption electric power scheduling method based on artificial intelligence | |
CN117060386A (en) | Micro-grid energy storage scheduling optimization method based on value distribution depth Q network | |
CN117833285A (en) | Micro-grid energy storage optimization scheduling method based on deep reinforcement learning | |
CN114204546B (en) | Unit combination optimization method considering new energy consumption | |
CN117117989A (en) | Deep reinforcement learning solving method for unit combination | |
Chen et al. | A Deep Reinforcement Learning-Based Charging Scheduling Approach with Augmented Lagrangian for Electric Vehicle | |
CN114048576B (en) | Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid | |
CN116995645A (en) | Electric power system safety constraint economic dispatching method based on protection mechanism reinforcement learning | |
CN115829258A (en) | Electric power system economic dispatching method based on polynomial chaotic approximate dynamic programming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||