CN117808174B - Micro-grid operation optimization method and system based on reinforcement learning under network attack - Google Patents

Micro-grid operation optimization method and system based on reinforcement learning under network attack Download PDF

Info

Publication number
CN117808174B
CN117808174B CN202410231339.6A CN202410231339A CN117808174B CN 117808174 B CN117808174 B CN 117808174B CN 202410231339 A CN202410231339 A CN 202410231339A CN 117808174 B CN117808174 B CN 117808174B
Authority
CN
China
Prior art keywords
constraint
representing
micro
intelligent agent
power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410231339.6A
Other languages
Chinese (zh)
Other versions
CN117808174A (en
Inventor
刘帅
王昊晨
王小文
徐昊天
刘龙成
赵浩然
华友情
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202410231339.6A priority Critical patent/CN117808174B/en
Publication of CN117808174A publication Critical patent/CN117808174A/en
Application granted granted Critical
Publication of CN117808174B publication Critical patent/CN117808174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of power system automation, and provides a micro-grid operation optimization method and system based on reinforcement learning under network attack, wherein the technical scheme is as follows: and decomposing the rewarding function of the intelligent agent into a plurality of sub rewarding functions and performing independent optimization, so that each sub rewarding function reaches the pareto optimum, and the local optimum of a single rewarding function is avoided. And if the action meets the index, feeding back an additional rewarding value to the agent, so that the agent tends to select the action meeting the set index. Under the condition that external network attack exists, the honey server is constructed as a safety protection measure, so that the honey server plays a role in protection, minimizes attack loss and has a certain practical significance.

Description

Micro-grid operation optimization method and system based on reinforcement learning under network attack
Technical Field
The invention belongs to the technical field of power system automation, and particularly relates to a micro-grid operation optimization method and system based on reinforcement learning under network attack.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The micro-grid is a small power generation and distribution network and consists of a plurality of independent energy systems such as a generator, a load, an energy storage device, a control device and the like. With the popularization of clean energy and distributed generation, the scale and complexity of micro-grid systems are gradually increasing. In the operation process of the micro-grid system, a plurality of targets such as economic cost reduction, safe operation assurance, environmental pollution reduction and the like need to be considered at the same time, and how to realize multi-target optimization of the micro-grid becomes a problem to be solved urgently.
In recent years, with the rapid development of artificial intelligence technology, deep reinforcement learning is increasingly applied to the field of micro-grids. By constructing proper state space, action space and rewarding function, the intelligent agent can interact with the environment and learn to the optimal strategy through multiple experiments and feedback, and the method provides an intelligent solution for the operation and management of the micro-grid system, so that the multi-objective optimization of the micro-grid can be realized.
However, the conventional deep reinforcement learning method has a certain limitation in solving the multi-objective optimization problem. First, it treats multiple targets as a whole and feeds back a whole rewarding function to the agent, which may cause the agent to converge to a locally optimal solution without guaranteeing that each target is optimal; second, the agent may sacrifice the value of some sub-bonus functions in pursuing the maximization of the overall bonus function, which may have serious consequences in the real world. In addition, the micro-grid may be attacked by external network in actual operation, and there is a risk of information leakage.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a reinforcement learning-based micro-grid operation optimization method and system under network attack, which can realize multi-objective optimization of the micro-grid under the condition that external network attack exists.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first aspect of the present invention provides a reinforcement learning-based micro-grid operation optimization method under network attack, including the steps of:
Describing an objective function and constraint conditions in a micro-grid environment;
Constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment, and setting a state space, an action space and a reward function of an intelligent agent;
decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, wherein each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;
Whether the action selected by the agent meets the relevant criteria is part of the sub-rewarding function, thereby making the agent prone to selecting actions that meet the set criteria.
A second aspect of the present invention provides a reinforcement learning-based microgrid operation optimization system under network attack, comprising:
The power grid environment describing module is used for describing an objective function and constraint conditions in the micro-grid environment;
The intelligent agent setting module is used for constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment and setting a state space, an action space and a rewarding function of the intelligent agent;
The operation optimization module is used for decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, and each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;
And an optimization output module for determining whether the selected action of the agent satisfies the associated index as part of the sub-rewarding function, thereby causing the agent to tend to select an action that satisfies the set index.
Compared with the prior art, the invention has the beneficial effects that:
1. According to the invention, the rewarding function of the intelligent agent is decomposed into a plurality of sub rewarding functions and is independently optimized, so that each sub rewarding function reaches the pareto optimum, and the local optimum of a single rewarding function is avoided.
2. The invention takes whether the action selected by the agent meets the related index as a part of the sub-rewarding function, so that the agent tends to select the action meeting the set index, and the sacrifice of the value of the sub-rewarding function in the process of pursuing the maximization of the whole rewarding function by the agent is avoided.
3. Under the condition that external network attack exists, the honey pot server is constructed to serve as a safety protection measure, so that the honey pot server has a protection function, minimizes attack loss and has a certain practical significance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a reinforcement learning-based micro-grid operation optimization method under network attack provided by an embodiment of the invention;
FIG. 2 is a flow chart of sub-bonus function optimization using reinforcement learning provided by an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
As shown in fig. 1, the embodiment provides a reinforcement learning-based micro-grid operation optimization method under network attack, which includes the following steps:
S101: describing an objective function and constraint conditions in a micro-grid environment;
In S101, for a micro-grid system including a distributed power source, a power storage device, an electric energy conversion device and an electric load, with the minimum economic cost, the minimum safe cost and the minimum environmental protection cost within a set time as objective functions, constraints such as a micro-grid power supply and demand balance constraint, a distributed power source output power constraint, a power storage device output power constraint and a capacity constraint are comprehensively considered, and a micro-grid multi-objective optimization model based on reinforcement learning is constructed.
Wherein, the economic cost objective function is:
In the method, in the process of the invention, Representing the cost of the micro-grid to exchange power with the main grid,/>Representing network loss in the course of power transfer,/>Representing electricity purchase price of micro-grid to main grid,/>Representing the amount of electricity purchased; representing the power generation cost of a distributed power supply,/> Representing the power generation cost factor of the distributed power supply,Representing active power of the distributed power supply; /(I)Representing micro grid internal scheduling costs,/>Indicating the unit price of the dispatch,Representing the/>, in a microgridResponse power of individual demand response block,/>Represents the/>A binary variable to which the individual demand response block responds; /(I)Representing the electricity generation cost of the electricity storage equipment,/>Representing the coefficient of the electricity generation cost of the electricity storage device,Representing the generated power of the electricity storage device.
The safety cost objective function is:
In the method, in the process of the invention, Representing the actual voltage and the voltage reference value, respectively, during the operation of the microgrid,/>Representing power loss of a microgrid system,/>Representing the voltage dependence coefficient,/>Representing the nominal voltage at the point of common coupling.
The environmental protection cost objective function is:
In the method, in the process of the invention, Represents the set of atmospheric pollutants, parameter/>Represents the/>Punishment coefficient of atmospheric pollutants,/>Represents the/>Atmospheric pollutant discharge rate.
Combining the economic cost objective function, the safety cost objective function and the environment-friendly cost objective function to obtain an objective function;
the constraint conditions specifically include:
power supply and demand balance constraint:
Voltage constraint:
power constraint of distributed power supply:
power constraint of the power storage device:
Electric quantity constraint of electric storage equipment:
Updating the electric quantity of the energy storage device:
In the method, in the process of the invention, Representing the active power output by a photovoltaic generator,/>The power consumption of residents is represented;、/> Respectively representing the minimum value and the maximum value of the allowable voltage; /(I) 、/>Respectively representing the minimum value and the maximum value of the output power of the distributed power supply; /(I)、/>Respectively representing the minimum value and the maximum value of the output power of the power storage equipment; /(I)、/>Respectively representing the minimum value and the maximum value of the electricity quantity of the electricity storage equipment; /(I)、/>Respectively representing the charging efficiency and the discharging efficiency of the electricity storage equipment,/>、/>Respectively representing the charge power and the discharge power,、/>Respectively, the charge time and the discharge time.
S102: constructing a micro-grid environment based on reinforcement learning, and setting a state space, an action space and a reward function of an agent, wherein the method specifically comprises the following steps of:
S2011: building the state space observed by the intelligent agent: the state space observed by the intelligent agent in the micro-grid environment comprises 、/>、/>And/>Expressed as:
In the method, in the process of the invention, Representing a set of states.
S2012: constructing an action space of the intelligent body: the action space of the intelligent body comprises、/>、/>Expressed as:
In the method, in the process of the invention, Representing a set of actions that satisfy the constraint.
S2013: constructing a reward function of the intelligent agent in the training process by combining the observed state space of the intelligent agent and the action space of the intelligent agent: when the intelligent agent is in stateAction taken/>Rewards returned to the agent by the environment; Namely:
In the method, in the process of the invention, Representing the sum of economic, safe and environmental costs, i.e./>,/>Representing Lagrange penalty factor,/>、/>、/>、/>、/>And respectively representing cost functions corresponding to the power supply and demand balance constraint, the voltage constraint, the distributed power supply power constraint, the power constraint of the power storage equipment and the capacity constraint.
S103: and solving an optimization problem by utilizing DDPG algorithm, decomposing the rewarding function of the intelligent agent according to the target quantity, and independently optimizing each sub rewarding function by a critic neural network.
The method specifically comprises the following steps: training is performed based on the agent established in S102, and the idea of target decomposition is used in DDPG algorithm of reinforcement learning. The whole rewarding function is decomposed into M sub rewarding functions, and each sub rewarding function is optimized by adopting an independent critic neural network, so that the best pareto solution of each sub rewarding function can be guaranteed, and the local optimization of the rewarding function is avoided. Decomposing the bonus function according to the goal, namely:
the algorithm DDPG is only aimed at one total target based on the improvement of DDPG algorithm in the traditional reinforcement learning, after actor neural network selection action, a total reward function is returned to the intelligent agent at each moment, and only one critic neural network is used for evaluation.
Therefore, in this embodiment, the total target is decomposed into M sub-targets, and for each sub-target, the agent obtains a sub-reward function at each moment, and each sub-reward function is evaluated by a critic neural network, so that the sum of obtained reward functions is higher than that of optimizing only the whole reward function, thereby improving the strategy of the agent.
The specific optimization algorithm process comprises the following steps:
Step 1: initializing actor neural network parameters And/>Parameters of critic neural network/>Initializing target network parameters/>、/>Initializing an empirical playback buffer/>, for storing training dataAnd parameters T and T;
step 2: initializing environment and obtaining initial state of agent
Step 3: the intelligent body observes the environmental stateActor neural network generates corresponding actions according to the current strategyAdding noise to promote action exploration;
Step 4: performing index analysis on actions taken by the agent:
Performing the selected action Observation of rewards of environmental feedback/>Decomposition/>Sub-prize value/>And the state of the next moment/>
Step 7: the obtained current state vector setStore to experience playback buffer/>In (a) and (b);
step 8: randomly selecting experience pool Training a set of data comprising: random sampling/>, when an empirical playback buffer stores more than a certain amount of dataEmpirical data/>; Calculation of target/>, using sampled empirical dataValue/>Wherein/>Representing a discount factor; by minimizing/>The loss function of value updates critic the parameters of the network: /(I); Using the outputs of critic networks, the policy gradients of actor networks are calculated and the parameters of actor networks are updated by gradient ascent:
step 9: a soft update is made to the target network, ,/>Wherein/>Representing soft update parameters.
S104: whether the action selected by the agent meets the relevant criteria is part of the sub-rewarding function, thereby making the agent prone to selecting actions that meet the set criteria.
In this embodiment, the optimization is performed based on the DDPG algorithm after the improvement in S103, and whether the action selected by the agent meets the relevant index is used as a part of the sub-rewarding function, if the action meets the index, the additional rewarding value is fed back to the agent, so that the agent tends to select the action meeting the set index.
In particular, the micro-grid needs to be compatible with reducing economic cost, ensuring safe operation of the distributed power supply and reducing environmental pollution in the actual operation process. In order to prevent the intelligent agent from pursuing the maximum overall rewarding function and neglecting individual indexes in the training process, thereby causing huge loss in the actual running process, the indexes such as economy, safety, environmental protection and the like need to be subjected to condition setting, and the specific expression is as follows:
Economic index:
In the method, in the process of the invention, Representing the maximum acceptable economic cost for the microgrid. On the basis of meeting the constraint condition, when the economic cost/>Less than/>At this time, the economic cost of the micro-grid is low, and the environment is fed back to the intelligent agent as an extra reward/>; When economic cost/>Not less than/>When the economic cost required by the micro-grid to meet the constraint condition is too high, the environment does not provide additional rewards.
The economic cost objective function actually obtained by the intelligent agent is as follows:
In the method, in the process of the invention, Representing the economic cost actually obtained after the agent takes the action,/>Indicating a decision on the action taken by the agent and feeding back an additional prize value.
Safety index:
In the method, in the process of the invention, Indicating that the microgrid is acceptable/>And/>Maximum threshold of gap. When the voltage of the micro grid/>And reference voltage/>The difference between (2) is less than/>At the moment, the running of the micro-grid is safe and reliable, and the environment is fed back to the intelligent agent for giving extra rewards/>; When the voltage of the micro-grid is too large from the reference voltage, the operation of the micro-grid has potential safety hazards, so that no additional rewarding value is provided.
The actual safety cost objective function obtained by the intelligent agent is as follows:
In the method, in the process of the invention, Representing the actual cost of security obtained after an agent takes an action,/>Indicating a decision on the action taken by the agent and feeding back an additional prize value.
Environmental protection indexes:
In the method, in the process of the invention, Representation/>And/>A maximum threshold value of the sum. When/>And/>The sum is less than the thresholdWhen the micro-grid generates less air pollutants, the environment is fed back to the intelligent agent to give an extra reward/>Otherwise, the micro-grid is considered to emit more air pollutants during operation, so that a reward value is not provided.
The actual environmental protection cost objective function that obtains of intelligent agent is:
In the method, in the process of the invention, Representing the actual cost of security obtained after an agent takes an action,/>Indicating a decision on the action taken by the agent and feeding back an additional prize value.
S105: and comprehensively considering the external network attack existing in the actual environment, and constructing the honey server on the basis of the real server to reduce the loss caused by information leakage.
Based on the algorithm optimized in the step S104, the honey pot server is set on the basis of the original real server by considering the external network attack existing in the actual environment, and the behavior and response of the real server are simulated to isolate an attacker from the real server, so that the functions of protecting the privacy of the micro-grid and preventing data leakage are achieved. By recording the information of the attack and the attacker, protection of the system and analysis and tracking of the attack can be provided.
The method specifically comprises the following steps:
s501: modeling both network attack and defense parties, specifically comprising:
constructing a set of offensive and defensive game participants Wherein/>Representing an attacker among the network attack and defense participants; /(I)Is an defender, namely a micro-grid, among network attack and defense participants.
Constructing attack and defense game participant policy setsWherein/>A set of policies representing an attacker; /(I)Representing a set of policies for the microgrid.
Constructing a set of defender actual server typesWherein/>、/>Representing the real server and honeypot server types, respectively.
Constructing server type signal sets released by defendersWherein/>、/>The server type signals respectively representing the release of the defender are a real server and a honey server. In the attack and defense process, the signals released by the defender are not necessarily the same as the actual server types, so that the judgment of the attacker is interfered, and the effect of resisting external attack is improved.
Constructing an attack mode set of an attackerWherein/>Representing a direct attack,/>The representative detects the type of defender server first and then attacks, differing in whether the attacker detects the attacking server in advance. If the detected counterpart is a honey server, the attacker gives up the attack, otherwise, the attack is continued, but the risk of failure in detection exists.
S502, analyzing the profit quantization of different participants under the honey-free server and the profit quantization in the honey dynamic game based on the constructed network attack and defense model;
s5021, quantifying the benefits of different participants under the honey-free server comprises the following steps:
The defender policy expects benefits:
In the method, in the process of the invention, Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>The income obtained by the defender is large; /(I)Representing the loss of system after defending against an attack, wherein/>Representing the fatality of an attacker launching an attack strategy,/>Representing an integrity cost,/>The weight of the integrity is represented as such,Representing privacy costs,/>Representing privacy weighting,/>Representing availability cost,/>Representing availability weights; /(I)Representing the cost spent by an defender against an attacker's attack, where/>Representing the operational cost of a defensive strategy,/>Indicating the inherent harm the attacker has to the defender due to the aggressive action taken by the attacker,Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>The extent of damage to the defender.
Attacker policy expects benefits:
In the method, in the process of the invention, Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>Afterwards, the size of the income obtained by the attacker; /(I)Representing that an attacker adopts an attack strategy/>Is not limited by the cost of (a).
S5022, profit quantization in the honey dynamic game comprises profit quantization of direct attack by an attacker and profit quantization of attack after detection by the attacker; the method comprises the steps of quantifying the benefits of attacking the honeypot server and quantifying the benefits of attacking the real server in each case;
wherein when an attacker directly attacks, when the attacker attacks the honey server,
The defender benefits are:
The attacker benefits are:
In the method, in the process of the invention, Is a honey factor and represents the capacity of honey server deception; /(I)Representing benefits obtained by defenders through monitoring and detecting the attack behaviors of the other parties by the honey pot server; /(I)Indicating that the defender releases camouflage costs different from the actual server signal; /(I)The attack strategy initiated by the attacker is utilized by the honeypot server of the defender, and the attack cost caused by partial information of the attacker is revealed.
When an attacker directly attacks, and when the attacker attacks the real server,
The defender benefits are:
The attacker benefits are:
Wherein, when an attacker detects and then attacks the honey server,
The defender benefits are:
The attacker benefits are:
In the method, in the process of the invention, Representing the probability that an attacker correctly detects the defender server type,/>Representing the cost of detection of an attacker.
Wherein, when an attacker detects and attacks the real server,
The defender benefits are:
The attacker benefits are:
Before attacking the server of the defender, the attacker finds that the server is a real server through the detection technology of the attacker, and then starts the attack, so that a certain benefit is obtained, and meanwhile, the defender is lost.
By comparing the profit quantization of different participants under the honey-free server with the profit quantization in the honey dynamic game, it can be seen that if the honey server is not arranged, an attacker can certainly launch an attack on the real server of the defender, so that the defender loses such as data leakage. By setting the honey server, the defender releases false information and acquires the attack strategy of an attacker, so that the privacy of the defender is protected, and a certain defending effect is achieved in an actual environment.
S5023, setting a honey server on the basis of an original real server based on the analysis conclusion, and simulating the behavior and response of the real server, wherein the honey server specifically comprises the following steps:
S50231: the type of honeypot set is determined, e.g., network honeypot, system honeypot, application honeypot, etc.
S50232: according to the selected honey type, proper honey software is selected to simulate various services and operating systems, so that the honey server is more real and reliable.
S50233: and installing selected software and configuring the honey pot environment on an independent server or virtual machine to ensure that the honey pot environment can normally communicate with an external network.
S50234: configuring the honey server includes installing common services, opening ports, manufacturing false information, etc.
S50235: and setting a honey server, monitoring all accesses and interactions, recording the behavior of an attacker, analyzing intrusion attempts and taking corresponding measures.
S50236: the software and configuration of the honeypot server is updated periodically to ensure that it remains up-to-date to accommodate the changing threat environment.
Example two
The embodiment provides a micro-grid operation optimization system based on reinforcement learning under network attack, which comprises:
The power grid environment describing module is used for describing an objective function and constraint conditions in the micro-grid environment;
The intelligent agent setting module is used for constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment and setting a state space, an action space and a rewarding function of the intelligent agent;
The operation optimization module is used for decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, and each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;
And an optimization output module for determining whether the selected action of the agent satisfies the associated index as part of the sub-rewarding function, thereby causing the agent to tend to select an action that satisfies the set index.
In the optimization model construction module, objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and constraint conditions comprise power supply and demand balance constraint, voltage constraint, power constraint of a distributed power supply, power constraint of power storage equipment, power constraint of the power storage equipment and power update constraint of the power storage equipment.
In the operation optimization module, each sub-rewarding function is optimized by adopting an independent critic neural network, and the method specifically comprises the following steps:
Initializing actor neural network parameters Parameters of a critic neural network;
The intelligent agent observes the environment state and selects corresponding actions according to the current strategy;
performing index analysis on actions taken by the intelligent agent, wherein the intelligent agent obtains rewards of environmental feedback;
Based on the target quantity of rewards fed back Decomposing into a plurality of sub rewards; the intelligent agent observes the state at the next moment, and stores the obtained state vector set into an experience pool R;
And randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent. In the optimization output module, the set indexes comprise economic indexes, safety indexes and environment-friendly indexes.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. The micro-grid operation optimization method based on reinforcement learning under network attack is characterized by comprising the following steps:
Describing an objective function and constraint conditions in a micro-grid environment;
the objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and the constraint conditions comprise power supply and demand balance constraint, voltage constraint, power constraint of a distributed power supply, power constraint of a power storage device, electric quantity constraint of the power storage device and electric quantity update constraint of the energy storage device;
Constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment, and setting a state space, an action space and a reward function of an intelligent agent;
The state space, the action space and the rewarding function of the intelligent agent are set, specifically:
The state space of the intelligent agent is as follows:
Wherein, Representing the active power output by a photovoltaic generator,/>Representing the power consumption of residents,/>Representing electricity purchase price of micro-grid to main grid,/>For electricity of electricity storage equipment,/>Representing a set of states;
the action space is as follows:
Wherein, Is the power of the distributed power supply,/>For the amount of electricity purchased,/>Representing the generated power of the electricity storage device,/>Representing a set of actions that satisfy the constraint condition;
the reward function is:
Wherein, Representing the state space observed by the agent,/>Representing the action space of an agent,/>Representing the sum of economic, safe and environment-friendly costs,/>Representing Lagrange penalty factor,/>、/>、/>、/>、/>Respectively representing cost functions corresponding to power supply and demand balance constraint, voltage constraint, distributed power supply power constraint, power constraint and capacity constraint of the power storage equipment;
decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, wherein each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;
each sub-rewarding function is optimized by adopting an independent critic neural network, and specifically comprises the following steps:
Initializing actor neural network parameters Parameters of a critic neural network;
The intelligent agent observes the environment state and selects corresponding actions according to the current strategy;
performing index analysis on actions taken by the intelligent agent, wherein the intelligent agent obtains rewards of environmental feedback;
Based on the target quantity of rewards fed back Decomposing into a plurality of sub rewards; the intelligent agent observes the state at the next moment, and stores the obtained state vector set into an experience pool R;
randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent;
The method for constructing the honey server on the real server by considering the external network attack existing in the actual environment comprises the following steps:
Determining the type of the honeypot;
selecting honey software according to the selected honey type, and simulating various services and operating systems;
Installing honeypot software and configuring a honeypot environment;
configuring and setting a honey server based on honey environment;
The software and the configuration of the honeypot server are updated regularly, so that the honeypot server is ensured to keep the latest state;
Whether the action selected by the agent meets the relevant criteria is part of the sub-rewarding function, thereby making the agent prone to selecting actions that meet the set criteria.
2. The reinforcement learning-based micro-grid operation optimization method under network attack according to claim 1, wherein the set indexes include an economic index, a safety index and an environmental protection index.
3. The micro-grid operation optimization system based on reinforcement learning under network attack is characterized by comprising:
The power grid environment describing module is used for describing an objective function and constraint conditions in the micro-grid environment;
the objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and the constraint conditions comprise power supply and demand balance constraint, voltage constraint, power constraint of a distributed power supply, power constraint of a power storage device, electric quantity constraint of the power storage device and electric quantity update constraint of the energy storage device;
The intelligent agent setting module is used for constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment and setting a state space, an action space and a rewarding function of the intelligent agent;
The state space, the action space and the rewarding function of the intelligent agent are set, specifically:
The state space of the intelligent agent is as follows:
Wherein, Representing the active power output by a photovoltaic generator,/>Representing the power consumption of residents,/>Representing electricity purchase price of micro-grid to main grid,/>For electricity of electricity storage equipment,/>Representing a set of states;
the action space is as follows:
Wherein, Is the power of the distributed power supply,/>For the amount of electricity purchased,/>Representing the generated power of the electricity storage device,/>Representing a set of actions that satisfy the constraint condition;
the reward function is:
Wherein, Representing the state space observed by the agent,/>Representing the action space of an agent,/>Representing the sum of economic, safe and environment-friendly costs,/>Representing Lagrange penalty factor,/>、/>、/>、/>、/>Respectively representing cost functions corresponding to power supply and demand balance constraint, voltage constraint, distributed power supply power constraint, power constraint and capacity constraint of the power storage equipment;
The operation optimization module is used for decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, and each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;
each sub-rewarding function is optimized by adopting an independent critic neural network, and specifically comprises the following steps:
Initializing actor neural network parameters Parameters of a critic neural network;
The intelligent agent observes the environment state and selects corresponding actions according to the current strategy;
performing index analysis on actions taken by the intelligent agent, wherein the intelligent agent obtains rewards of environmental feedback;
Based on the target quantity of rewards fed back Decomposing into a plurality of sub rewards; the intelligent agent observes the state at the next moment, and stores the obtained state vector set into an experience pool R;
randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent;
The method for constructing the honey server on the real server by considering the external network attack existing in the actual environment comprises the following steps:
Determining the type of the honeypot;
selecting honey software according to the selected honey type, and simulating various services and operating systems;
Installing honeypot software and configuring a honeypot environment;
configuring and setting a honey server based on honey environment;
The software and the configuration of the honeypot server are updated regularly, so that the honeypot server is ensured to keep the latest state;
And an optimization output module for determining whether the selected action of the agent satisfies the associated index as part of the sub-rewarding function, thereby causing the agent to tend to select an action that satisfies the set index.
4. The reinforcement learning-based micro-grid operation optimization system under network attack according to claim 3, wherein in the optimization model construction module, the objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and the constraint conditions comprise power supply-demand balance constraint, voltage constraint, power constraint of the distributed power supply, power constraint of the power storage device and power update constraint of the power storage device.
5. The reinforcement learning-based micro-grid operation optimization system under network attack of claim 3, wherein in the operation optimization module, each sub-rewarding function is optimized by adopting an independent critic neural network, and specifically comprises:
Initializing actor neural network parameters Parameters of a critic neural network;
The intelligent agent observes the environment state and selects corresponding actions according to the current strategy;
performing index analysis on actions taken by the intelligent agent, wherein the intelligent agent obtains rewards of environmental feedback;
Based on the target quantity of rewards fed back Decomposing into a plurality of sub rewards; the intelligent agent observes the state at the next moment, and stores the obtained state vector set into an experience pool R;
And randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent.
6. The reinforcement learning-based micro-grid operation optimization system under network attack according to claim 3, wherein the set indexes comprise an economic index, a safety index and an environmental protection index in the optimization output module.
CN202410231339.6A 2024-03-01 2024-03-01 Micro-grid operation optimization method and system based on reinforcement learning under network attack Active CN117808174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410231339.6A CN117808174B (en) 2024-03-01 2024-03-01 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410231339.6A CN117808174B (en) 2024-03-01 2024-03-01 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Publications (2)

Publication Number Publication Date
CN117808174A CN117808174A (en) 2024-04-02
CN117808174B true CN117808174B (en) 2024-05-28

Family

ID=90433799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410231339.6A Active CN117808174B (en) 2024-03-01 2024-03-01 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Country Status (1)

Country Link
CN (1) CN117808174B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020000399A1 (en) * 2018-06-29 2020-01-02 东莞理工学院 Multi-agent deep reinforcement learning proxy method based on intelligent grid
CN113783881A (en) * 2021-09-15 2021-12-10 浙江工业大学 Network honeypot deployment method facing penetration attack
CN114363093A (en) * 2022-03-17 2022-04-15 浙江君同智能科技有限责任公司 Honeypot deployment active defense method based on deep reinforcement learning
CN114725936A (en) * 2022-04-21 2022-07-08 电子科技大学 Power distribution network optimization method based on multi-agent deep reinforcement learning
CN115333143A (en) * 2022-07-08 2022-11-11 国网黑龙江省电力有限公司大庆供电公司 Deep learning multi-agent micro-grid cooperative control method based on double neural networks
CN115473677A (en) * 2022-08-09 2022-12-13 浙江工业大学 Penetration attack defense method and device based on reinforcement learning and electronic equipment
CN115883252A (en) * 2023-01-09 2023-03-31 国网江西省电力有限公司信息通信分公司 Power system APT attack defense method based on moving target defense
US11641579B1 (en) * 2020-03-04 2023-05-02 Cable Television Laboratories, Inc. User agents, systems and methods for machine-learning aided autonomous mobile network access
CN116319060A (en) * 2023-04-17 2023-06-23 北京理工大学 Intelligent self-evolution generation method for network threat treatment strategy based on DRL model
CN116405258A (en) * 2023-03-14 2023-07-07 中国人民解放军战略支援部队信息工程大学 Smart grid honey pot design method and system based on reinforcement learning
CN116562423A (en) * 2023-03-28 2023-08-08 西安工程大学 Deep reinforcement learning-based electric-thermal coupling new energy system energy management method
CN116565876A (en) * 2023-04-20 2023-08-08 武汉大学 Robust reinforcement learning distribution network tide optimization method and computer readable medium
CN116683513A (en) * 2023-06-21 2023-09-01 上海交通大学 Method and system for optimizing energy supplement strategy of mobile micro-grid
CN116776929A (en) * 2023-04-23 2023-09-19 南京航空航天大学 Multi-agent task decision method based on PF-MADDPG
WO2023249744A1 (en) * 2022-06-24 2023-12-28 Caterpillar Inc. Systems and methods for managing assignments of tasks for work machines using machine learning
CN117374937A (en) * 2023-10-11 2024-01-09 中国电力科学研究院有限公司 Multi-micro-grid collaborative optimization operation method, device, equipment and medium
WO2024022194A1 (en) * 2022-07-26 2024-02-01 中国电力科学研究院有限公司 Power grid real-time scheduling optimization method and system, computer device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200119556A1 (en) * 2018-10-11 2020-04-16 Di Shi Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency
EP4035078A1 (en) * 2019-09-24 2022-08-03 HRL Laboratories, LLC A deep reinforcement learning based method for surreptitiously generating signals to fool a recurrent neural network
US20210243226A1 (en) * 2020-02-03 2021-08-05 Purdue Research Foundation Lifelong learning based intelligent, diverse, agile, and robust system for network attack detection

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020000399A1 (en) * 2018-06-29 2020-01-02 东莞理工学院 Multi-agent deep reinforcement learning proxy method based on intelligent grid
US11641579B1 (en) * 2020-03-04 2023-05-02 Cable Television Laboratories, Inc. User agents, systems and methods for machine-learning aided autonomous mobile network access
CN113783881A (en) * 2021-09-15 2021-12-10 浙江工业大学 Network honeypot deployment method facing penetration attack
CN114363093A (en) * 2022-03-17 2022-04-15 浙江君同智能科技有限责任公司 Honeypot deployment active defense method based on deep reinforcement learning
CN114725936A (en) * 2022-04-21 2022-07-08 电子科技大学 Power distribution network optimization method based on multi-agent deep reinforcement learning
WO2023249744A1 (en) * 2022-06-24 2023-12-28 Caterpillar Inc. Systems and methods for managing assignments of tasks for work machines using machine learning
CN115333143A (en) * 2022-07-08 2022-11-11 国网黑龙江省电力有限公司大庆供电公司 Deep learning multi-agent micro-grid cooperative control method based on double neural networks
WO2024022194A1 (en) * 2022-07-26 2024-02-01 中国电力科学研究院有限公司 Power grid real-time scheduling optimization method and system, computer device and storage medium
CN115473677A (en) * 2022-08-09 2022-12-13 浙江工业大学 Penetration attack defense method and device based on reinforcement learning and electronic equipment
CN115883252A (en) * 2023-01-09 2023-03-31 国网江西省电力有限公司信息通信分公司 Power system APT attack defense method based on moving target defense
CN116405258A (en) * 2023-03-14 2023-07-07 中国人民解放军战略支援部队信息工程大学 Smart grid honey pot design method and system based on reinforcement learning
CN116562423A (en) * 2023-03-28 2023-08-08 西安工程大学 Deep reinforcement learning-based electric-thermal coupling new energy system energy management method
CN116319060A (en) * 2023-04-17 2023-06-23 北京理工大学 Intelligent self-evolution generation method for network threat treatment strategy based on DRL model
CN116565876A (en) * 2023-04-20 2023-08-08 武汉大学 Robust reinforcement learning distribution network tide optimization method and computer readable medium
CN116776929A (en) * 2023-04-23 2023-09-19 南京航空航天大学 Multi-agent task decision method based on PF-MADDPG
CN116683513A (en) * 2023-06-21 2023-09-01 上海交通大学 Method and system for optimizing energy supplement strategy of mobile micro-grid
CN117374937A (en) * 2023-10-11 2024-01-09 中国电力科学研究院有限公司 Multi-micro-grid collaborative optimization operation method, device, equipment and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
基于多智能体共享信息的低压配电网拓扑与数据建模技术研究;李澄;陈颢;刘恢;陆玉军;葛永高;王宁;;电子测量技术;20200623(第12期);全文 *
基于深度学习的SDN虚拟蜜网路由优化;胡洋;;计算机系统应用;20201013(第10期);全文 *
稀疏奖励下基于MADDPG算法的多智能体协同;许诺;杨振伟;;现代计算机;20200525(第15期);全文 *

Also Published As

Publication number Publication date
CN117808174A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112819300B (en) Power distribution network risk assessment method based on random game network under network attack
Xiang et al. An improved defender–attacker–defender model for transmission line defense considering offensive resource uncertainties
Holmgren Using graph models to analyze the vulnerability of electric power networks
Xiang et al. A game-theoretic study of load redistribution attack and defense in power systems
Hemmati et al. Reliability constrained generation expansion planning with consideration of wind farms uncertainties in deregulated electricity market
Xu et al. Bayesian adversarial multi-node bandit for optimal smart grid protection against cyber attacks
Liu et al. Intelligent jamming defense using DNN Stackelberg game in sensor edge cloud
CN115550078B (en) Method and system for fusing scheduling and response of dynamic resource pool
Chaoqi et al. Attack-defense game for critical infrastructure considering the cascade effect
Guo et al. Reinforcement-learning-based dynamic defense strategy of multistage game against dynamic load altering attack
Wei et al. Defending mechanisms for protecting power systems against intelligent attacks
CN114282855A (en) Comprehensive protection method of electric-gas coupling system under heavy load distribution attack
CN113934587A (en) Method for predicting health state of distributed network through artificial neural network
Omnes et al. Adversarial training for a continuous robustness control problem in power systems
Shahinzadeh et al. Unit commitment in smart grids with wind farms using virus colony search algorithm and considering adopted bidding strategy
CN117808174B (en) Micro-grid operation optimization method and system based on reinforcement learning under network attack
Ge et al. A game theory based optimal allocation strategy for defense resources of smart grid under cyber-attack
Wang et al. Optimal DoS attack strategy for cyber-physical systems: A Stackelberg game-theoretical approach
Matavalam et al. Curriculum based reinforcement learning of grid topology controllers to prevent thermal cascading
CN112541679A (en) Protection method for power grid under heavy load distribution attack
CN115801460B (en) Power distribution information physical system security risk assessment method considering network attack vulnerability
CN115983389A (en) Attack and defense game decision method based on reinforcement learning
CN108377238B (en) Power information network security policy learning device and method based on attack and defense confrontation
Sridharan et al. Game-theoretic approach to malicious controller detection in software defined networks
Bier et al. Game theory in infrastructure security

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant