CN117808174B

CN117808174B - Micro-grid operation optimization method and system based on reinforcement learning under network attack

Info

Publication number: CN117808174B
Application number: CN202410231339.6A
Authority: CN
Inventors: 刘帅; 王昊晨; 王小文; 徐昊天; 刘龙成; 赵浩然; 华友情
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2024-03-01
Filing date: 2024-03-01
Publication date: 2024-05-28
Anticipated expiration: 2044-03-01
Also published as: CN117808174A

Abstract

The invention belongs to the technical field of power system automation, and provides a micro-grid operation optimization method and system based on reinforcement learning under network attack, wherein the technical scheme is as follows: and decomposing the rewarding function of the intelligent agent into a plurality of sub rewarding functions and performing independent optimization, so that each sub rewarding function reaches the pareto optimum, and the local optimum of a single rewarding function is avoided. And if the action meets the index, feeding back an additional rewarding value to the agent, so that the agent tends to select the action meeting the set index. Under the condition that external network attack exists, the honey server is constructed as a safety protection measure, so that the honey server plays a role in protection, minimizes attack loss and has a certain practical significance.

Description

Micro-grid operation optimization method and system based on reinforcement learning under network attack

Technical Field

The invention belongs to the technical field of power system automation, and particularly relates to a micro-grid operation optimization method and system based on reinforcement learning under network attack.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

The micro-grid is a small power generation and distribution network and consists of a plurality of independent energy systems such as a generator, a load, an energy storage device, a control device and the like. With the popularization of clean energy and distributed generation, the scale and complexity of micro-grid systems are gradually increasing. In the operation process of the micro-grid system, a plurality of targets such as economic cost reduction, safe operation assurance, environmental pollution reduction and the like need to be considered at the same time, and how to realize multi-target optimization of the micro-grid becomes a problem to be solved urgently.

In recent years, with the rapid development of artificial intelligence technology, deep reinforcement learning is increasingly applied to the field of micro-grids. By constructing proper state space, action space and rewarding function, the intelligent agent can interact with the environment and learn to the optimal strategy through multiple experiments and feedback, and the method provides an intelligent solution for the operation and management of the micro-grid system, so that the multi-objective optimization of the micro-grid can be realized.

However, the conventional deep reinforcement learning method has a certain limitation in solving the multi-objective optimization problem. First, it treats multiple targets as a whole and feeds back a whole rewarding function to the agent, which may cause the agent to converge to a locally optimal solution without guaranteeing that each target is optimal; second, the agent may sacrifice the value of some sub-bonus functions in pursuing the maximization of the overall bonus function, which may have serious consequences in the real world. In addition, the micro-grid may be attacked by external network in actual operation, and there is a risk of information leakage.

Disclosure of Invention

In order to solve at least one technical problem in the background art, the invention provides a reinforcement learning-based micro-grid operation optimization method and system under network attack, which can realize multi-objective optimization of the micro-grid under the condition that external network attack exists.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the first aspect of the present invention provides a reinforcement learning-based micro-grid operation optimization method under network attack, including the steps of:

Describing an objective function and constraint conditions in a micro-grid environment;

Constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment, and setting a state space, an action space and a reward function of an intelligent agent;

decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, wherein each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;

Whether the action selected by the agent meets the relevant criteria is part of the sub-rewarding function, thereby making the agent prone to selecting actions that meet the set criteria.

A second aspect of the present invention provides a reinforcement learning-based microgrid operation optimization system under network attack, comprising:

The power grid environment describing module is used for describing an objective function and constraint conditions in the micro-grid environment;

The intelligent agent setting module is used for constructing a micro-grid environment based on reinforcement learning based on an objective function and constraint conditions in the micro-grid environment and setting a state space, an action space and a rewarding function of the intelligent agent;

The operation optimization module is used for decomposing the whole rewarding function of the intelligent agent into a plurality of sub rewarding functions according to the target quantity, and each sub rewarding function is optimized by adopting an independent critic neural network; when each sub rewarding function is optimized by adopting an independent critic neural network, an external network attack existing in an actual environment is considered, a honey pot server is constructed on a real server, and the behavior and response of the real server are simulated so as to isolate an attacker from the real server;

And an optimization output module for determining whether the selected action of the agent satisfies the associated index as part of the sub-rewarding function, thereby causing the agent to tend to select an action that satisfies the set index.

Compared with the prior art, the invention has the beneficial effects that:

1. According to the invention, the rewarding function of the intelligent agent is decomposed into a plurality of sub rewarding functions and is independently optimized, so that each sub rewarding function reaches the pareto optimum, and the local optimum of a single rewarding function is avoided.

2. The invention takes whether the action selected by the agent meets the related index as a part of the sub-rewarding function, so that the agent tends to select the action meeting the set index, and the sacrifice of the value of the sub-rewarding function in the process of pursuing the maximization of the whole rewarding function by the agent is avoided.

3. Under the condition that external network attack exists, the honey pot server is constructed to serve as a safety protection measure, so that the honey pot server has a protection function, minimizes attack loss and has a certain practical significance.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a flowchart of a reinforcement learning-based micro-grid operation optimization method under network attack provided by an embodiment of the invention;

FIG. 2 is a flow chart of sub-bonus function optimization using reinforcement learning provided by an embodiment of the invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Example 1

As shown in fig. 1, the embodiment provides a reinforcement learning-based micro-grid operation optimization method under network attack, which includes the following steps:

S101: describing an objective function and constraint conditions in a micro-grid environment;

In S101, for a micro-grid system including a distributed power source, a power storage device, an electric energy conversion device and an electric load, with the minimum economic cost, the minimum safe cost and the minimum environmental protection cost within a set time as objective functions, constraints such as a micro-grid power supply and demand balance constraint, a distributed power source output power constraint, a power storage device output power constraint and a capacity constraint are comprehensively considered, and a micro-grid multi-objective optimization model based on reinforcement learning is constructed.

Wherein, the economic cost objective function is:

，

In the method, in the process of the invention, Representing the cost of the micro-grid to exchange power with the main grid,/>Representing network loss in the course of power transfer,/>Representing electricity purchase price of micro-grid to main grid,/>Representing the amount of electricity purchased; representing the power generation cost of a distributed power supply,/> Representing the power generation cost factor of the distributed power supply,Representing active power of the distributed power supply; /(I)Representing micro grid internal scheduling costs,/>Indicating the unit price of the dispatch,Representing the/>, in a microgridResponse power of individual demand response block,/>Represents the/>A binary variable to which the individual demand response block responds; /(I)Representing the electricity generation cost of the electricity storage equipment,/>Representing the coefficient of the electricity generation cost of the electricity storage device,Representing the generated power of the electricity storage device.

The safety cost objective function is:

，

In the method, in the process of the invention, Representing the actual voltage and the voltage reference value, respectively, during the operation of the microgrid,/>Representing power loss of a microgrid system,/>Representing the voltage dependence coefficient,/>Representing the nominal voltage at the point of common coupling.

The environmental protection cost objective function is:

，

In the method, in the process of the invention, Represents the set of atmospheric pollutants, parameter/>Represents the/>Punishment coefficient of atmospheric pollutants,/>Represents the/>Atmospheric pollutant discharge rate.

Combining the economic cost objective function, the safety cost objective function and the environment-friendly cost objective function to obtain an objective function;

the constraint conditions specifically include:

power supply and demand balance constraint: ，

Voltage constraint: ，

power constraint of distributed power supply: ，

power constraint of the power storage device: ，

Electric quantity constraint of electric storage equipment: ，

Updating the electric quantity of the energy storage device: ，

In the method, in the process of the invention, Representing the active power output by a photovoltaic generator,/>The power consumption of residents is represented;、/> Respectively representing the minimum value and the maximum value of the allowable voltage; /(I) 、/>Respectively representing the minimum value and the maximum value of the output power of the distributed power supply; /(I)、/>Respectively representing the minimum value and the maximum value of the output power of the power storage equipment; /(I)、/>Respectively representing the minimum value and the maximum value of the electricity quantity of the electricity storage equipment; /(I)、/>Respectively representing the charging efficiency and the discharging efficiency of the electricity storage equipment,/>、/>Respectively representing the charge power and the discharge power,、/>Respectively, the charge time and the discharge time.

S102: constructing a micro-grid environment based on reinforcement learning, and setting a state space, an action space and a reward function of an agent, wherein the method specifically comprises the following steps of:

S2011: building the state space observed by the intelligent agent: the state space observed by the intelligent agent in the micro-grid environment comprises 、/>、/>And/>Expressed as:

，

In the method, in the process of the invention, Representing a set of states.

S2012: constructing an action space of the intelligent body: the action space of the intelligent body comprises、/>、/>Expressed as:

，

In the method, in the process of the invention, Representing a set of actions that satisfy the constraint.

S2013: constructing a reward function of the intelligent agent in the training process by combining the observed state space of the intelligent agent and the action space of the intelligent agent: when the intelligent agent is in stateAction taken/>Rewards returned to the agent by the environment; Namely:

，

In the method, in the process of the invention, Representing the sum of economic, safe and environmental costs, i.e./>，/>Representing Lagrange penalty factor,/>、/>、/>、/>、/>And respectively representing cost functions corresponding to the power supply and demand balance constraint, the voltage constraint, the distributed power supply power constraint, the power constraint of the power storage equipment and the capacity constraint.

S103: and solving an optimization problem by utilizing DDPG algorithm, decomposing the rewarding function of the intelligent agent according to the target quantity, and independently optimizing each sub rewarding function by a critic neural network.

The method specifically comprises the following steps: training is performed based on the agent established in S102, and the idea of target decomposition is used in DDPG algorithm of reinforcement learning. The whole rewarding function is decomposed into M sub rewarding functions, and each sub rewarding function is optimized by adopting an independent critic neural network, so that the best pareto solution of each sub rewarding function can be guaranteed, and the local optimization of the rewarding function is avoided. Decomposing the bonus function according to the goal, namely:

，

the algorithm DDPG is only aimed at one total target based on the improvement of DDPG algorithm in the traditional reinforcement learning, after actor neural network selection action, a total reward function is returned to the intelligent agent at each moment, and only one critic neural network is used for evaluation.

Therefore, in this embodiment, the total target is decomposed into M sub-targets, and for each sub-target, the agent obtains a sub-reward function at each moment, and each sub-reward function is evaluated by a critic neural network, so that the sum of obtained reward functions is higher than that of optimizing only the whole reward function, thereby improving the strategy of the agent.

The specific optimization algorithm process comprises the following steps:

Step 1: initializing actor neural network parameters And/>Parameters of critic neural network/>Initializing target network parameters/>、/>Initializing an empirical playback buffer/>, for storing training dataAnd parameters T and T;

step 2: initializing environment and obtaining initial state of agent ；

Step 3: the intelligent body observes the environmental stateActor neural network generates corresponding actions according to the current strategyAdding noise to promote action exploration;

Step 4: performing index analysis on actions taken by the agent:

Performing the selected action Observation of rewards of environmental feedback/>Decomposition/>Sub-prize value/>And the state of the next moment/>；

Step 7: the obtained current state vector setStore to experience playback buffer/>In (a) and (b);

step 8: randomly selecting experience pool Training a set of data comprising: random sampling/>, when an empirical playback buffer stores more than a certain amount of dataEmpirical data/>; Calculation of target/>, using sampled empirical dataValue/>Wherein/>Representing a discount factor; by minimizing/>The loss function of value updates critic the parameters of the network: /(I); Using the outputs of critic networks, the policy gradients of actor networks are calculated and the parameters of actor networks are updated by gradient ascent:；

step 9: a soft update is made to the target network, ，/>Wherein/>Representing soft update parameters.

S104: whether the action selected by the agent meets the relevant criteria is part of the sub-rewarding function, thereby making the agent prone to selecting actions that meet the set criteria.

In this embodiment, the optimization is performed based on the DDPG algorithm after the improvement in S103, and whether the action selected by the agent meets the relevant index is used as a part of the sub-rewarding function, if the action meets the index, the additional rewarding value is fed back to the agent, so that the agent tends to select the action meeting the set index.

In particular, the micro-grid needs to be compatible with reducing economic cost, ensuring safe operation of the distributed power supply and reducing environmental pollution in the actual operation process. In order to prevent the intelligent agent from pursuing the maximum overall rewarding function and neglecting individual indexes in the training process, thereby causing huge loss in the actual running process, the indexes such as economy, safety, environmental protection and the like need to be subjected to condition setting, and the specific expression is as follows:

Economic index:

，

In the method, in the process of the invention, Representing the maximum acceptable economic cost for the microgrid. On the basis of meeting the constraint condition, when the economic cost/>Less than/>At this time, the economic cost of the micro-grid is low, and the environment is fed back to the intelligent agent as an extra reward/>; When economic cost/>Not less than/>When the economic cost required by the micro-grid to meet the constraint condition is too high, the environment does not provide additional rewards.

The economic cost objective function actually obtained by the intelligent agent is as follows:

，

In the method, in the process of the invention, Representing the economic cost actually obtained after the agent takes the action,/>Indicating a decision on the action taken by the agent and feeding back an additional prize value.

Safety index:

，

In the method, in the process of the invention, Indicating that the microgrid is acceptable/>And/>Maximum threshold of gap. When the voltage of the micro grid/>And reference voltage/>The difference between (2) is less than/>At the moment, the running of the micro-grid is safe and reliable, and the environment is fed back to the intelligent agent for giving extra rewards/>; When the voltage of the micro-grid is too large from the reference voltage, the operation of the micro-grid has potential safety hazards, so that no additional rewarding value is provided.

The actual safety cost objective function obtained by the intelligent agent is as follows:

，

In the method, in the process of the invention, Representing the actual cost of security obtained after an agent takes an action,/>Indicating a decision on the action taken by the agent and feeding back an additional prize value.

Environmental protection indexes:

，

In the method, in the process of the invention, Representation/>And/>A maximum threshold value of the sum. When/>And/>The sum is less than the thresholdWhen the micro-grid generates less air pollutants, the environment is fed back to the intelligent agent to give an extra reward/>Otherwise, the micro-grid is considered to emit more air pollutants during operation, so that a reward value is not provided.

The actual environmental protection cost objective function that obtains of intelligent agent is:

，

S105: and comprehensively considering the external network attack existing in the actual environment, and constructing the honey server on the basis of the real server to reduce the loss caused by information leakage.

Based on the algorithm optimized in the step S104, the honey pot server is set on the basis of the original real server by considering the external network attack existing in the actual environment, and the behavior and response of the real server are simulated to isolate an attacker from the real server, so that the functions of protecting the privacy of the micro-grid and preventing data leakage are achieved. By recording the information of the attack and the attacker, protection of the system and analysis and tracking of the attack can be provided.

The method specifically comprises the following steps:

s501: modeling both network attack and defense parties, specifically comprising:

constructing a set of offensive and defensive game participants Wherein/>Representing an attacker among the network attack and defense participants; /(I)Is an defender, namely a micro-grid, among network attack and defense participants.

Constructing attack and defense game participant policy setsWherein/>A set of policies representing an attacker; /(I)Representing a set of policies for the microgrid.

Constructing a set of defender actual server typesWherein/>、/>Representing the real server and honeypot server types, respectively.

Constructing server type signal sets released by defendersWherein/>、/>The server type signals respectively representing the release of the defender are a real server and a honey server. In the attack and defense process, the signals released by the defender are not necessarily the same as the actual server types, so that the judgment of the attacker is interfered, and the effect of resisting external attack is improved.

Constructing an attack mode set of an attackerWherein/>Representing a direct attack,/>The representative detects the type of defender server first and then attacks, differing in whether the attacker detects the attacking server in advance. If the detected counterpart is a honey server, the attacker gives up the attack, otherwise, the attack is continued, but the risk of failure in detection exists.

S502, analyzing the profit quantization of different participants under the honey-free server and the profit quantization in the honey dynamic game based on the constructed network attack and defense model;

s5021, quantifying the benefits of different participants under the honey-free server comprises the following steps:

The defender policy expects benefits:

，

In the method, in the process of the invention, Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>The income obtained by the defender is large; /(I)Representing the loss of system after defending against an attack, wherein/>Representing the fatality of an attacker launching an attack strategy,/>Representing an integrity cost,/>The weight of the integrity is represented as such,Representing privacy costs,/>Representing privacy weighting,/>Representing availability cost,/>Representing availability weights; /(I)Representing the cost spent by an defender against an attacker's attack, where/>Representing the operational cost of a defensive strategy,/>Indicating the inherent harm the attacker has to the defender due to the aggressive action taken by the attacker,Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>The extent of damage to the defender.

Attacker policy expects benefits:

，

In the method, in the process of the invention, Representing that an attacker adopts an attack strategy/>Defending person adopts defending strategy/>Afterwards, the size of the income obtained by the attacker; /(I)Representing that an attacker adopts an attack strategy/>Is not limited by the cost of (a).

S5022, profit quantization in the honey dynamic game comprises profit quantization of direct attack by an attacker and profit quantization of attack after detection by the attacker; the method comprises the steps of quantifying the benefits of attacking the honeypot server and quantifying the benefits of attacking the real server in each case;

wherein when an attacker directly attacks, when the attacker attacks the honey server,

The defender benefits are:；

The attacker benefits are: ；

In the method, in the process of the invention, Is a honey factor and represents the capacity of honey server deception; /(I)Representing benefits obtained by defenders through monitoring and detecting the attack behaviors of the other parties by the honey pot server; /(I)Indicating that the defender releases camouflage costs different from the actual server signal; /(I)The attack strategy initiated by the attacker is utilized by the honeypot server of the defender, and the attack cost caused by partial information of the attacker is revealed.

When an attacker directly attacks, and when the attacker attacks the real server,

The defender benefits are:；

The attacker benefits are: ；

Wherein, when an attacker detects and then attacks the honey server,

The defender benefits are:；

The attacker benefits are: ；

In the method, in the process of the invention, Representing the probability that an attacker correctly detects the defender server type,/>Representing the cost of detection of an attacker.

Wherein, when an attacker detects and attacks the real server,

The defender benefits are:；

The attacker benefits are: ；

Before attacking the server of the defender, the attacker finds that the server is a real server through the detection technology of the attacker, and then starts the attack, so that a certain benefit is obtained, and meanwhile, the defender is lost.

By comparing the profit quantization of different participants under the honey-free server with the profit quantization in the honey dynamic game, it can be seen that if the honey server is not arranged, an attacker can certainly launch an attack on the real server of the defender, so that the defender loses such as data leakage. By setting the honey server, the defender releases false information and acquires the attack strategy of an attacker, so that the privacy of the defender is protected, and a certain defending effect is achieved in an actual environment.

S5023, setting a honey server on the basis of an original real server based on the analysis conclusion, and simulating the behavior and response of the real server, wherein the honey server specifically comprises the following steps:

S50231: the type of honeypot set is determined, e.g., network honeypot, system honeypot, application honeypot, etc.

S50232: according to the selected honey type, proper honey software is selected to simulate various services and operating systems, so that the honey server is more real and reliable.

S50233: and installing selected software and configuring the honey pot environment on an independent server or virtual machine to ensure that the honey pot environment can normally communicate with an external network.

S50234: configuring the honey server includes installing common services, opening ports, manufacturing false information, etc.

S50235: and setting a honey server, monitoring all accesses and interactions, recording the behavior of an attacker, analyzing intrusion attempts and taking corresponding measures.

S50236: the software and configuration of the honeypot server is updated periodically to ensure that it remains up-to-date to accommodate the changing threat environment.

Example two

The embodiment provides a micro-grid operation optimization system based on reinforcement learning under network attack, which comprises:

In the optimization model construction module, objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and constraint conditions comprise power supply and demand balance constraint, voltage constraint, power constraint of a distributed power supply, power constraint of power storage equipment, power constraint of the power storage equipment and power update constraint of the power storage equipment.

In the operation optimization module, each sub-rewarding function is optimized by adopting an independent critic neural network, and the method specifically comprises the following steps:

Initializing actor neural network parameters Parameters of a critic neural network;

The intelligent agent observes the environment state and selects corresponding actions according to the current strategy;

performing index analysis on actions taken by the intelligent agent, wherein the intelligent agent obtains rewards of environmental feedback;

Based on the target quantity of rewards fed back Decomposing into a plurality of sub rewards; the intelligent agent observes the state at the next moment, and stores the obtained state vector set into an experience pool R;

And randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent. In the optimization output module, the set indexes comprise economic indexes, safety indexes and environment-friendly indexes.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The micro-grid operation optimization method based on reinforcement learning under network attack is characterized by comprising the following steps:

the objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and the constraint conditions comprise power supply and demand balance constraint, voltage constraint, power constraint of a distributed power supply, power constraint of a power storage device, electric quantity constraint of the power storage device and electric quantity update constraint of the energy storage device;

The state space, the action space and the rewarding function of the intelligent agent are set, specifically:

The state space of the intelligent agent is as follows:

，

Wherein, Representing the active power output by a photovoltaic generator,/>Representing the power consumption of residents,/>Representing electricity purchase price of micro-grid to main grid,/>For electricity of electricity storage equipment,/>Representing a set of states;

the action space is as follows:

，

Wherein, Is the power of the distributed power supply,/>For the amount of electricity purchased,/>Representing the generated power of the electricity storage device,/>Representing a set of actions that satisfy the constraint condition;

the reward function is:

，

Wherein, Representing the state space observed by the agent,/>Representing the action space of an agent,/>Representing the sum of economic, safe and environment-friendly costs,/>Representing Lagrange penalty factor,/>、/>、/>、/>、/>Respectively representing cost functions corresponding to power supply and demand balance constraint, voltage constraint, distributed power supply power constraint, power constraint and capacity constraint of the power storage equipment;

each sub-rewarding function is optimized by adopting an independent critic neural network, and specifically comprises the following steps:

randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent;

The method for constructing the honey server on the real server by considering the external network attack existing in the actual environment comprises the following steps:

Determining the type of the honeypot;

selecting honey software according to the selected honey type, and simulating various services and operating systems;

Installing honeypot software and configuring a honeypot environment;

configuring and setting a honey server based on honey environment;

The software and the configuration of the honeypot server are updated regularly, so that the honeypot server is ensured to keep the latest state;

2. The reinforcement learning-based micro-grid operation optimization method under network attack according to claim 1, wherein the set indexes include an economic index, a safety index and an environmental protection index.

3. The micro-grid operation optimization system based on reinforcement learning under network attack is characterized by comprising:

The state space of the intelligent agent is as follows:

，

the action space is as follows:

，

the reward function is:

，

Determining the type of the honeypot;

Installing honeypot software and configuring a honeypot environment;

configuring and setting a honey server based on honey environment;

4. The reinforcement learning-based micro-grid operation optimization system under network attack according to claim 3, wherein in the optimization model construction module, the objective functions in the micro-grid environment comprise economic cost, safety cost and environmental protection cost objective functions, and the constraint conditions comprise power supply-demand balance constraint, voltage constraint, power constraint of the distributed power supply, power constraint of the power storage device and power update constraint of the power storage device.

5. The reinforcement learning-based micro-grid operation optimization system under network attack of claim 3, wherein in the operation optimization module, each sub-rewarding function is optimized by adopting an independent critic neural network, and specifically comprises:

And randomly selecting a group of data in the experience pool R for training, and updating parameters of M critic neural networks to obtain a final strategy of the intelligent agent.

6. The reinforcement learning-based micro-grid operation optimization system under network attack according to claim 3, wherein the set indexes comprise an economic index, a safety index and an environmental protection index in the optimization output module.