CN111144728A - Deep reinforcement learning-based economic scheduling method for cogeneration system - Google Patents

Deep reinforcement learning-based economic scheduling method for cogeneration system

Info

Publication number
CN111144728A
Authority
CN
China
Prior art keywords
value
heat
reward
action
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911314830.0A
Other languages
Chinese (zh)
Other versions
CN111144728B (en)
Inventor
周苏洋
胡子健
顾伟
吴志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou Power Supply Branch Of State Grid Jiangsu Electric Power Co ltd
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201911314830.0A priority Critical patent/CN111144728B/en
Publication of CN111144728A publication Critical patent/CN111144728A/en
Application granted granted Critical
Publication of CN111144728B publication Critical patent/CN111144728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315 Needs-based resource requirements planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Water Supply & Treatment (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a combined heat and power system economic dispatching method based on deep reinforcement learning, which comprises: S1, describing the operation model of the cogeneration system with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof; S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for different operating states of the cogeneration system. The invention greatly improves convenience of use and has better convergence performance.

Description

Deep reinforcement learning-based economic scheduling method for cogeneration system
Technical field:
The invention belongs to the technical field of energy system optimization and control, and particularly relates to an economic dispatching method for a cogeneration system based on the DPPO deep reinforcement learning algorithm.
Background art:
The contradiction between current social development and energy consumption is increasingly obvious. The world energy statistics yearbook released by the British oil company in 2018 shows that the coal reserves explored worldwide can sustain human production activities for only about 134 years, and oil and natural gas for only about 53 years. Achieving the extremely challenging environmental protection targets and providing an economical and sustainable energy supply for present and future generations therefore urgently requires innovation and change in the current mode of energy use. Against this background, the concept of the Integrated Energy System (IES) has emerged; its essence is to integrate various energy sources (such as electricity, gas, heat and hydrogen) and to make full use of the synergy and complementarity between them, so as to improve overall energy utilization efficiency, promote the consumption of renewable energy, and reduce energy consumption, cost and emissions. IES has proved to be an effective energy solution and has great potential for building a safe, efficient, clean and flexible future energy system.
As a typical form of integrated energy system, a combined heat and power system establishes a wide connection between the electricity and heat subsystems through coupling devices (such as cogeneration units, electric boilers and electric heat pumps). Compared with a traditional, separately operated energy supply system, the cogeneration system can make full use of the waste heat generated in the power generation process to meet part of the civil or industrial heating load, thereby improving overall energy utilization efficiency. Furthermore, the thermal inertia of the heating system can significantly increase the flexibility of the system to absorb renewable energy and to optimize operation, and it enhances the stability of the power system by reducing the volatility of renewable energy. Owing to these advantages, electricity-heat integrated energy systems are attracting more and more research attention at home and abroad.
Unlike the operation optimization of a single power supply system, the cogeneration system faces a more complicated and changeable operating environment owing to the coupling of devices and the access of various kinds of equipment and loads, which poses great challenges to intelligent optimal scheduling of the system. In order to provide a control strategy that can cope with multiple operation scenarios and to improve the intelligence of economic dispatch, the invention adopts an optimization strategy based on a deep reinforcement learning algorithm, which learns and memorizes different operating conditions with high data storage efficiency, and trains an intelligent agent capable of coping with multiple operation scenarios.
Summary of the invention:
The invention aims to provide, in view of the existing problems, an economic dispatching method for a cogeneration system based on deep reinforcement learning which achieves the same economic performance as the traditional optimization method, while the trained intelligent agent can be reused to cope with a variety of operating states, greatly improving convenience of use. Meanwhile, compared with other reinforcement learning strategies, the improved DPPO algorithm (namely the distributed proximal policy optimization algorithm) has better convergence performance.
The above object of the present invention can be achieved by the following technical solutions:
a combined heat and power generation system economic dispatching method based on deep reinforcement learning comprises the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
As an improvement of the present invention, the constituent elements of the Markov chain model described in step S1 include the environment and the action. For a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r. The cogeneration system is therefore defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor. The specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I.
In the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0. c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation. d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value. x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price. a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
As an improvement of the present invention, in step S1 the objective function part of the optimization method is strictly transformed and a proof is given. The specific method is as follows: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round. Following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
In the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is. Suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value. Hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution. By the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution. According to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t.
The specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value. The final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
As an improvement of the present invention, in step S1 the constraint part of the optimization method is strictly transformed and a proof is given. The specific method is as follows:
1) Supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine. According to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement.
2) Plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits. According to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible.
3) Energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency. The limit on the amount of stored heat is converted into the state transition probability.
as a modification of the present invention, step S2 further includes:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices. A neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy.
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps. Over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training.
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed. Let θ be the parameter vector of the action network. The cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0. Directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters. The policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ, and (a_t | s_t) denotes selecting action a_t in state s_t. The update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient; each update of the action-network parameters is thus limited to a certain range in order to achieve better convergence performance.
As an improvement of the present invention, the in-round intelligent agent described in step S2 generates the control strategy according to its current internal neural-network parameters, and a distributed acquisition architecture is used while interacting with the operating environment: several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data, and the parameters are finally updated uniformly.
Advantageous effects:
The method achieves the same economic performance as the traditional optimization method, while the trained intelligent agent can be reused to cope with a variety of operating states, greatly improving convenience of use. Meanwhile, compared with other reinforcement learning strategies, the improved DPPO algorithm has better convergence performance.
Description of the drawings:
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a schematic diagram of an example cogeneration system;
FIG. 3 is a schematic diagram of application result 1 of the present invention;
FIG. 4 is a schematic diagram of application result 2 of the present invention.
Detailed description of the embodiments:
The invention is described in further detail below with reference to the figures and to specific embodiments.
For the cogeneration system shown in fig. 2, the conventional optimization method expresses it as a nonlinear system of equations composed of the optimization objectives and constraints. The invention instead describes the operation model with a Markov chain model, strictly transforms the objective function and the constraints of the optimization method respectively, and gives a proof, comprising the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
A Markov chain based operation model is first established for the cogeneration system shown in fig. 2. The constituent elements of the Markov chain model in step S1 include the environment and the action. For a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r. The cogeneration system is therefore defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor. The specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I.
In the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0. c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation. d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value. x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price. a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
The objective function part of the optimization method is then strictly transformed and a proof is given. The specific method is as follows: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round. Following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
In the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is. Suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value. Hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution. By the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution. According to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t.
The specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value. The final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
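The exact reward expression is given only as an image in the original; the following Python sketch is therefore a hedged illustration of the three components described above (operating cost, power mismatch penalty, and deviation of the final tank storage from an ideal level). All prices, efficiencies, weights and the ideal storage level are assumed values, not values from the patent.

```python
import numpy as np

def step_reward(p_gt, q_gb, p_grid, d, rho_gas=0.3, rho_grid=0.627,
                eta_gt=0.35, eta_gb=0.9):
    """One-step reward: negative fuel cost, minus grid purchase cost (or plus sale revenue),
    minus a penalty on the electric and heat supply-demand mismatch."""
    c_gas = rho_gas * (p_gt / eta_gt + q_gb / eta_gb)   # gas bought for turbine and boiler
    c_grid = rho_grid * p_grid                          # > 0 buying from the grid, < 0 selling
    mismatch = np.linalg.norm(d[:2])                    # electric and heat mismatch components
    return -(c_gas + c_grid) - mismatch

def final_reward(q_tst_end, q_tst_ideal=2000.0, weight=0.1):
    """End-of-round term: keep the final tank storage close to its ideal level."""
    return -weight * abs(q_tst_end - q_tst_ideal)

# Example: one step reward plus the terminal storage term
r_step = step_reward(p_gt=3000.0, q_gb=4000.0, p_grid=-800.0, d=np.array([0.0, 50.0]))
print(r_step + final_reward(q_tst_end=2100.0))
```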
The constraint part of the optimization method is then strictly transformed and a proof is given. The specific method is as follows:
1) Supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine. According to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement.
2) Plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits. According to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible.
3) Energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency. According to the definition of the action in the Markov chain model, the charge/discharge efficiency limits are converted into the action value for the heat storage tank, whose action range lies within the charge/discharge efficiency range; the limit on the amount of stored heat is converted into the state transition probability.
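One way to realize the constraint handling described above during a simulated rollout is sketched below: the device set-points are clipped to an assumed feasible box, and a tank charge/discharge that would drive the stored heat outside its bounds is reduced, which plays the role of the zero transition probability for infeasible states. All limit values are assumptions, not values from the patent.

```python
import numpy as np

# Assumed operating ranges for the controllable set-points [p_gt, q_gb, q_tst, p_grid] (kW)
LOW = np.array([0.0, 0.0, -1000.0, -3000.0])
HIGH = np.array([8000.0, 9000.0, 1000.0, 3000.0])
Q_TST_MIN, Q_TST_MAX = 0.0, 5000.0                      # tank storage bounds (kWh)

def apply_action(setpoints, delta, q_tst_level):
    """Clip proposed set-points to the feasible box and limit the tank charge/discharge
    so the stored heat never leaves its bounds (infeasible transitions get probability 0)."""
    proposed = np.clip(setpoints + delta, LOW, HIGH)
    max_charge = Q_TST_MAX - q_tst_level
    max_discharge = q_tst_level - Q_TST_MIN
    proposed[2] = np.clip(proposed[2], -max_discharge, max_charge)
    return proposed, q_tst_level + proposed[2]

setpoints = np.array([3000.0, 4000.0, 500.0, 800.0])
new_sp, new_level = apply_action(setpoints, np.array([50.0, -20.0, 600.0, -40.0]), 4800.0)
print(new_sp, new_level)
```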
The intelligent agent is finally trained based on the DPPO reinforcement learning algorithm, which specifically comprises the following steps:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices. A neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy.
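A minimal sketch of step S21 under assumed load and price ranges: a random operating scenario is drawn before each round, and a small action network (here a two-layer NumPy multilayer perceptron standing in for the patent's neural network, whose exact structure is not specified) maps the 12-dimensional state to the four action increments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scenario():
    """Random operating condition drawn from an assumed feasible domain."""
    return {
        "q_load": rng.uniform(4000.0, 12000.0),    # heat load (kW)
        "p_load": rng.uniform(3000.0, 9000.0),     # electric load (kW)
        "p_wind": rng.uniform(0.0, 1500.0),        # wind generation (kW)
        "tst_init": rng.uniform(500.0, 4000.0),    # initial tank storage (kWh)
        "rtp": rng.uniform(0.3, 0.9),              # time-of-use price ($/kWh)
    }

class ActionNetwork:
    """Tiny two-layer policy network: 12-dimensional state -> 4 action increments."""
    def __init__(self, n_state=12, n_hidden=64, n_action=4):
        self.w1 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_action))
        self.b2 = np.zeros(n_action)

    def act(self, state):
        h = np.tanh(state @ self.w1 + self.b1)
        return h @ self.w2 + self.b2               # mean action increments (kW)

net = ActionNetwork()
print(sample_scenario(), net.act(np.zeros(12)))
```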
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps. Over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training.
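Step S22 can be sketched as the rollout loop below, in which env_step is a user-supplied simulator of the cogeneration environment (the patent text does not specify its implementation): 300 interaction steps per round, with states, actions and rewards stored for the later update.

```python
def run_round(env_step, policy, initial_state, horizon=300):
    """Collect one training round of (state, action, reward) transitions."""
    states, actions, rewards = [], [], []
    state = initial_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = env_step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards

# Dummy usage with trivial stand-ins for the environment and the policy
dummy_env = lambda s, a: (s, -abs(sum(a)))
dummy_policy = lambda s: [0.0, 0.0, 0.0, 0.0]
_, _, rewards = run_round(dummy_env, dummy_policy, [0.0] * 12)
print(len(rewards))   # 300 stored transitions, used afterwards for the parameter update
```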
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed. Let θ be the parameter vector of the action network. The cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0. Directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters. The policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ, and (a_t | s_t) denotes selecting action a_t in state s_t. The update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient; each update of the action-network parameters is thus limited to a certain range in order to achieve better convergence performance.
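The clipped update ratio of step S23 corresponds to the standard PPO clipped surrogate objective; the NumPy sketch below shows the ratio z_t(θ) and its clipping to (1 − ε, 1 + ε). The advantage values and log-probabilities are random placeholders used only for illustration.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective, to be maximised w.r.t. the action-network parameters."""
    z = np.exp(logp_new - logp_old)                     # z_t(theta) = pi_theta / pi_theta_old
    unclipped = z * advantages
    clipped = np.clip(z, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # pessimistic (clipped) bound

# Illustrative call with random placeholder data for a 300-step round
rng = np.random.default_rng(1)
logp_old = rng.normal(size=300)
logp_new = logp_old + rng.normal(scale=0.05, size=300)
advantages = rng.normal(size=300)
print(clipped_surrogate(logp_new, logp_old, advantages))
```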
Meanwhile, in order to improve the efficiency and the comprehensiveness of data acquisition, a distributed acquisition architecture is used: several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data and the parameters are updated uniformly.
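A hedged sketch of the distributed acquisition architecture: several workers (here Python threads; the patent does not specify the parallelization mechanism) collect rounds in parallel, and their data are merged for a single, uniform update of the action network.

```python
import threading

def collect_worker(worker_id, n_rounds, results, lock):
    """Each worker collects its own rounds; stand-in for run_round from the sketch above."""
    local = []
    for r in range(n_rounds):
        local.append((worker_id, r, [0.0] * 300))    # placeholder (states, actions, rewards)
    with lock:
        results.extend(local)

results, lock, threads = [], threading.Lock(), []
for wid in range(4):                                  # four parallel collectors (assumed)
    t = threading.Thread(target=collect_worker, args=(wid, 2, results, lock))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(len(results))   # 8 rounds of data, followed by one uniform update of the action network
```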
The training rounds are repeated until the accumulated reward value is stable; the parameter values of the trained agent network are then stored and can be called directly when needed. For different operation scenarios, the data of the real operation scene are input into the intelligent agent to generate the optimal control strategy.
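Putting the pieces together, the outer loop described here (repeat rounds until the cumulative reward stabilizes, store the network parameters, then reuse the trained agent on real operating scenarios) could look like the hypothetical sketch below, assuming the action-network weights are kept in a dictionary with keys w1, b1, w2, b2; the stability test, thresholds and file name are assumptions.

```python
import numpy as np

def train(params, collect_round, update, patience=50, tol=1.0, max_rounds=10000):
    """Repeat training rounds until the cumulative reward curve has stabilised."""
    history = []
    for _ in range(max_rounds):
        cumulative_reward, round_data = collect_round(params)
        params = update(params, round_data)
        history.append(cumulative_reward)
        recent = history[-patience:]
        if len(recent) == patience and max(recent) - min(recent) < tol:
            break                                     # reward considered stable
    return params, history

def save(params, path="chp_agent.npz"):
    np.savez(path, **params)                          # store the trained network parameters

def dispatch(params, scenario_state):
    """Reuse the trained agent: feed a real operating scenario, obtain the control action."""
    h = np.tanh(scenario_state @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]
```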
Fig. 3 shows how the operating parameters of the equipment inside the cogeneration system change when, at a certain moment, the input heat load is 9000 kW, the electric load is 6000 kW, the wind power generation is 700 kW, and the real-time electricity price is 0.627 $/kWh. In the figure, gt refers to the operating state of the gas turbine, gb to the operating state of the gas boiler, tst to the operating state of the heat storage tank, and grid to the amount of electricity exchanged with the power grid. At this moment the electric load is below its all-day average, the heat load level is high, and the energy price is low; the gas boiler therefore carries more of the heat load, the gas turbine output is smaller, the heat storage tank is charged while the energy price is low so as to maximize profit, and the surplus electric energy is sold to the grid. According to the results of fig. 3, the results obtained by the invention meet the actual requirements.
FIG. 4 illustrates the control strategy generated by the intelligent agent for a day-ahead economic dispatch, where the operation scenarios of 24 different time periods over the whole day are input. Fig. 4(a) shows the relationship between the heat load and the heat supply, and fig. 4(b) the relationship between the electric load and the power supply. In the figure, TST refers to the heat charge/discharge of the heat storage tank, GT to the heat/electricity output of the gas turbine, Grid to the electricity traded with the power grid, and GB to the heat output of the gas boiler. The dotted line indicates the actual load demand and the solid line the energy supply value. It can be seen from fig. 4 that there is a slight difference between the heat supply and the actual load, but in actual operation the heat supply and the heat load demand do not have to match perfectly, so the control strategy can be regarded as meeting the heat load demand. The electric load almost completely coincides with the power supply, indicating that the power supply meets the load requirement. Meanwhile, when the control strategy for this case is solved with the traditional optimization method, the whole-day operation cost is 16924.029 $, whereas the control strategy obtained with the invention yields a whole-day operation cost of 16874.28 $. This shows that the method of the invention achieves the same economic performance as the conventional optimization method.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (6)

1. A combined heat and power generation system economic dispatching method based on deep reinforcement learning, characterized in that the method comprises the following steps:
S1, for the operation model of the cogeneration system, describing the operation model with a Markov chain model, strictly transforming the objective function and the constraints of the optimization method respectively, and giving a proof;
S2, improving the DPPO algorithm in deep reinforcement learning to train an intelligent agent under various operating states: before each training round begins, the operating environment randomly generates operating data within a reasonable operating range; within the round, the intelligent agent generates a control strategy according to its current internal neural-network parameters and interacts with the operating environment; after the round ends, back-propagation is performed with the objective of maximizing the accumulated reward in the round, and the network parameters of the intelligent agent are optimized, so that the agent learns economic dispatching strategies for the different operating states of the cogeneration system.
2. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: the constituent elements of the Markov chain model in step S1 include the environment and the action; for a state s ∈ S of the cogeneration system operating environment, the intelligent agent generates an action a ∈ A; the environment operates according to the action and feeds back a reward r, so the cogeneration system is defined by a six-element tuple (S, A, P, r, p0, γ), where P: S × A → S is the matrix of probabilities of transitioning from one state to another, p0 is the probability distribution of the initial state, and γ ∈ (0, 1) is the exploration factor; the specific relationship between the parameters is described by two formulas that appear as images in the original, defining the state in terms of the quantities below and the transition condition in terms of the indicator I;
in the formulas: I is an indicator function; within one training round, I = 1 if the power mismatch is less than the limit ε, otherwise I = 0; c = [p_gt, q_gt, q_gb, q_tst, p_grid, p_wind] is the device operating state vector, whose components are, in order, the electric output of the gas turbine, the heat output of the gas turbine, the heat output of the gas boiler, the heat charge/discharge value of the heat storage tank, the power exchanged with the grid, and the wind-turbine generation; d = [(p_l − p_s), (q_l − q_s), p_l, q_l] is the power-mismatch vector, where p_l is the electric load demand, p_s the electric supply value, q_l the heat load demand and q_s the heat supply value; x = [tst_i, rtp] holds the two random environment variables: tst_i, the initial state of the heat storage tank at the i-th moment, and rtp, the time-of-use electricity price; a = [Δp_gt, Δp_gb, Δq_tst, Δp_grid] is the action value, whose components are the changes, when an action is taken, in the gas-turbine output, the gas-boiler output, the heat charge/discharge of the heat storage tank, and the trading volume with the power grid.
3. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: in step S1 the objective function part of the optimization method is strictly transformed and a proof is given, the specific method being: let π be a random policy generated by the intelligent agent, π = {a_0, a_1, ..., a_n}, representing the set of actions from step 0 to the last step of a training round; following the standard definitions for the Markov chain problem (the first two formulas appear as images in the original and are reproduced here in their standard discounted form):
R^π(s_t, a_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}, a_{t+l}) ]
V^π(s_t) = E_{a∼π} [ Σ_{l≥0} γ^l · r(s_{t+l}) ]
A^π(s, a) = R^π(s, a) − V^π(s)
in the above formulas: s_t and a_t are the state and the action at the t-th moment, the subscript t denoting that moment within the training round; R^π(s_t, a_t) is the cumulative reward function obtained when the policy trajectory π is followed from the t-th moment of a training round; r(s_t, a_t) is the reward fed back by the environment when action a_t is taken in state s_t at moment t; the summation index l indicates that the rewards are accumulated from the t-th moment up to the (t+l)-th moment; the expectation E_{a∼π} denotes sampling the actions from the policy trajectory π and acting along that trajectory throughout; V^π(s_t) is the value function, an estimate of the achievable cumulative reward in state s_t, with r(s_t) the estimated reward given by the environment in state s_t; A^π(s, a) is the difference function, i.e. the gap between the actual reward and the estimated reward, used to evaluate how good the current action is; suppose another policy trajectory π̃ is taken; the cumulative reward of the new policy trajectory π̃ can then be expressed as
η(π̃) = η(π) + E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ]
where η(π) is the cumulative reward the agent obtains in a training round when policy trajectory π is taken, so the cumulative reward value of the new policy trajectory π̃ can be represented by the reward of the original policy trajectory π plus the cumulative difference-function value; hence, as long as
E_{s,a∼π̃} [ Σ_t γ^t · A^π(s_t, a_t) ] ≥ 0
is guaranteed, the strategy after each update is better than the original strategy and finally converges to the optimal solution; by the definition of the difference function, A^π(s, a) = R^π(s, a) − V^π(s), the policy trajectory at final convergence has the largest cumulative reward function value and no policy with a larger cumulative reward value can be found, so that trajectory is the optimal solution; according to the above, the optimized objective function can be converted into maximizing the cumulative reward value within a round, i.e. max Σ_t r_t; the specific reward values are set as follows (the reward expression and the gas-cost and grid-cost expressions are given as images in the original):
d = (P_s − P_l, Q_s − Q_l)
where c_gas and c_grid are, respectively, the gas cost and the grid trading cost (the latter being a profit when electricity is sold), ρ_gas and ρ_grid are the unit prices of gas and of grid transactions, η is the energy conversion efficiency, the superscript t_d indicates quantities within the time period t_d, and the subscripts gt and gb denote the gas turbine and the gas boiler; d denotes the power mismatch value; the final reward is composed of three parts: 1) the gas and grid trading costs: by maximizing the cumulative reward value the intelligent agent is encouraged to learn how to minimize the operating cost; 2) the power mismatch value: maximizing the cumulative reward encourages the agent to learn how to minimize the supply-demand imbalance; 3) s_tst, the final state of the heat storage tank: under normal operating conditions the operator wants the final stored heat of the tank not to change greatly over a period of time, so that it can be used in the next stage, and minimizing this term ensures that the heat storage tank finally stabilizes near the ideal state.
4. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: in step S1 the constraint part of the optimization method is strictly transformed and a proof is given, the specific method being as follows:
1) supply-demand balance constraints: the supply values of electricity and heat should match the demand values:
p_gt + p_wind + p_grid = p_l
q_gt + q_gb + q_tst = q_l
q_gt = α·p_gt
where α is the electricity-to-heat conversion efficiency of the gas turbine; according to the reward function in the Markov chain model, the optimization goal is converted into maximizing the accumulated reward, and the supply-demand balance constraint is included as one term of the reward function, ensuring that the finally generated control strategy meets the supply-demand balance requirement;
2) plant operating constraints (given as images in the original, of the general box form p^min ≤ p_t ≤ p^max and q^min ≤ q_t ≤ q^max for each device): the superscripts min and max denote the minimum and maximum operating values, i.e. each device should operate within its limits; according to the state transition probability in the Markov chain model, if the current action would cause a transition to a state exceeding the operating limits, the transition probability is 0, i.e. a transition to a state beyond the operating limits is impossible;
3) energy storage device constraints (given as images in the original, bounding the stored heat and the charge/discharge efficiencies): Q denotes the heat storage value of the heat storage tank, the superscript tst denotes the heat storage tank, the subscript t denotes the time, and the subscripts min and max denote the minimum and maximum stored heat; η_char and η_dis denote the charging efficiency and the discharging efficiency of the heat storage tank, the subscript char standing for charging, dis for discharging, and the superscripts min and max for the minimum and maximum charge/discharge efficiency; the limit on the amount of stored heat is converted into the state transition probability.
5. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that step S2 further includes:
S21: before a round starts, the operating condition of the cogeneration system is randomly generated within the feasible domain derived from the real operating data, including the heat load, the electric load, the wind power generation, the initial value of the heat storage tank and the energy prices; a neural network in which the learning experience is stored, called the action network, is built inside the intelligent agent and is used to generate the control strategy;
S22: a horizon of 300 steps is set for each training round, i.e. the intelligent agent is required to complete the control objective within 300 steps; over these 300 steps the agent continuously interacts with the environment, obtains the corresponding reward values and stores them for training;
S23: according to the transformation of the optimization objective function, as long as the selected actions maximize the cumulative reward value within the round, i.e. max Σ_t r_t, the optimality of the finally obtained strategy is guaranteed; let θ be the parameter vector of the action network; the cumulative reward function value over the 300 steps is computed from the data obtained in one round, and the parameters are updated along the gradient direction of the cumulative reward function (the gradient expression is given as an image in the original), updating the neural network parameters θ so that the cumulative difference-function value obtained in the next round is greater than 0; directly updating the parameters, however, may make the update amplitude too large and cause convergence difficulties, so a clipping technique is applied when updating the action-network parameters; the policy update ratio (given as an image in the original, in the standard form of the ratio of the new policy to the policy before the update) is
z_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
where π_θ(a_t | s_t) denotes the policy generated with action-network parameters θ and (a_t | s_t) denotes selecting action a_t in state s_t; the update is restricted to
z_t(θ) ∈ (1 − ε, 1 + ε)
where ε is the clipping coefficient, so that each update of the action-network parameters is limited to a certain range in order to achieve better convergence performance.
6. The deep reinforcement learning-based combined heat and power generation system economic dispatching method according to claim 1, characterized in that: the in-round intelligent agent described in step S2 generates the control strategy according to its current internal neural-network parameters, and a distributed acquisition architecture is used while interacting with the operating environment, that is, several intelligent agents are set to explore the same environment simultaneously, so that each agent collects different data and the parameters are finally updated uniformly.
CN201911314830.0A 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system Active CN111144728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911314830.0A CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911314830.0A CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Publications (2)

Publication Number Publication Date
CN111144728A true CN111144728A (en) 2020-05-12
CN111144728B CN111144728B (en) 2023-08-04

Family

ID=70518894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911314830.0A Active CN111144728B (en) 2019-12-18 2019-12-18 Deep reinforcement learning-based economic dispatching method for cogeneration system

Country Status (1)

Country Link
CN (1) CN111144728B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112561154A (en) * 2020-12-11 2021-03-26 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating integrated energy system
CN112821465A (en) * 2021-01-08 2021-05-18 合肥工业大学 Industrial microgrid load optimization scheduling method and system containing cogeneration
CN113316239A (en) * 2021-05-10 2021-08-27 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN114169627A (en) * 2021-12-14 2022-03-11 湖南工商大学 Deep reinforcement learning distributed photovoltaic power generation excitation method
CN115840794A (en) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS (geographic information System) and RL (Link State) models
CN116738923A (en) * 2023-04-04 2023-09-12 暨南大学 Chip layout optimization method based on reinforcement learning with constraint


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073301A (en) * 2017-08-02 2019-07-30 强力物联网投资组合2016有限公司 The detection method and system under data collection environment in industrial Internet of Things with large data sets
CN109190785A (en) * 2018-07-06 2019-01-11 东南大学 A kind of electro thermal coupling integrated energy system running optimizatin method
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN109472050A (en) * 2018-09-30 2019-03-15 东南大学 Co-generation unit incorporation time scale dispatching method based on thermal inertia
CN110365062A (en) * 2019-04-19 2019-10-22 国网辽宁省电力有限公司经济技术研究院 A kind of multifunctional system control method for coordinating based on Markov model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUYANG ZHOU et al.: "Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach" *
HU ZIJIAN: "Research on intelligent regulation and control technology of integrated energy systems driven by machine learning" *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084680B (en) * 2020-09-02 2023-12-26 沈阳工程学院 Energy internet optimization strategy method based on DQN algorithm
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
CN112186743A (en) * 2020-09-16 2021-01-05 北京交通大学 Dynamic power system economic dispatching method based on deep reinforcement learning
CN112290536A (en) * 2020-09-23 2021-01-29 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112290536B (en) * 2020-09-23 2022-12-23 电子科技大学 Online scheduling method of electricity-heat comprehensive energy system based on near-end strategy optimization
CN112561154A (en) * 2020-12-11 2021-03-26 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating integrated energy system
CN112561154B (en) * 2020-12-11 2024-02-02 中国电力科学研究院有限公司 Energy optimization scheduling control method and device for electric heating comprehensive energy system
CN112821465A (en) * 2021-01-08 2021-05-18 合肥工业大学 Industrial microgrid load optimization scheduling method and system containing cogeneration
CN113316239A (en) * 2021-05-10 2021-08-27 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN114169627A (en) * 2021-12-14 2022-03-11 湖南工商大学 Deep reinforcement learning distributed photovoltaic power generation excitation method
CN115840794B (en) * 2023-02-14 2023-05-02 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS and RL models
CN115840794A (en) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 Photovoltaic system planning method based on GIS (geographic information System) and RL (Link State) models
CN116738923A (en) * 2023-04-04 2023-09-12 暨南大学 Chip layout optimization method based on reinforcement learning with constraint
CN116738923B (en) * 2023-04-04 2024-04-05 暨南大学 Chip layout optimization method based on reinforcement learning with constraint

Also Published As

Publication number Publication date
CN111144728B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN111144728A (en) Deep reinforcement learning-based economic scheduling method for cogeneration system
CN106849190B (en) A kind of microgrid real-time scheduling method of providing multiple forms of energy to complement each other based on Rollout algorithm
CN110417006A (en) Consider the integrated energy system Multiple Time Scales energy dispatching method of multipotency collaboration optimization
CN109345012B (en) Park energy Internet operation optimization method based on comprehensive evaluation indexes
Chen et al. Intelligent energy scheduling in renewable integrated microgrid with bidirectional electricity-to-hydrogen conversion
CN109636056A (en) A kind of multiple-energy-source microgrid decentralization Optimization Scheduling based on multi-agent Technology
Goh et al. An assessment of multistage reward function design for deep reinforcement learning-based microgrid energy management
CN109325621B (en) Park energy internet two-stage optimal scheduling control method
Semero et al. Optimal energy management strategy in microgrids with mixed energy resources and energy storage system
CN114331059A (en) Electricity-hydrogen complementary park multi-building energy supply system and coordinated scheduling method thereof
Li et al. Tri-stage optimal scheduling for an islanded microgrid based on a quantum adaptive sparrow search algorithm
CN112311017A (en) Optimal collaborative scheduling method for virtual power plant and main network
Zhang et al. Deep reinforcement learning based Bi-layer optimal scheduling for microgrids considering flexible load control
Zhu et al. Optimal scheduling of a wind energy dominated distribution network via a deep reinforcement learning approach
Belkhier et al. Novel design and adaptive coordinated energy management of hybrid fuel‐cells/tidal/wind/PV array energy systems with battery storage for microgrids
Xu et al. Low-carbon economic dispatch of integrated energy system considering the uncertainty of energy efficiency
An et al. Real-time optimal operation control of micro energy grid coupling with electricity-thermal-gas considering prosumer characteristics
Ji et al. Operating mechanism for profit improvement of a smart microgrid based on dynamic demand response
Huy et al. Real-time power scheduling for an isolated microgrid with renewable energy and energy storage system via a supervised-learning-based strategy
CN114943448A (en) Method and system for constructing micro-grid optimized scheduling model
Koochaki et al. Optimal design of solar-wind hybrid system using teaching-learning based optimization applied in charging station for electric vehicles
Li et al. Multiobjective Optimization Model considering Demand Response and Uncertainty of Generation Side of Microgrid
Gbadega et al. Optimal control strategy for energy management of PV-Diesel-Battery hybrid power system of a stand-alone micro-grid
Yang et al. Multi-time scale collaborative scheduling strategy of distributed energy systems for energy Internet
Piao et al. Coordinated optimal dispatch of composite energy storage microgrid based on double deep Q-network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200826

Address after: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu Province, 210096

Applicant after: SOUTHEAST University

Applicant after: YANGZHOU POWER SUPPLY BRANCH OF STATE GRID JIANGSU ELECTRIC POWER Co.,Ltd.

Address before: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu Province, 210096

Applicant before: SOUTHEAST University

GR01 Patent grant