CN111310384B - Wind field cooperative control method, terminal and computer readable storage medium - Google Patents

Wind field cooperative control method, terminal and computer readable storage medium

Info

Publication number
CN111310384B
CN111310384B
Authority
CN
China
Prior art keywords
network
strategy
target
wind
networks
Prior art date
Legal status
Active
Application number
CN202010056867.4A
Other languages
Chinese (zh)
Other versions
CN111310384A (en)
Inventor
赵俊华
赵焕
梁高琪
Current Assignee
Chinese University of Hong Kong Shenzhen
Original Assignee
Chinese University of Hong Kong Shenzhen
Priority date
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen
Priority application: CN202010056867.4A
Publication of CN111310384A
Application granted
Publication of CN111310384B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention provides a wind farm cooperative control method, a terminal and a computer-readable storage medium. The wind farm cooperative control method comprises the following steps: constructing a wind farm model; training in the wind farm model with an ensemble deep deterministic policy gradient method to obtain K pre-trained policy networks and K pre-trained Q networks, where K is an integer greater than 1; taking the pre-trained policy networks and Q networks as initial networks and learning the actual optimal policy in the real environment with the ensemble deep deterministic policy gradient method; and cooperatively controlling the wind farm with the optimal policy. Compared with the traditional approach of obtaining the optimal control policy purely by trial and error, the pre-training stage effectively reduces the average learning cost of the subsequent real-environment learning, and the policy-ensemble stage reduces the randomness of the policy during learning, thereby reducing the randomness of the learning cost.

Description

Wind field cooperative control method, terminal and computer readable storage medium
Technical Field
The application relates to the technical field of wind power generation, and in particular to a wind farm cooperative control method, a terminal and a computer-readable storage medium.
Background
With growing attention to environmental problems, promoting the development of clean energy has become a main goal of current energy policy. Wind energy is an important component of clean energy, and one of the major problems it faces is how to reduce wake effects and maximize the total output of a wind farm through coordinated control of all turbines in the farm.
Existing wind farm cooperative control methods that use reinforcement learning without constructing a wake model obtain the optimal control strategy by trial and error; however, the learning process is random and the learning cost is high.
The prior art therefore still needs to be improved.
Disclosure of Invention
The application aims to solve the problems of randomness and high learning cost in the learning process of existing wind farm cooperative control methods.
To solve this technical problem, the invention discloses a wind farm cooperative control method comprising the following steps:
constructing a wind farm model;
training in the wind farm model by an ensemble deep deterministic policy gradient method to obtain K pre-trained policy networks and K pre-trained Q networks, wherein K is an integer greater than 1;
taking the pre-trained policy networks and Q networks as initial networks, and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method;
and cooperatively controlling the wind farm with the optimal policy.
Further, the wind farm model comprises a wind turbine output model and a wake model.
The wind turbine output model gives the power output P of a turbine as a function of the deflection (yaw) angle γ, the free wind speed V, the rotor swept area A and the axial induction factor a. The axial induction factor is the control variable and is expressed as a = (V − V′)/V, where V′ is the wind speed after the free wind has passed through the turbine.
The wake model gives the ratio of wind-speed reduction behind a turbine as a function of the turbine position x, the rotor diameter D and the roughness coefficient k.
Further, the training method of a single policy network and a single Q network comprises the following steps:
initializing a policy network μ(s|θ^μ) and a Q network Q(s, a|θ^Q), where θ^Q denotes the Q-network parameters, θ^μ denotes the policy-network parameters and s denotes a state;
initializing a target Q network Q′ and a target policy network μ′ with the same weights;
initializing a replay buffer;
training the policy network and the Q network in the wind farm model by the deep deterministic policy gradient method.
Further, the step of training the policy network and the Q network in the wind farm model by the deep deterministic policy gradient method includes:
receiving a random model state s_t;
obtaining a behaviour from the policy network: a_t = μ(s_t|θ^μ) + N_t, where N_t is Gaussian noise;
executing the behaviour a_t in the wind farm model and obtaining a reward r_t;
storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling n transitions from the replay buffer to form a batch, n being an integer greater than 1;
updating the policy network, the Q network, the target policy network and the target Q network;
iterating the above steps until convergence to complete the pre-training of the policy network and the Q network.
Further, the step of updating the policy network, the Q network, the target policy network and the target Q network includes:
updating the Q network by minimizing the loss L = (1/n) Σ_i (y_i − Q(s_i, a_i|θ^Q))², where i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′) is the target value of the i-th sample with discount factor γ, s_i is the model state of the i-th sample and a_i is the behaviour executed in the model for the i-th sample;
updating the policy network with the policy gradient ∇_θ^μ J ≈ (1/n) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, where J denotes the cumulative discounted reward;
updating the target policy network and the target Q network by soft update: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, where τ is the update parameter.
Further, the step of taking the pre-trained policy networks and Q networks as initial networks and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method comprises the following steps:
initializing K actual policy networks from the K pre-trained policy networks;
selecting the last pre-trained Q network to initialize the actual Q network, and initializing K target policy networks and one target Q network with the same weights;
initializing a replay buffer;
learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method.
Further, the step of learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method includes:
receiving a random real-environment state s_t;
obtaining a behaviour a_t by integrating the outputs of the K policy networks and adding Gaussian noise N_t;
the agent executing the behaviour a_t and obtaining a reward r_t;
storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling N transitions from the replay buffer to form a batch, N being an integer greater than 1;
updating the policy networks, the Q network, the target policy networks and the target Q network;
iterating the above steps until convergence to complete the actual training of the policy networks and the Q network.
Further, the step of updating the policy networks, the Q network, the target policy networks and the target Q network includes:
updating the Q network by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))², where i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i is the target value of the i-th sample, s_i is the real-environment state of the i-th sample and a_i is the behaviour executed in the real environment for the i-th sample;
then updating each policy network separately with its policy gradient ∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, where J denotes the cumulative discounted reward;
finally updating the target policy networks and the target Q network by soft update: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, where τ is the update parameter.
A terminal comprising a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to perform a wind farm cooperative control method as described above.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a wind farm cooperative control method as described above.
Compared with the prior art, the application has the following beneficial effects: pre-trained policy networks and Q networks are obtained by training in a wind farm model with the ensemble deep deterministic policy gradient method, and these pre-trained networks are then used as initial networks for learning the actual optimal policy in the real environment with the same method. Compared with the traditional approach of obtaining the optimal control policy purely by trial and error, the pre-training stage effectively reduces the average learning cost of the real-environment learning, and the policy-ensemble stage reduces the randomness of the policy during learning, thereby reducing the randomness of the learning cost.
Drawings
FIG. 1 is a schematic diagram of the training method of the present invention.
Fig. 2 (a), (b), (c) and (d) are graphs comparing training results of the present invention with those of the conventional method in four scenarios shown in table 2.
Fig. 3 is a schematic diagram of a terminal structure provided by the present invention.
Detailed Description
In order to make the objects, features and advantages of the present application more comprehensible, the technical solutions in the embodiments of the present application are described below in conjunction with the accompanying drawings. The described embodiments are only some, not all, embodiments of the application; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
The embodiment of the application provides a wind farm cooperative control method. The principle is shown in Fig. 1: networks are first pre-trained in a model, then copied, and training continues in the real environment, where the optimal policy is learned. The method mainly comprises the following steps S1-S3:
S1, constructing a wind farm model, for example a low-fidelity wind farm model.
Specifically, the wind farm model may include a wind turbine output model and a wake model. The wind turbine output model gives the power output P of a turbine as a function of the deflection (yaw) angle γ, the free wind speed V, the rotor swept area A and the axial induction factor a, which is the control variable.
The control variable may be expressed as a = (V − V′)/V, where V′ is the wind speed after the free wind has passed through the turbine.
The wake model gives the ratio of wind-speed reduction behind a turbine as a function of the turbine position x, the rotor diameter D and the roughness coefficient k.
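The analytical expressions of the two models appear only as formula images in the original filing and are not reproduced in this text. The sketch below therefore illustrates the kind of low-fidelity model the description refers to, using the classical actuator-disc power relation and the Jensen/PARK wake-deficit formula as stand-ins; the air density rho, the cosine yaw-loss factor and the root-sum-square wake combination are assumptions of this sketch, not equations taken from the patent.

```python
import numpy as np

def turbine_power(a, gamma, v, area, rho=1.225):
    """Actuator-disc power of one turbine (illustrative stand-in).

    a     : axial induction factor (the control variable)
    gamma : yaw/deflection angle in radians
    v     : effective wind speed at the rotor [m/s]
    area  : rotor swept area [m^2]
    """
    # Classic actuator-disc relation with a simple cosine yaw loss.
    return 2.0 * rho * area * a * (1.0 - a) ** 2 * v ** 3 * np.cos(gamma)

def park_deficit(a, x, diameter, k=0.075):
    """Jensen/PARK fractional wind-speed deficit at downstream distance x.

    a        : axial induction factor of the upstream turbine
    x        : downstream distance from the upstream rotor [m]
    diameter : rotor diameter [m]
    k        : wake-expansion (roughness) coefficient
    """
    return 2.0 * a / (1.0 + 2.0 * k * x / diameter) ** 2

def waked_speed(v_free, deficits):
    """Combine overlapping wakes by root-sum-square of the individual deficits."""
    total = np.sqrt(np.sum(np.square(deficits))) if len(deficits) else 0.0
    return v_free * (1.0 - total)
```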
The analytical model uses an actuator (brake) disc model and the PARK wake model. The parameters of the wind farm simulator and the PARK model are listed in Table 1.
Four wind farm topologies were tested. With the turbine diameter denoted D, wind tunnel structures are used, 5D and 7D are chosen as the turbine-spacing parameters, and 5 and 10 as the numbers of turbines. The scenarios are numbered 1 to 4 and their parameters are given in Table 2.
Wind speeds of 8 m/s to 16 m/s are randomly generated according to a Weibull distribution, and the wind angle is assumed to be 0.
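As one possible realisation of the random inflow described above, the snippet below draws free wind speeds from a Weibull distribution and keeps only samples inside the 8-16 m/s band; the shape and scale parameters are illustrative assumptions, since the patent does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_wind_speed(shape=2.0, scale=12.0, low=8.0, high=16.0):
    """Rejection-sample a free wind speed [m/s] from a Weibull distribution,
    truncated to the 8-16 m/s band used in the experiments."""
    while True:
        v = scale * rng.weibull(shape)   # numpy's weibull draw has unit scale
        if low <= v <= high:
            return v
```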
S2, training in the wind farm model by the ensemble deep deterministic policy gradient method to obtain K pre-trained policy networks and K pre-trained Q networks, wherein K is an integer greater than 1.
Specifically, this comprises the following steps S21-S24:
S21, initializing a policy network μ(s|θ^μ) and a Q network Q(s, a|θ^Q), where θ^Q denotes the Q-network parameters, θ^μ denotes the policy-network parameters and s denotes the state, i.e. the wind speeds at the turbines of the wind farm, and at the same time initializing a reward function r. Regarding the deep reinforcement learning setup: the policy network is a six-layer fully connected neural network and the Q network is a seven-layer fully connected neural network. Both networks use a linear activation function in the last hidden layer and rectified linear units in the remaining layers. The other parameters are listed in Table 1.
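A minimal sketch of the two networks described in step S21, assuming PyTorch, is given below: a six-layer fully connected policy (actor) network and a seven-layer fully connected Q (critic) network, with rectified linear units in the hidden layers and a linear last hidden layer. The layer widths and the tanh squashing of the action are assumptions made only to give a concrete example.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Fully connected stack; ReLU between layers, the last hidden layer left
    linear, and no activation on the output, as described in the text."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        is_last_hidden = (i == len(sizes) - 3)
        is_output = (i == len(sizes) - 2)
        if not is_output and not is_last_hidden:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class PolicyNet(nn.Module):
    """Six-layer policy (actor) network mu(s | theta_mu); widths are illustrative."""
    def __init__(self, state_dim, action_dim, width=256):
        super().__init__()
        self.net = mlp([state_dim] + [width] * 5 + [action_dim])
    def forward(self, s):
        return torch.tanh(self.net(s))   # bounded control output (assumption)

class QNet(nn.Module):
    """Seven-layer Q (critic) network Q(s, a | theta_Q); widths are illustrative."""
    def __init__(self, state_dim, action_dim, width=256):
        super().__init__()
        self.net = mlp([state_dim + action_dim] + [width] * 6 + [1])
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```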
The reward function is defined as the total output power minus the unsafe-behaviour loss. The unsafe-behaviour loss of a turbine is calculated by multiplying the distance of its command outside the safety range by a coefficient.
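The reward of step S21 could, for example, be computed as in the sketch below: total output power minus an unsafe-behaviour loss obtained by multiplying the distance of each turbine's command outside its safe range by a coefficient. The particular safe range and coefficient value are assumptions, not values from the patent.

```python
import numpy as np

def reward(powers, actions, safe_low=0.0, safe_high=1.0 / 3.0, coeff=10.0):
    """Reward = total output power minus unsafe-behaviour loss.

    powers  : array of per-turbine power outputs
    actions : array of per-turbine axial-induction commands
    The safe interval [safe_low, safe_high] and the coefficient are
    illustrative assumptions.
    """
    total_power = float(np.sum(powers))
    outside = np.maximum(actions - safe_high, 0.0) + np.maximum(safe_low - actions, 0.0)
    penalty = coeff * float(np.sum(outside))
    return total_power - penalty
```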
S22, initializing a target Q network Q′ and a target policy network μ′ with the same weights.
S23, initializing a replay buffer (RB).
S24, training the policy network and the Q network in the wind farm model by the deep deterministic policy gradient method.
The training process includes steps S241-S246:
S241, receiving a model state s_t, for example a wind speed of 11 m/s.
S242, obtaining a behaviour from the policy network: a_t = μ(s_t|θ^μ) + N_t, where N_t is Gaussian noise.
S243, executing the behaviour a_t in the wind farm model and obtaining a reward r_t.
S244, storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling n transitions from the replay buffer to form a batch, n being an integer greater than 1.
S245, updating the policy network, the Q network, the target policy network and the target Q network. The update comprises the following steps S2451-S2453 (a sketch of one such update appears after step S246 below):
S2451, updating the Q network by minimizing the loss L = (1/n) Σ_i (y_i − Q(s_i, a_i|θ^Q))², where i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′) is the target value of the i-th sample with discount factor γ, s_i is the model state of the i-th sample and a_i is the behaviour executed in the model for the i-th sample.
S2452, updating the policy network with the policy gradient ∇_θ^μ J ≈ (1/n) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, where J denotes the cumulative discounted reward.
S2453, updating the target policy network and the target Q network by soft update: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, where τ is the update parameter.
S246, iterating steps S241-S245 until convergence to complete the pre-training of the policy network and the Q network. With the above method, 10 policy networks and 10 Q networks are trained.
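Steps S241-S246 follow the standard deep deterministic policy gradient recipe. The sketch below shows one possible implementation of a single update (step S245): the critic is updated by minimizing the mean-squared Bellman error, the actor by the deterministic policy gradient, and the target networks by a soft update. The discount factor, soft-update rate and exploration-noise scale are common DDPG defaults assumed here, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update from a sampled batch of transitions (s, a, r, s2)."""
    s, a, r, s2 = batch                       # tensors; r assumed shaped [n, 1]

    # Critic: minimise (y_i - Q(s_i, a_i))^2 with y_i = r_i + gamma * Q'(s2, mu'(s2)).
    with torch.no_grad():
        y = r + gamma * critic_t(s2, actor_t(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximise Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def act(actor, state, noise_std=0.1):
    """Behaviour a_t = mu(s_t) + Gaussian exploration noise (steps S242/S243)."""
    with torch.no_grad():
        a = actor(state)
        return a + noise_std * torch.randn_like(a)
```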
S3, taking the pre-trained policy networks and Q networks as initial networks and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method.
This phase includes steps S31-S34:
S31, initializing 10 actual policy networks from the 10 pre-trained policy networks.
S32, selecting the last pre-trained Q network to initialize the actual Q network, and initializing 10 target policy networks and 1 target Q network with the same weights.
S33, initializing a replay buffer (RB).
S34, learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method.
The training process includes steps S341-S346 (a sketch of the ensemble behaviour selection appears after step S346 below): S341, receiving a random real-environment state s_t.
S342, obtaining a behaviour a_t by integrating the outputs of the K policy networks and adding Gaussian noise N_t.
S343, the agent executing the behaviour a_t and obtaining a reward r_t.
S344, storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling N transitions from the replay buffer to form a batch, N being an integer greater than 1.
S345, updating the policy networks, the Q network, the target policy networks and the target Q network; the update comprises steps S3451-S3453:
S3451, updating the Q network by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))², where i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i is the target value of the i-th sample, s_i is the real-environment state of the i-th sample and a_i is the behaviour executed in the real environment for the i-th sample.
S3452, updating each policy network separately with its policy gradient ∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, where J denotes the cumulative discounted reward.
S3453, finally updating the target policy networks and the target Q network by soft update: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, where τ is the update parameter.
S346, iterating the above steps until convergence completes the actual training of the policy networks and the Q network; the actual optimal policy is thus learned in the real environment by the ensemble deep deterministic policy gradient method. The final training results are shown in Fig. 2. The results show that in most cases the power output of the turbines in the farm under this method (ABE) is greater than under conventional methods, including the optimal greedy algorithm and the Park method.
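In the real-environment phase, the pre-trained networks are copied to initialize the actual networks (steps S31-S32) and the behaviour is obtained by integrating the K policy networks (step S342). The sketch below shows one way to perform the copy and to form the ensemble behaviour; treating the integration as a simple mean of the K policy outputs is an assumption of this sketch, since the exact integration formula appears only as a formula image in the original filing.

```python
import copy
import torch

def init_actual_networks(pretrained_actors, pretrained_critics):
    """Copy the K pre-trained policy networks and the last pre-trained Q network
    to initialise the actual networks and their targets (steps S31-S32)."""
    actors = [copy.deepcopy(a) for a in pretrained_actors]
    critic = copy.deepcopy(pretrained_critics[-1])
    actor_targets = [copy.deepcopy(a) for a in actors]
    critic_target = copy.deepcopy(critic)
    return actors, critic, actor_targets, critic_target

def ensemble_act(actors, state, noise_std=0.1):
    """Behaviour obtained by integrating the K policy networks
    (here: their mean) plus Gaussian exploration noise (step S342)."""
    with torch.no_grad():
        a = torch.stack([actor(state) for actor in actors], dim=0).mean(dim=0)
        return a + noise_std * torch.randn_like(a)
```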
Pre-trained policy networks and Q networks are obtained by training in a wind farm model with the ensemble deep deterministic policy gradient method, and these pre-trained networks are then used as initial networks for learning the actual optimal policy in the real environment with the same method. Compared with the traditional approach of obtaining the optimal control policy purely by trial and error, the pre-training stage effectively reduces the average learning cost of the real-environment learning, and the policy-ensemble stage reduces the randomness of the policy during learning, thereby reducing the randomness of the learning cost.
TABLE 3 comparison of learning costs for different scenarios
The invention also provides a terminal, as shown in Fig. 3, comprising: a processor 10, a memory 20, a communications interface 30 and a bus 40; wherein,
The processor 10, memory 20, and communication interface 30 communicate with each other via the bus 40.
The communication interface 30 is used for information transfer between communication devices of the mobile terminal.
The processor 10 is configured to invoke the computer program in the memory 20 to perform the methods provided in the above method embodiments, for example including: constructing a wind farm model when the system is started; training in the wind farm model by the ensemble deep deterministic policy gradient method to obtain K pre-trained policy networks and K pre-trained Q networks, wherein K is an integer greater than 1; taking the pre-trained policy networks and Q networks as initial networks and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method; and cooperatively controlling the wind farm with the optimal policy.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements one or a combination of the steps of the wind farm cooperative control method. The storage medium may be a read-only memory, a magnetic disk, an optical disk or the like.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application; those skilled in the art may make variations, modifications, substitutions and alterations to the above embodiments within the scope of the application, and such changes are intended to be included within the scope of the application.

Claims (8)

1. A wind farm cooperative control method, characterized by comprising the following steps:
constructing a wind farm model;
training in the wind farm model by an ensemble deep deterministic policy gradient method to obtain K pre-trained policy networks and K pre-trained Q networks, wherein K is an integer greater than 1; the training method of a single policy network and a single Q network comprises the following steps:
initializing a policy network μ(s|θ^μ) and a Q network Q(s, a|θ^Q), wherein θ^Q denotes the Q-network parameters, θ^μ denotes the policy-network parameters and s denotes a state;
initializing a target Q network Q′ and a target policy network μ′ with the same weights;
initializing a replay buffer;
training the policy network and the Q network in the wind farm model by the deep deterministic policy gradient method; wherein,
step A, receiving a random model state s_t;
step B, obtaining a behaviour from the policy network: a_t = μ(s_t|θ^μ) + N_t, wherein N_t is Gaussian noise;
step C, executing the behaviour a_t in the wind farm model and obtaining a reward r_t;
step D, storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling n transitions from the replay buffer to form a batch, n being an integer greater than 1;
step E, updating the policy network, the Q network, the target policy network and the target Q network;
step F, iterating steps A-E until convergence to complete the pre-training of the policy network and the Q network;
taking the pre-trained policy networks and Q networks as initial networks, and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method;
and cooperatively controlling the wind farm with the optimal policy.
2. The wind farm cooperative control method of claim 1, wherein the wind farm model comprises a wind turbine output model and a wake model;
the wind turbine output model gives the power output P of a turbine as a function of the deflection (yaw) angle γ, the free wind speed V, the rotor swept area A and the axial induction factor a, wherein a is the control variable expressed as a = (V − V′)/V, V′ being the wind speed after the free wind has passed through the turbine;
the wake model gives the ratio of wind-speed reduction behind a turbine as a function of the turbine position x, the rotor diameter D and the roughness coefficient k.
3. The wind farm cooperative control method according to claim 1, wherein the step of updating the policy network, the Q network, the target policy network and the target Q network comprises:
updating the Q network by minimizing the loss L = (1/n) Σ_i (y_i − Q(s_i, a_i|θ^Q))², wherein i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i is the target value of the i-th sample, s_i is the model state of the i-th sample and a_i is the behaviour executed in the model for the i-th sample;
updating the policy network with the policy gradient ∇_θ^μ J ≈ (1/n) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, wherein J denotes the cumulative discounted reward;
updating the target policy network and the target Q network: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, wherein τ is the update parameter.
4. The wind farm cooperative control method according to claim 1, wherein the step of taking the pre-trained policy networks and Q networks as initial networks and learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method comprises:
initializing K actual policy networks from the K pre-trained policy networks;
selecting the last pre-trained Q network to initialize the actual Q network, and initializing K target policy networks and one target Q network with the same weights;
initializing a replay buffer;
learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method.
5. The wind farm cooperative control method according to claim 4, wherein the step of learning the actual optimal policy in the real environment by the ensemble deep deterministic policy gradient method comprises:
receiving a random real-environment state s_t;
obtaining a behaviour a_t by integrating the outputs of the K policy networks and adding Gaussian noise N_t;
the agent executing the behaviour a_t and obtaining a reward r_t;
storing the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer and randomly sampling N transitions from the replay buffer to form a batch, N being an integer greater than 1;
updating the policy networks, the Q network, the target policy networks and the target Q network;
iterating the above steps until convergence to complete the actual training of the policy networks and the Q network.
6. The wind farm cooperative control method according to claim 4, wherein the step of updating the policy networks, the Q network, the target policy networks and the target Q network comprises:
updating the Q network by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))², wherein i denotes the i-th sample in the mini-batch, r_i is the reward of the i-th sample, y_i is the target value of the i-th sample, s_i is the real-environment state of the i-th sample and a_i is the behaviour executed in the real environment for the i-th sample;
then updating each policy network separately with its policy gradient ∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s|θ^μ)|_{s=s_i}, wherein J denotes the cumulative discounted reward;
finally updating the target policy networks and the target Q network: θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′ and θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′, wherein τ is the update parameter.
7. A terminal comprising a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to perform the wind farm cooperative control method according to any of claims 1 to 6.
8. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the wind farm cooperative control method according to any of claims 1 to 6.
CN202010056867.4A 2020-01-16 2020-01-16 Wind field cooperative control method, terminal and computer readable storage medium Active CN111310384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010056867.4A CN111310384B (en) 2020-01-16 2020-01-16 Wind field cooperative control method, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010056867.4A CN111310384B (en) 2020-01-16 2020-01-16 Wind field cooperative control method, terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111310384A CN111310384A (en) 2020-06-19
CN111310384B (en) 2024-05-21

Family

ID=71156346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010056867.4A Active CN111310384B (en) 2020-01-16 2020-01-16 Wind field cooperative control method, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111310384B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112460741B (en) * 2020-11-23 2021-11-26 香港中文大学(深圳) Control method of building heating, ventilation and air conditioning system
CN114017904B (en) * 2021-11-04 2023-01-20 广东电网有限责任公司 Operation control method and device for building HVAC system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105098840A (en) * 2015-09-16 2015-11-25 国电联合动力技术有限公司 Coordinated control method for power of wind power plant and system employing coordinated control method
CN107895960A (en) * 2017-11-01 2018-04-10 北京交通大学长三角研究院 City rail traffic ground type super capacitor energy storage system energy management method based on intensified learning
CN108321795A (en) * 2018-01-19 2018-07-24 上海交通大学 Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110027553A (en) * 2019-04-10 2019-07-19 湖南大学 A kind of anti-collision control method based on deeply study
CN110597058A (en) * 2019-08-28 2019-12-20 浙江工业大学 Three-degree-of-freedom autonomous underwater vehicle control method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jinkyoo Park, Kincho H. Law. Cooperative wind turbine control for maximizing wind farm power using sequential convex programming. Energy Conversion and Management, 2015, Vol. 101, pp. 295-316. *

Also Published As

Publication number Publication date
CN111310384A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant