CN117997152A - Bottom layer control method of modularized multi-level converter based on reinforcement learning - Google Patents

Bottom layer control method of modularized multi-level converter based on reinforcement learning

Info

Publication number
CN117997152A
Authority
CN
China
Prior art keywords
voltage
sub
bridge arm
level converter
constant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410397056.9A
Other languages
Chinese (zh)
Other versions
CN117997152B (en)
Inventor
马辉
秦赓
郝传统
郭志华
仓文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Delian Minghai New Energy Co ltd
Original Assignee
Shenzhen Delian Minghai New Energy Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Delian Minghai New Energy Co ltd filed Critical Shenzhen Delian Minghai New Energy Co ltd
Priority to CN202410397056.9A priority Critical patent/CN117997152B/en
Priority claimed from CN202410397056.9A external-priority patent/CN117997152B/en
Publication of CN117997152A publication Critical patent/CN117997152A/en
Application granted granted Critical
Publication of CN117997152B publication Critical patent/CN117997152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Dc-Dc Converters (AREA)

Abstract

The application relates to a bottom layer control method of a modularized multi-level converter based on reinforcement learning. A state observation and a reward of the environment are obtained and input into an agent, and the action output by the agent is used to perform bottom layer control on the modularized multi-level converter. The state observation and the reward of the environment change based on the action, and the iterative loop of bottom layer control of the multi-level converter is carried out based on the changed state observation and reward until a first preset condition is met, so as to obtain an optimal strategy of the agent; bottom layer control is then performed on the modularized multi-level converter according to the optimal strategy. The first preset condition is that the reward reaches its maximum value or the number of iterative loops reaches a preset value.

Description

Bottom layer control method of modularized multi-level converter based on reinforcement learning
Technical Field
The application relates to the technical field of converter control, in particular to a bottom layer control method of a modularized multi-level converter based on reinforcement learning.
Background
The modular multilevel converter (Modular Multilevel Converter, MMC) is a core device of high-voltage direct-current transmission (High Voltage Direct Current, HVDC) systems and, thanks to its high efficiency, low total harmonic distortion and easy modular expansion, is widely used in high-voltage direct-current transmission, large-scale energy storage systems and other fields. However, because of the huge number of sub-modules (SM) in an MMC, the number of controlled objects is large, so not only is the control algorithm complex, but the demand on the computational power of the controller is also extremely high. In addition, if the sub-modules are switched between the inserted and bypassed states too frequently, the switching loss increases greatly, and devices with higher loss are subject to higher thermal stress, which increases the device failure rate. Ensuring that the multilevel converter remains in a high-performance operating state is therefore of great interest.
Reinforcement learning (Reinforcement Learning, RL) is a machine learning method whose basic idea is that an agent obtains rewards from the environment through self-exploration and continuous interaction, thereby learning the strategy that maximizes the cumulative reward. The advantage of reinforcement learning is that it can handle situations with incomplete information and can explore and learn by itself, independent of predefined rules or models. However, reinforcement learning also faces challenges, such as the difficulty of determining appropriate reward/penalty functions for different environments and different task objectives, which has limited its use for improving the operating performance of the modular multilevel converter. At present there is no bottom layer control method for modular multilevel converters based on reinforcement learning.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method for controlling the bottom layer of a modular multilevel converter based on reinforcement learning, which can improve the operation performance of the modular multilevel converter.
In a first aspect, the present application provides a method for controlling an underlayer of a modular multilevel converter based on reinforcement learning, where the modular multilevel converter is disposed in an environment, the method comprising:
Step S1, acquiring state observation and rewards of the environment;
Step S2, inputting the state observation and the reward into an agent to obtain an action output by the agent, wherein the action is used for performing bottom layer control on the modularized multi-level converter;
Step S3, the state observation and the reward of the environment change based on the action, and the iterative loop of steps S1-S2 is carried out based on the changed state observation and reward until a first preset condition is met to obtain an optimal strategy of the agent, wherein the first preset condition is that: the reward reaches the maximum value or the number of iterative loops reaches a preset value; and utilizing the optimal strategy to perform bottom layer control on the modularized multi-level converter.
In one embodiment, the state observation of the environment includes: bridge arm current and bridge arm voltage reference values of the modularized multi-level converter, capacitance voltage of each sub-module in the modularized multi-level converter and operation state of each sub-module.
In one embodiment, the rewards are obtained by operating performance parameters and rewards functions of the modular multilevel converter; each submodule in the modularized multi-level converter comprises a plurality of power device switches; the operation performance parameters comprise tracking performance parameters of bridge arm reference voltages, balance performance parameters of capacitance voltages of all sub-modules and optimization performance parameters of switching losses of the power device; the reward function includes:
r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t);
wherein r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the capacitance voltages of the sub-modules, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; the first constant is the same in each loop iteration, the second constant is the same in each loop iteration, and the third constant is the same in each loop iteration.
In one embodiment, the tracking performance parameters of the bridge arm reference voltage include: and obtaining a first parameter value based on the bridge arm reference voltage, the bridge arm voltage and the rated direct current voltage.
In one embodiment, the balance performance parameters of the capacitance voltage of each sub-module include: and obtaining a second parameter value based on the maximum difference value among the capacitance voltages of the sub-modules, the rated voltage of the sub-modules and the target difference value.
In one embodiment, the optimization performance parameter of the switching loss of the power devices includes: a value obtained by comparing the on/off states of the power devices in iteration cycle t with those in iteration cycle t-1 (i.e. the number of changed states); t is a positive integer.
In one embodiment, the acts for underlying control of the modular multilevel converter include: the actions are used for controlling the running states of all sub-modules in the modularized multi-level converter.
In one embodiment, the acts for controlling the operation state of each sub-module in the modular multilevel converter include:
Acquiring binary numbers representing the running states of all sub-modules in the modularized multi-level converter based on the state observation and the rewards;
converting the binary number into a number expressed in a target radix (for example a decimal number), the converted number being characterized as the action.
In a second aspect, the present application provides a bottom layer control method of a modularized multi-level converter based on multi-round reinforcement learning. The method comprises multiple rounds of agent training, each round of agent training comprising the bottom layer control method of a modularized multi-level converter based on reinforcement learning described above; the same reward function is used within one round of agent training, and different reward functions are used in different rounds of agent training. Multiple rounds of agent training are carried out with the different reward functions to obtain a plurality of optimal strategies, and the optimal strategy satisfying a second preset condition is selected from the plurality of optimal strategies as a final strategy for bottom layer control of the modularized multi-level converter.
In one embodiment, the second preset condition is:
The difference between the bridge arm voltage reference and the bridge arm voltage is less than or equal to 5 percent of the rated voltage, or the total harmonic distortion (THD) of the MMC is less than or equal to 2 percent;
The difference between the maximum sub-module capacitance voltage and the minimum sub-module capacitance voltage in one bridge arm is less than or equal to 20 percent of the sub-module rated voltage;
The total switching loss of all power devices is the minimum among the optimal strategies.
In one of the embodiments of the present invention,
The reward function in each round of agent training includes r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t); r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the sub-module capacitance voltages, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; in the same round of agent training, ω1, ω2 and ω3 are unchanged once set; ω1, ω2 and ω3 are not exactly the same in different rounds of agent training.
In a third aspect, the present application provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.
In a fifth aspect, the application provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method described above.
Compared with the prior MMC bottom layer control technology, the bottom layer control method of the modularized multi-level converter based on reinforcement learning has the following advantages:
1) The principle is simple, and the complex MMC bottom layer control system does not need to be explicitly modeled;
2) The control method can be trained offline, so the computational requirement on the online controller is low;
3) The control method can realize the optimal control of the multi-level converter and improve the working performance of the multi-level converter.
4) In a further scheme, the reward function gives consideration to bridge arm voltage tracking performance, submodule capacitance voltage balance performance and switching loss, optimizes the switching loss, ensures that the voltage output of the whole bridge arm of the MMC tracks the bridge arm reference voltage and simultaneously maintains the balance of submodule capacitance voltage in the bridge arm.
According to the bottom layer control method of the modularized multi-level converter based on multi-round reinforcement learning, multi-round intelligent body training is carried out through different reward functions, a plurality of optimal strategies are obtained, and then the optimal strategy meeting the conditions is selected from the plurality of optimal strategies to serve as a final strategy, so that the MMC with excellent working performance is obtained.
Drawings
FIG. 1 is a schematic diagram of the topology of an MMC according to a first embodiment of the present invention;
FIG. 2 is a three-layer control diagram of an MMC control system according to a first embodiment of the present invention;
FIG. 3 is a block diagram of the underlying control layer of the MMC of the prior art;
FIG. 4 is a schematic diagram of an embodiment of an MMC underlying control system based on reinforcement learning;
FIG. 5 is a flowchart of a bottom layer control method of an MMC based on reinforcement learning according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a half-bridge sub-module according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a full-bridge sub-module according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for controlling the bottom layer of an MMC based on multi-round reinforcement learning in a second embodiment of the present invention;
FIG. 9 is a block diagram of a converter control device in one embodiment;
FIG. 10 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Example 1
The topology of the MMC is shown in fig. 1: each bridge arm is formed by connecting a plurality of SMs in series, and typically the SMs are the half-bridge sub-modules shown at the upper left of fig. 1 or the full-bridge sub-modules shown at the lower left. A half-bridge or full-bridge sub-module comprises a plurality of power devices and a capacitor with a substantially constant voltage; a specific sub-module is inserted or bypassed by switching its power devices, and the bridge arm voltage is synthesized from the sub-module capacitor voltages, achieving extremely low harmonic distortion. The MMC in fig. 1 has a three-phase structure; it can be understood that the MMC may also have a single-phase structure.
Because of the huge number of SMs in the MMC, to ensure the operating performance of the MMC the control system must have a large number of sampling input and control output channels, and must keep the SM capacitor voltages in each bridge arm balanced while ensuring that the bridge arm voltage output by the MMC tracks the bridge arm reference voltage. The MMC control system may include three control layers, as shown in fig. 2: the uppermost control layer controls the energy and power of the MMC, so as to maintain the balance of the energy stored in the SM capacitors of different bridge arms and enable the MMC to output the desired power or voltage; the middle control layer controls the currents of the MMC, tracking the reference signals of the MMC output current and of the circulating current between bridge arms; the control target of the bottom control layer is to make the bridge arm voltage actually output by the whole bridge arm track the bridge arm reference voltage while maintaining the balance of the SM capacitor voltages in the bridge arm.
Fig. 3 is a block diagram of the bottom control layer of an existing MMC. Assuming that the number of SMs connected in series in each bridge arm is N, the capacitor voltages of the N SMs form a 1×N array; the array Vorder is obtained by sorting this array online according to voltage, and a new array Vsort is then obtained by updating the insertion priority of the SMs according to the current direction. The bridge arm voltage reference value is divided by the average of the N SM capacitor voltages in the bridge arm to obtain the number Nref of SMs to be inserted; from Nref and Vsort, the driving signal of each SM in the bridge arm is finally obtained through a modulation method such as nearest level modulation (NLM), pulse width modulation (PWM) or phase-shifted carrier modulation (PSC). However, these bottom control strategies all need to sort a huge number of SM voltages online in real time, so not only is the control algorithm complex, but the demand on the computational power of the controller is also extremely high. In addition, if the sub-modules are switched between the inserted and bypassed states too frequently, the switching loss increases greatly, and devices with higher loss have higher thermal stress, which increases the device failure rate.
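For illustration only, a minimal Python sketch of the conventional sorting-based bottom layer control described above; the function and variable names are hypothetical, and only the sorting and insertion-count steps are shown, not the subsequent NLM/PWM/PSC modulation:

```python
import numpy as np

def conventional_bottom_layer(v_c, i_arm, v_sref):
    """Prior-art style control: sort the SM capacitor voltages every cycle and
    insert the Nref sub-modules selected according to the arm current direction.
    v_c: capacitor voltages of the N SMs, i_arm: bridge arm current,
    v_sref: bridge arm voltage reference.  Returns one 0/1 insertion command per SM."""
    v_c = np.asarray(v_c, dtype=float)
    n = len(v_c)
    n_ref = int(np.clip(round(v_sref / v_c.mean()), 0, n))   # number of SMs to insert
    # Assumed sign convention: a positive arm current charges the inserted capacitors,
    # so insert the lowest-voltage SMs first; otherwise insert the highest-voltage SMs.
    order = np.argsort(v_c) if i_arm > 0 else np.argsort(-v_c)
    cmd = np.zeros(n, dtype=int)
    cmd[order[:n_ref]] = 1
    return cmd
```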
In order to solve the above problems, the present invention proposes a bottom layer control system, method, computer device, computer readable storage medium and computer program product of MMC based on reinforcement learning, so as to optimize switching loss, and ensure that voltage output of the whole bridge arm of MMC tracks bridge arm reference voltage while maintaining balance of SM capacitor voltage in the bridge arm.
In one embodiment, FIG. 4 shows a structural schematic diagram of the reinforcement-learning-based MMC bottom layer control system; the control system comprises an environment 10 and an agent 20. The agent 20 is a learning system that interacts with the environment 10; the environment 10 is the task environment in which the agent 20 is located. In each iteration of the loop, the agent 20 obtains the state observation s(t) and the reward r(t) of the current cycle from the environment 10, and selects an action a(t) from all available actions based on s(t) and r(t) to affect the state change of the environment 10, obtaining the reward r(t+1) and the state observation s(t+1) of the environment in the next cycle. Through continuous iterative training, the optimal strategy with the maximum reward is obtained. A policy is the mode or rule that defines the actions of the agent in different states in reinforcement learning; it is a mapping from states to actions and guides the agent on what action to take in a specific state.
Specifically, the environment 10 includes: the MMC 110, the sampling module 120, and the upper and middle layer controller 130. The topology of the MMC 110 is shown in fig. 1. The sampling module 120 is configured to collect the bridge arm current Is(t) of the MMC 110, the capacitor voltages VCi(t) of the N sub-modules (i = 1, 2, ..., N), and the bridge arm voltage Vs(t). The upper and middle layer controller 130 is configured to output a bridge arm voltage reference value Vsref(t) according to the bridge arm current Is(t) collected by the sampling module 120 and the capacitor voltages VC1(t), VC2(t), ..., VCN(t) of the N sub-modules. The control method adopted by the upper and middle layer controller may be any of the upper layer and middle layer control methods in the prior art.
The state observation s(t) obtained from the environment includes the bridge arm current Is(t), the capacitor voltages VCi(t) of the N sub-modules (i = 1, 2, ..., N), the operating state M(t) of each sub-module, and the bridge arm voltage reference value Vsref(t).
The reward r(t) is a scalar feedback signal given by the environment that indicates how well the agent 20 performed after taking the action. In order to optimize the switching loss while ensuring that the voltage output of the whole MMC bridge arm tracks the bridge arm reference voltage and that the SM capacitor voltages in the bridge arm remain balanced, the reward function is designed as a function comprising terms for evaluating the optimization of the switching loss, the tracking performance of the bridge arm reference voltage, and the balance performance of the capacitor voltage of each sub-module. Specifically, the reward r(t) is calculated from the bridge arm voltage reference value Vsref(t-1) of the previous cycle, the bridge arm voltage Vs(t) of the current cycle, the capacitor voltages VCi(t) of the N sub-modules, the operating states M(t) of the sub-modules, and the operating states M(t-1) of the sub-modules in the previous cycle.
The agent 20 may be a DQN (deep Q network), a double deep Q network, a dueling deep Q network, or another agent, which is not limited here. The agent 20 selects an action a(t) based on the environmental state observation s(t) and the reward r(t). The reinforcement-learning-based MMC bottom layer control method of this embodiment focuses first on improving the bottom layer control of the MMC 110 and realizes overall control based on the improved bottom layer control. The bottom layer control is mainly the control of the operating state of each sub-module (insertion or bypass), which in turn becomes the control of the switching state of the power devices in each sub-module. Therefore, the action a(t) comprises the control of the operating state M(t) of each sub-module in the MMC 110, which is further converted into the control of the switching state G(t) of the power devices in each sub-module. After the switching states G(t) of the power devices in the sub-modules are controlled, the state of the environment changes. After the environmental state changes, a new state observation s(t+1) and a new reward r(t+1) are generated, and the next cycle of the iteration begins.
In one embodiment, as shown in fig. 5, the present application provides a method for controlling the bottom layer of a modular multilevel converter based on reinforcement learning, which is explained by taking an application to the bottom layer control system shown in fig. 4 as an example, and the method includes the following steps.
In step S101, the agent 20 acquires a state observation S (t) and a reward r (t) of the current cycle through the environment 10.
Step S102, based on the state observations S (t) and rewards r (t), the agent 20 outputs an action a (t) for the underlying control of the modular multilevel converter 110.
In step S103, the environment 10 receives the action a (t) and then changes the state, and accordingly, the state observation and the rewards change based on the action a (t). And (3) carrying out an iterative loop of the step S101-the step S102 based on the changed state observation and rewarding until a first preset condition is met to obtain an optimal strategy of the intelligent agent, wherein the first preset condition is as follows: the reward reaches a maximum value or the number of iterative loops reaches a preset value.
Specifically, during the first cycle the agent 20 first gives a random initial action a(0); the environment receives the initial action a(0) and then changes its state, so as to obtain the state observation s(1) and the reward r(1) and enter the next cycle.
The action a(t) is mainly used for bottom layer control of the modular multilevel converter, so the action a(t) comprises the control of the operating state M(t) of each sub-module in the MMC. The operating state M(t) of each sub-module is either insertion or bypass; during normal operation of the MMC the bridge arm voltage is always non-negative, so the bypass state of a sub-module can be represented by 0 and the forward insertion state by 1 (i.e., the sub-module capacitor is connected in series in the main loop in the forward direction). The operating states of the N sub-modules can therefore be represented by an N-bit binary number, and by converting this binary number into a decimal number, every possible combination of operating states of the N sub-modules is uniquely mapped to a decimal number in the range 0 to 2^N - 1. For example, in a bridge arm with 8 sub-modules, M(t) = 00000011'b (b denotes binary) indicates that in cycle t sub-modules 1-6 are in the bypass state and sub-modules 7 and 8 are in the forward insertion state; converting the binary number to decimal (in other embodiments octal, hexadecimal, etc. may be used) gives M(t) = 00000011'b = 3'd, where d denotes decimal. Thus all possible combinations of operating states of the 8 sub-modules can be uniquely mapped by one decimal variable in the range 0 to 255. Representing the operating states of the sub-modules by a decimal number not only uniquely determines whether each sub-module is 1 (forward insertion) or 0 (bypass), but also reduces the dimension of the state observation variable, and in turn the computational resources required in the neural network training process.
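As an illustration of the state encoding described above, a minimal Python sketch with hypothetical helper names; it packs the per-sub-module insert/bypass flags into one integer in the range 0 to 2^N - 1 and unpacks them again, reproducing the 8-sub-module example from the text:

```python
def encode_states(states):
    """Pack per-sub-module operating states (1 = forward insertion, 0 = bypass)
    into a single integer; states[0] is taken as the most significant bit."""
    value = 0
    for s in states:
        value = (value << 1) | (s & 1)
    return value

def decode_states(value, n):
    """Unpack an integer back into n per-sub-module operating states."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]

# Example from the description: 8 sub-modules, only sub-modules 7 and 8 inserted.
assert encode_states([0, 0, 0, 0, 0, 0, 1, 1]) == 3
assert decode_states(3, 8) == [0, 0, 0, 0, 0, 0, 1, 1]
```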
Depending on the sub-module topology, the operating state control of a sub-module is converted into control of the switching state G(t) of each power device within the sub-module. For the half-bridge sub-module topology shown in fig. 6 and the full-bridge sub-module topology shown in fig. 7, the corresponding switching states G(t) are shown in Tables 1 and 2 below.
TABLE 1 Half-bridge sub-module power device switching state table
Wherein '1' indicates that the operating state of the half-bridge sub-module is the forward insertion state, '0' indicates that the operating state of the half-bridge sub-module is the bypass state, and S1 and S2 are the power devices in the half-bridge sub-module.
TABLE 2 Full-bridge sub-module power device switching state table
Wherein '1' indicates that the preset operating state of the full-bridge sub-module is the forward insertion state, '0' indicates that the preset operating state of the full-bridge sub-module is the bypass state, and S3-S6 are the power devices in the full-bridge sub-module.
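For the half-bridge sub-module, the state-to-gate conversion can be sketched as below; this assumes the conventional labeling in which S1 connects the capacitor into the arm and S2 bypasses it (the full-bridge mapping depends on the device numbering in fig. 7 and is therefore not sketched here):

```python
def half_bridge_gates(state):
    """Map a half-bridge sub-module operating state to (S1, S2) gate commands.
    state: 1 = forward insertion (capacitor in the main loop), 0 = bypass.
    S1 and S2 are kept complementary so the capacitor is never short-circuited."""
    s1 = 1 if state == 1 else 0
    s2 = 1 - s1
    return s1, s2
```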
Each time the environment 10 receives an action a(t-1) given by the agent 20, the state of the environment changes; the main change is in the parameter values related to the operating performance of the MMC. The state change of the environment may be collected by the sampling module 120. Assuming that the MMC 110 comprises N sub-modules, the state change of the environment includes the operating state M(t) of each sub-module, the bridge arm current Is(t), the bridge arm voltage Vs(t), and the capacitor voltages VCi(t) of the N sub-modules (i = 1, 2, ..., N); the upper and middle layer controller may calculate the bridge arm voltage reference value Vsref(t) from the bridge arm current Is(t) and the capacitor voltages VCi(t) of the N sub-modules (i = 1, 2, ..., N).
The bridge arm current Is(t) collected by the sampling module, the capacitor voltages VCi(t) of the N sub-modules (i = 1, 2, ..., N), the operating state M(t) of each sub-module, and the bridge arm voltage reference value Vsref(t) are sent to the agent as the state observation.
The agent maximizes the reward during training; therefore, by fusing the parameter values related to the operating performance of the MMC into the reward function and then obtaining the control strategy corresponding to the maximum reward, the MMC achieves high operating performance under that control strategy. In order to optimize the switching loss while ensuring that the voltage output of the whole MMC bridge arm tracks the bridge arm reference voltage and that the SM capacitor voltages in the bridge arm remain balanced, the reward function is designed as a function comprising terms for evaluating the optimization of the power device switching loss, the tracking performance of the bridge arm reference voltage, and the balance performance of the capacitor voltage of each sub-module, expressed as:
r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t);     (1)
wherein r(t) is the reward, ω1 is a preset first constant, rV(t) is the tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is the balance performance parameter of the sub-module capacitor voltages, ω3 is a preset third constant, rS(t) is the optimization performance parameter of the power device switching losses, and t is the number of loop iterations.
Illustratively, the tracking performance parameters of the bridge arm reference voltage include: and obtaining a first parameter value based on the reference voltage of the bridge arm of the converter, the voltage of the bridge arm of the converter and the rated direct current voltage.
Further, the tracking performance parameter of the bridge arm reference voltage can be obtained through the following formula:
wherein Vs(t) is the bridge arm voltage of the current cycle, Vsref(t-1) is the bridge arm reference voltage of the previous cycle, and Vdc is the rated direct-current voltage, which equals the maximum bridge arm voltage. That is, the tracking performance of the bridge arm reference voltage is evaluated by the difference between the bridge arm reference voltage and the bridge arm voltage; the first parameter value, obtained by dividing this difference by the rated direct-current voltage for normalization, is the tracking performance parameter of the bridge arm reference voltage.
Illustratively, the balance performance parameters of the sub-module capacitance voltages include: and obtaining a second parameter value based on the maximum difference value among the capacitance voltages of the sub-modules, the rated voltage of the sub-modules and the target difference value.
Further, the balance performance parameters of the capacitance voltage of each sub-module can be obtained through the following formula:
Wherein VCrated is the rated voltage of a sub-module (the rated voltage of every sub-module is the same). The exponential form in the formula characterizes the capacitor voltage balance performance of the sub-modules: in general, to ensure normal operation of the MMC, the difference between the maximum and minimum sub-module capacitor voltages in one bridge arm should be less than 20% of the sub-module rated voltage (i.e. the target difference), and a maximum voltage difference exceeding 20% causes rC(t) to decrease rapidly. Therefore, limiting the difference between the maximum and minimum sub-module capacitor voltages in one bridge arm to less than 20% of the sub-module rated voltage ensures normal operation of the MMC; it can be understood that, according to actual needs, the difference between the maximum and minimum sub-module capacitor voltages may be set to satisfy other conditions.
Illustratively, the operating frequency optimization performance parameter of the power devices includes: a value obtained by comparing the on/off states of the power devices in iteration cycle t with those in iteration cycle t-1 (i.e. the number of changes), where t is a positive integer.
Further, the operating frequency optimization performance parameters of each power device of the converter can be obtained through the following formula:
wherein M'(t) and M'(t-1) are the decimal numbers converted from the binary numbers, and rS(t) describes the number of sub-modules whose state changes from the previous cycle t-1 to the current cycle t.
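Since the formula images for rV(t), rC(t) and rS(t) are not reproduced in this text, the following Python sketch only illustrates component forms that are consistent with the surrounding description — a normalized voltage-tracking error, a term that decays rapidly once the capacitor-voltage spread exceeds 20% of the rated value, and a count of sub-modules whose state changed — combined with the weights of formula (1); the exact signs and functional shapes are assumptions, not the patent's formulas:

```python
import numpy as np

def reward(v_s, v_sref_prev, v_dc, v_c, v_c_rated, m_now, m_prev, w1, w2, w3):
    """Hedged sketch of r(t) = w1*rV(t) + w2*rC(t) + w3*rS(t); the component
    forms are assumptions consistent with the text, not the patent's formulas."""
    n = len(v_c)
    # rV: bridge arm voltage tracking error, normalized by the rated DC voltage.
    r_v = -abs(v_sref_prev - v_s) / v_dc
    # rC: capacitor-voltage balance; decays rapidly once the maximum spread
    # exceeds 20% of the sub-module rated voltage (the target difference).
    spread = (max(v_c) - min(v_c)) / v_c_rated
    r_c = np.exp(-50.0 * max(0.0, spread - 0.2))
    # rS: switching-loss proxy; number of sub-modules whose state changed
    # between cycle t-1 and cycle t (M held as integers, fewer changes = better).
    changed = bin((m_now ^ m_prev) & ((1 << n) - 1)).count("1")
    r_s = -changed / n
    return w1 * r_v + w2 * r_c + w3 * r_s
```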
Therefore, in this embodiment, maximizing the reward r(t) of the reward function simultaneously ensures the tracking performance of the MMC bridge arm reference voltage, the balance performance of the sub-module capacitor voltages, and the optimization of the operating frequency of the power devices. The above loop is iterated continuously until a first preset condition is met, where the first preset condition may be: the reward r(t) reaches a maximum or the number of loop iterations reaches a preset value.
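A minimal sketch of the iterative loop of steps S101-S103, assuming a generic agent object with act/learn/policy methods and an env object wrapping the MMC, the sampling module and the upper and middle layer controller (all of these names are hypothetical):

```python
def train(agent, env, max_iterations, reward_target=None):
    """Iterate steps S101-S102 until the first preset condition is met:
    the reward reaches its (target) maximum or the preset number of
    loop iterations is reached."""
    action = env.random_initial_action()        # a(0): random initial action
    for t in range(1, max_iterations + 1):
        obs, rew = env.step(action)             # S101: state observation s(t), reward r(t)
        agent.learn(obs, rew)                   # update the agent from s(t) and r(t)
        action = agent.act(obs)                 # S102: action a(t) for bottom layer control
        if reward_target is not None and rew >= reward_target:
            break                               # reward treated as having reached its maximum
    return agent.policy()                       # the optimal strategy learned in this round
```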
Example two
In Embodiment 1, once the first constant ω1, the second constant ω2 and the third constant ω3 in the reward function are set, they do not change until the end condition of the loop iteration (the first preset condition) is reached. The setting of ω1, ω2 and ω3 affects the optimal strategy that finally obtains the maximum reward: under that optimal strategy, some of the tracking performance of the bridge arm voltage reference signal, the balance performance of the sub-module capacitor voltages and the optimization of the power device switching loss may be particularly good while the others are poor; that is, a perfect balance among the three cannot be achieved, and some performance may even fail to meet the relevant requirement. Therefore, multiple optimal strategies are obtained through multiple rounds of training, and the optimal strategy satisfying a second preset condition is selected from them as the final strategy.
The flow chart of the MMC bottom layer control method based on multi-round reinforcement learning in the present embodiment is shown in fig. 8, and includes the following steps S101 to S105.
In step S101, the agent 20 acquires a state observation S (t) and a reward r (t) of the current cycle through the environment 10.
Step S102, based on the state observations S (t) and rewards r (t), the agent 20 outputs an action a (t) for the underlying control of the modular multilevel converter 110.
In step S103, the environment 10 receives the action a (t) and then changes the state, and accordingly, the state observation and the rewards change based on the action a (t). Performing an iterative loop as in steps S101-S102 based on the changed state observations and rewards until a first preset condition is satisfied to obtain an optimal strategy of the agent, where the first preset condition is: the reward reaches a maximum value or the number of iterative loops reaches a preset value.
Step S104, replacing the reward function and performing steps S101-S103; a plurality of agents are trained with a plurality of different reward functions, obtaining the plurality of agents and the plurality of optimal strategies corresponding to them;
and step S105, selecting an optimal strategy meeting a second preset condition from the plurality of optimal strategies as a final strategy to perform bottom-layer control on the modularized multi-level converter.
In one embodiment, the second preset condition includes that the tracking performance parameter of the bridge arm voltage reference signal, the balance performance parameter of the sub-module capacitor voltages and the optimization performance parameter of the power device switching loss respectively satisfy the following conditions:
1) Tracking performance parameter of the bridge arm voltage reference signal: the difference between the bridge arm voltage reference and the bridge arm voltage is less than or equal to 5 percent of the rated voltage, or the total harmonic distortion (THD) of the MMC is less than or equal to 2 percent;
2) Balance performance parameter of the sub-module capacitor voltages: the difference between the maximum sub-module capacitor voltage and the minimum sub-module capacitor voltage in one bridge arm is less than or equal to 20 percent of the sub-module rated voltage;
3) Optimization performance parameter of the power device switching loss: the total power device switching loss is the minimum among all the optimal strategies.
In the above MMC bottom layer control method based on multi-round reinforcement learning, the same reward function is used within one round of agent training, and different reward functions are used in different rounds of agent training. In one embodiment, although the reward functions differ between rounds of agent training, the basic considerations of each reward function may be the same; for example, each may include the tracking performance parameter of the bridge arm reference voltage, the balance performance parameter of the sub-module capacitor voltages and the optimization performance parameter of the power device switching loss, and can be expressed by formula (1) in Embodiment 1. The difference is realized by replacing the weight coefficients, i.e. changing one or more of the first constant ω1, the second constant ω2 and the third constant ω3. By continuously changing the preset values of ω1, ω2 and ω3, a plurality of reward functions are obtained, each reward function is used for the training of one agent, a plurality of optimal strategies are obtained through the plurality of reward functions, the optimal strategy satisfying the second preset condition is selected as the final strategy, and an MMC with the best achievable operating performance is obtained based on this final strategy.
Specifically: assuming there are M rounds of agent training, the first constant in the reward function of each round of agent training is ω1_j, the second constant is ω2_j, and the third constant is ω3_j, where j represents the round number, j = 1, 2, ..., M. The ω1_j, ω2_j and ω3_j in the reward function of one round of agent training are not exactly the same as the ω1_j, ω2_j and ω3_j in the reward functions of the other rounds of agent training.
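A minimal sketch of the multi-round procedure, reusing the single-round train routine sketched above and hypothetical helpers for fixing the reward weights and evaluating the second preset condition:

```python
def multi_round_training(make_agent, env, weight_sets, max_iterations):
    """Train one agent per (w1, w2, w3) weighting, then pick the final strategy:
    among the optimal strategies that satisfy the second preset condition,
    the one with the lowest total switching loss."""
    candidates = []
    for w1, w2, w3 in weight_sets:              # a different reward function per round
        env.set_reward_weights(w1, w2, w3)      # hypothetical: fixes omega_1..omega_3 for this round
        policy = train(make_agent(), env, max_iterations)
        metrics = env.evaluate(policy)          # hypothetical: tracking error, THD, spread, loss
        candidates.append((policy, metrics))
    feasible = [(p, m) for p, m in candidates
                if (m["tracking_error_pct"] <= 5.0 or m["thd_pct"] <= 2.0)
                and m["voltage_spread_pct"] <= 20.0]
    best_policy, _ = min(feasible, key=lambda pm: pm[1]["switching_loss"])
    return best_policy
```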
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Example III
Based on the same inventive concept, the embodiment of the application also provides a converter control device for realizing the converter control method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the converter control device or devices provided below may be referred to the limitation of the converter control method hereinabove, and will not be described herein.
In one embodiment, as shown in the block diagram of the converter control device 900 in fig. 9, a converter control device 900 is provided, including: a data acquisition module 910, a data analysis module 920, and a loop module 930. The data acquisition module 910 is configured to acquire the state observation and the reward of the environment. The data analysis module 920 is configured to input the state observation and the reward to the agent and obtain the action output by the agent, where the action is used to perform bottom layer control on the modular multilevel converter. The state observation and the reward of the environment change based on the action. The loop module 930 is configured to perform the iterative loop of steps S101-S102 of the reinforcement-learning-based bottom layer control method of the modular multilevel converter in the above embodiment based on the changed state observation and reward until a first preset condition is met to obtain the optimal strategy of the agent, where the first preset condition is: the reward reaches the maximum value or the number of iterative loops reaches a preset value; and to perform bottom layer control on the modular multilevel converter using the optimal strategy.
In one embodiment, the state observation of the environment includes: bridge arm current and bridge arm voltage reference values of the modularized multi-level converter, capacitance voltage of each sub-module in the modularized multi-level converter and operation state of each sub-module.
In one embodiment, the rewards are obtained by operating performance parameters and rewards functions of the modular multilevel converter; each submodule in the modularized multi-level converter comprises a plurality of power device switches; the operation performance parameters comprise tracking performance parameters of bridge arm reference voltages, balance performance parameters of capacitance voltages of all sub-modules and optimization performance parameters of switching losses of the power device; the reward function includes:
r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t);
wherein r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the capacitance voltages of the sub-modules, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; the first constant is the same in each loop iteration, the second constant is the same in each loop iteration, and the third constant is the same in each loop iteration.
In one embodiment, the tracking performance parameters of the bridge arm reference voltage include: and obtaining a first parameter value based on the bridge arm reference voltage, the bridge arm voltage and the rated direct current voltage.
In one embodiment, the balance performance parameters of the capacitance voltage of each sub-module include: and obtaining a second parameter value based on the maximum difference value among the capacitance voltages of the sub-modules, the rated voltage of the sub-modules and the target difference value.
In one embodiment, the optimization performance parameter of the switching loss of the power devices includes: a value obtained by comparing the on/off states of the power devices in iteration cycle t with those in iteration cycle t-1 (i.e. the number of changed states); t is a positive integer.
In one embodiment, the data analysis module 920 is further configured to obtain actions to control the operation states of the sub-modules in the modular multilevel converter.
In one embodiment, the data analysis module 920 is further configured to acquire, based on the state observation and the reward, a binary number characterizing the operating state of each sub-module in the modular multilevel converter, and to convert the binary number into a number expressed in a target radix (for example a decimal number), the converted number being characterized as the action.
In one embodiment, the loop module 930 is further configured to perform multiple rounds of agent training through different reward functions and obtain multiple optimal strategies, and to select the optimal strategy satisfying a second preset condition from the multiple optimal strategies as the final strategy for bottom layer control of the modular multilevel converter; the second preset condition is: the difference between the bridge arm voltage reference and the bridge arm voltage is less than or equal to 5 percent of the rated voltage, or the total harmonic distortion (THD) of the MMC is less than or equal to 2 percent; the difference between the maximum sub-module capacitor voltage and the minimum sub-module capacitor voltage in one bridge arm is less than or equal to 20 percent of the sub-module rated voltage; and the total switching loss of all power devices is the minimum.
Wherein the reward function in each round of agent training includes r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t); r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the sub-module capacitance voltages, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; ω1, ω2 and ω3 are not exactly the same in different rounds of agent training.
The respective modules in the above-described converter control device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Example IV
In the present embodiment, a computer device, which may be a server, is provided, and an internal structure thereof may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing various data involved in the converter control method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a converter control method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (14)

1. The bottom layer control method of the modularized multi-level converter based on reinforcement learning is characterized in that the modularized multi-level converter is arranged in the environment, and the method comprises the following steps:
Step S1, acquiring state observation and rewards of the environment;
S2, inputting the state observation and rewards into an intelligent agent to obtain an action output by the intelligent agent, wherein the action is used for performing bottom control on the modularized multi-level converter;
Step S3, the state observation and the reward of the environment change based on the action, and the iterative loop of steps S1-S2 is carried out based on the changed state observation and reward until a first preset condition is met to obtain an optimal strategy of the agent, wherein the first preset condition is that: the reward reaches the maximum value or the number of iterative loops reaches a preset value; and utilizing the optimal strategy to perform bottom layer control on the modularized multi-level converter.
2. The method of claim 1, wherein the state observation of the environment comprises: bridge arm current and bridge arm voltage reference values of the modularized multi-level converter, capacitance voltage of each sub-module in the modularized multi-level converter and operation state of each sub-module.
3. The method of claim 1, wherein the rewards are obtained by operating performance parameters and rewards functions of the modular multilevel converter; each submodule in the modularized multi-level converter comprises a plurality of power device switches; the operation performance parameters comprise tracking performance parameters of bridge arm reference voltages, balance performance parameters of capacitance voltages of all sub-modules and optimization performance parameters of switching losses of the power device; the reward function includes:
r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t);
wherein r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the capacitance voltages of the sub-modules, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; the first constant is the same in each loop iteration, the second constant is the same in each loop iteration, and the third constant is the same in each loop iteration.
4. The method of claim 3, wherein the tracking performance parameters of the leg reference voltage comprise: and obtaining a first parameter value based on the bridge arm reference voltage, the bridge arm voltage and the rated direct current voltage.
5. A method according to claim 3, wherein the balance performance parameters of the capacitor voltages of the sub-modules comprise: and obtaining a second parameter value based on the maximum difference value among the capacitance voltages of the sub-modules, the rated voltage of the sub-modules and the target difference value.
6. The method of claim 3, wherein the optimization performance parameter of the power device switching loss includes: a value obtained by comparing the on/off states of the power devices in iteration cycle t with those in iteration cycle t-1 (i.e. the number of changed states); t is a positive integer.
7. The method of claim 1, wherein the acts for underlying control of the modular multilevel converter comprise: the actions are used for controlling the running states of all sub-modules in the modularized multi-level converter.
8. The method of claim 2, wherein the acts for controlling the operational state of each sub-module in the modular multilevel converter comprise:
Acquiring binary numbers representing the running states of all sub-modules in the modularized multi-level converter based on the state observation and the rewards;
converting the binary number into a number expressed in a target radix (for example a decimal number), the converted number being characterized as the action.
9. A bottom layer control method of a modularized multi-level converter based on multi-round reinforcement learning, characterized in that the method comprises multiple rounds of agent training, each round of agent training comprising the bottom layer control method of a modularized multi-level converter based on reinforcement learning according to any one of claims 1 to 8; different reward functions are adopted in different rounds of agent training, multiple rounds of agent training are carried out through the different reward functions, a plurality of optimal strategies are obtained, and the optimal strategy satisfying a second preset condition is selected from the plurality of optimal strategies as a final strategy so as to carry out bottom layer control on the modularized multi-level converter.
10. The method of claim 9, wherein the second preset condition is:
The difference between the bridge arm voltage reference and the bridge arm voltage is less than or equal to 5 percent of the rated voltage, or the total harmonic distortion (THD) of the MMC is less than or equal to 2 percent;
The difference between the maximum sub-module capacitance voltage and the minimum sub-module capacitance voltage in one bridge arm is less than or equal to 20 percent of the sub-module rated voltage;
The total switching loss of all power devices is the minimum among the optimal strategies.
11. The method of claim 9, wherein,
The reward function in each round of agent training includes r(t) = ω1·rV(t) + ω2·rC(t) + ω3·rS(t); r(t) is the reward, ω1 is a preset first constant, rV(t) is a tracking performance parameter of the bridge arm reference voltage, ω2 is a preset second constant, rC(t) is a balance performance parameter of the sub-module capacitance voltages, ω3 is a preset third constant, rS(t) is an optimization performance parameter of the switching loss of the power devices, and t is the number of loop iterations; in the same round of agent training, ω1, ω2 and ω3 are unchanged once set; ω1, ω2 and ω3 are not exactly the same in different rounds of agent training.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202410397056.9A 2024-04-03 Bottom layer control method of modularized multi-level converter based on reinforcement learning Active CN117997152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410397056.9A CN117997152B (en) 2024-04-03 Bottom layer control method of modularized multi-level converter based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410397056.9A CN117997152B (en) 2024-04-03 Bottom layer control method of modularized multi-level converter based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN117997152A 2024-05-07
CN117997152B 2024-06-07


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008708A (en) * 2019-12-23 2020-04-14 广东电网有限责任公司 Parameter adjusting method and system for quasi-proportional resonant controller
US20220325696A1 (en) * 2019-09-16 2022-10-13 Siemens Gamesa Renewable Energy A/S Wind turbine control based on reinforcement learning
CN115276442A (en) * 2022-07-06 2022-11-01 电子科技大学 Method for reducing total harmonic distortion of output current of modular multilevel converter
CN116306372A (en) * 2023-03-27 2023-06-23 湘潭大学 Electric-gas area comprehensive energy system safety correction control method based on DDPG algorithm
CN116681142A (en) * 2023-05-16 2023-09-01 清华大学 Method and device for reinforcement learning of agent based on iterative strategy constraint
CN117318480A (en) * 2023-09-04 2023-12-29 华中科技大学 Modulation strategy design method and system of DC-DC converter based on reinforcement learning
CN117318553A (en) * 2023-09-29 2023-12-29 曲阜师范大学 Low-wind-speed permanent magnet direct-driven wind turbine control method based on TD3 and Vienna rectifier
CN117374937A (en) * 2023-10-11 2024-01-09 中国电力科学研究院有限公司 Multi-micro-grid collaborative optimization operation method, device, equipment and medium
US20240014661A1 (en) * 2020-11-17 2024-01-11 Mitsubishi Electric Corporation Power conversion device and estimation device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jiang Xirui et al.: "Research on valve base control technology for high-voltage large-capacity flexible HVDC transmission based on a multi-agent architecture", Proceedings of the CSEE, vol. 33, no. 28, 5 October 2013 (2013-10-05), pages 59-66 *
Sun Guoqiang et al.: "Model predictive control technology for the droop control characteristics of offshore wind farm MMC-MTDC", Renewable Energy Resources, vol. 35, no. 3, 31 March 2017 (2017-03-31), pages 419-426 *

Similar Documents

Publication Publication Date Title
Dang et al. Cost function‐based modulation scheme of model predictive control for VIENNA rectifier
CN115864521A (en) Dynamic damping non-source sequence model prediction control method for grid-connected inverter
Zhang et al. Optimal triple-phase-shift controller design of isolated bidirectional DC-DC converter based on ant colony algorithm and BP neural network
Bandi et al. Analysis, modelling and design of resonant dual active bridge isolated bidirectional DC/DC converter for minimizing cold start effect of fuel cell vehicle
Khan et al. Centralized fuzzy logic based optimization of pi controllers for VSC control in MTDC network
CN117997152B (en) Bottom layer control method of modularized multi-level converter based on reinforcement learning
CN117318480A (en) Modulation strategy design method and system of DC-DC converter based on reinforcement learning
CN117997152A (en) Bottom layer control method of modularized multi-level converter based on reinforcement learning
Ortatepe et al. Pre-calculated duty cycle optimization method based on genetic algorithm implemented in DSP for a non-inverting buck-boost converter
Wang et al. A deep neural network based predictive control strategy for high frequency multilevel converters
Bohara et al. Triple phase shift control of dual active bridge converter using machine learning methods
CN116488498A (en) Converter control method and related assembly
Jung et al. Reinforcement Learning Based Modulation for Balancing Capacitor Voltage and Thermal Stress to Enhance Current Capability of MMCs
Sun et al. Model predictive control based on cuckoo search algorithm of interleaved parallel bi-directional dc-dc converter
Saktheeswaran et al. Experimental validation of multi-loop controllers for two-level cascaded positive output boost converter
CN115173706A (en) Modeling and control method and system of bidirectional Buck-Boost converter and storage medium
Wang et al. Efficiency Optimization Design of Three-Level Active Neutral Point Clamped Inverter Based on Deep Reinforcement Learning
Mohammadzadeh et al. Application of mixture of experts in machine learning-based controlling of DC-DC power electronics converter
CN112583266A (en) Model prediction control method, system, equipment and medium of Buck-Boost converter
CN113346777B (en) PI passive control method and system for modularized multi-level converter
CN110912167A (en) Method for improving decoupling control of hybrid energy storage system
CN112701951A (en) Inverter voltage state prediction control method based on tolerant hierarchical sequence method
CN112054696A (en) Multilevel converter optimization control method and device based on minimum backflow power
Lin The dual active bridge converter design with artificial intelligence
CN117439148A (en) Control parameter optimization method and device for flexible direct current transmission system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant