CN117057228A - Inverter multi-objective optimization method based on deep reinforcement learning - Google Patents

Inverter multi-objective optimization method based on deep reinforcement learning

Info

Publication number
CN117057228A
CN117057228A (application CN202311003536.4A)
Authority
CN
China
Prior art keywords
neural network
output
network
inverter
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311003536.4A
Other languages
Chinese (zh)
Inventor
王佳宁
吴轶康
杨仁海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311003536.4A priority Critical patent/CN117057228A/en
Publication of CN117057228A publication Critical patent/CN117057228A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/06 Multi-objective optimisation, e.g. Pareto optimisation using simulated annealing [SA], ant colony algorithms or genetic algorithms [GA]

Abstract

The invention provides an inverter multi-objective optimization method based on deep reinforcement learning. Analytic formulas are used to establish an efficiency optimization model and a power density optimization model, and a neural network is used to establish an EMI common-mode noise optimization model; a state set, an action set and a normalized multi-objective reward function are then determined; offline learning is performed with the DDPG algorithm to obtain an optimal strategy, and the algorithm is then applied, so that under any state and any weight coefficients the system can optimize efficiency and power density while meeting the EMI standard according to the optimal strategy. Modeling the EMI common-mode noise with a neural network avoids a large number of circuit simulations and improves optimization efficiency; the DDPG algorithm can handle complex high-dimensional design variables, avoids the problems of strong parameter coupling and design failure in inverter design, and can quickly find an optimal scheme.

Description

Inverter multi-objective optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of power electronics, relates to multi-objective optimization of inverters, and provides an inverter multi-objective optimization method based on deep reinforcement learning.
Background
With the advance of the dual-carbon strategy, the utilization of renewable energy is increasingly important, and solar and wind power generation are important components of future clean-energy utilization. The inverter is the interface between wind and photovoltaic power generation systems and the power grid, and performs the core functions of electric energy conversion and control. The inverter is therefore one of the indispensable key links for guaranteeing the efficient, economical and stable operation of photovoltaic and wind power generation systems, and enabling the inverter to reach optimal efficiency and power density while meeting the EMI standard under any operating condition is of great significance to that operation.
Wide-bandgap power devices are widely applied in power electronic converters because of their excellent high-frequency, high-voltage and high-temperature characteristics. Increasing the switching frequency reduces the volume of the inverter and thereby greatly increases its power density, but it also increases the electromagnetic interference (Electromagnetic Interference, EMI) noise of the inverter. In addition, the reduction in inverter volume leads to a more compact component layout, which strengthens the parasitic-parameter and thermal coupling between inverter design objectives and reduces inverter efficiency and lifetime. Therefore, for the inverter system to achieve comprehensively excellent performance, multi-objective optimization design is of paramount importance, and many experts and scholars have proposed different solutions:
The invention patent CN112968474A discloses a multi-objective optimization method for a photovoltaic off-grid inverter system, which uses the genetic algorithm NSGA-III to perform multi-objective optimization on the system. This solution has the following drawback: because the NSGA-III algorithm is adopted, whenever the system state changes the complex and time-consuming optimization solving process must be carried out again, which consumes computing resources and cannot quickly give the action value after the state change; the optimization process is therefore limited and the application range is restricted.
Another invention patent discloses a multi-objective optimization method for a photovoltaic inverter based on the DDPG algorithm, which uses the deep reinforcement learning algorithm DDPG to perform multi-objective optimization on a photovoltaic off-grid inverter system and overcomes the limitation of genetic-algorithm optimization. However, this solution still has the following disadvantage: the optimization target models of the photovoltaic inverter system are built through analytic formulas, and a large number of circuit simulations are carried out during the tens of thousands of iterations of the multi-objective optimization, so the overall optimization efficiency is greatly reduced.
Disclosure of Invention
In the existing multi-objective optimization of inverters, the NSGA-III algorithm requires a complex and time-consuming training or solving process, cannot quickly give the action value after a state change and thus hardly meets practical application requirements, and optimization modeling with analytic formulas is inefficient. To address these problems in the prior art, the invention provides an inverter multi-objective optimization method based on deep reinforcement learning.
To achieve this aim, the invention provides an inverter multi-objective optimization method based on deep reinforcement learning, wherein the inverter comprises a direct-current voltage source 10, a three-phase three-level ANPC inverter circuit 20, a filter circuit 30 and a load 40;
the three-phase three-level ANPC inverter circuit 20 comprises two identical supporting capacitors and an inverter main circuit; the two supporting capacitors are denoted supporting capacitor Cap_1 and supporting capacitor Cap_2, Cap_1 and Cap_2 are connected in series between the DC positive bus P and the DC negative bus E of the DC voltage source 10, and the connection point of Cap_1 and Cap_2 is denoted the DC-bus midpoint O;
the inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm; each phase bridge arm contains 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit contains 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted S_ij and the 18 anti-parallel diodes are denoted D_ij, where i denotes the phase, i = a, b, c, and j denotes the serial number of the switching tube or diode, j = 1, 2, 3, 4, 5, 6; the A-phase, B-phase and C-phase bridge arms are connected in parallel between the DC positive bus P and the DC negative bus E; in each phase bridge arm, switching tubes S_i1, S_i2, S_i3 and S_i4 are connected in series in sequence: the input of S_i1 is connected to the DC positive bus P, the output of S_i1 is connected to the input of S_i2, the output of S_i2 is connected to the input of S_i3, the output of S_i3 is connected to the input of S_i4, and the output of S_i4 is connected to the DC negative bus E; the input of S_i5 is connected to the output of S_i1 and the output of S_i5 is connected to the DC-bus midpoint O; the input of S_i6 is connected to the DC-bus midpoint O and the output of S_i6 is connected to the output of S_i3; the connection point of S_i2 and S_i3 is denoted the inverter output point φ_i; switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with the same switching frequency, and switching tubes S_i2 and S_i3 are high-frequency switching tubes with the same switching frequency;
the filter circuit 30 comprises a three-phase filter inductance L_f and a three-phase filter capacitor C_0; one end of L_f is connected to the inverter output point φ_i and the other end is connected to the load 40, and C_0 is connected in parallel between the three-phase filter inductance L_f and the load 40;
the method comprises the following specific steps:
step 1, establishing an optimization target model;
The inverter is denoted as the system; the 18 switching tubes with anti-parallel diodes are decomposed into 18 switching tubes and 18 anti-parallel diodes, and the losses and volumes of the supporting capacitors Cap_1 and Cap_2 and the three-phase filter capacitor C_0 are assumed to be negligible;
step 1.1, establishing an efficiency optimization model;
Taking the efficiency η of the system as the objective, an efficiency optimization model is established with the expression:
η = (P_w − P_loss) / P_w
where P_loss is the total loss of the system, P_loss = P_T + P_L; P_T is the total loss of the 18 switching tubes and 18 anti-parallel diodes; P_L is the loss of the three-phase filter inductance L_f; and P_w is the rated input power of the system;
step 1.2, establishing a power density optimization model;
Taking the power density σ of the system as the objective, a power density optimization model is established with the expression:
σ = P_w / V
where P_w is the rated input power of the system and V is the system volume, V = V_T + 3V_L; V_T is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and V_L is the magnetic-core volume of a single-phase filter inductor of the three-phase filter inductance L_f;
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts the envelope curve of the EMI common mode noise spectrum by using an artificial neural network to represent the actual common mode noise level, and compares the predicted spectrum envelope curve with a noise amplitude curve in a standard to judge whether the standard is met;
Step 1.3.1, determining input variables and output variables of an artificial neural network;
the artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables: the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency f_1, frequency f_2, frequency f_3 and frequency f_4;
the neural network 2 has 4 input variables: the voltage value U_dc of the DC voltage source, the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequencies f_1, f_2, f_3 and f_4, denoted spectral amplitude M_f1, M_f2, M_f3 and M_f4;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data (f_swN, L_fN, L_CMN) and the neural network 1 simulated output values (f_1N, f_2N, f_3N, f_4N), where N is the group serial number, N = 1, 2, 3 ... K;
the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data (U_dcN, f_swN, L_fN, L_CMN) and the neural network 2 simulated output values (M_f1N, M_f2N, M_f3N, M_f4N), where N is the group serial number, N = 1, 2, 3 ... K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
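As an illustration of the two network structures described above, the following sketch builds the 3-8-4 and 4-11-4 fully connected networks; the layer sizes come from the description, while the hidden-layer activation (tanh) and the PyTorch framework are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Neural network 1: (f_sw, L_f, L_CM) -> 4 turning-point frequencies f1..f4
net1 = nn.Sequential(
    nn.Linear(3, 8),   # input layer -> hidden layer with 8 neurons
    nn.Tanh(),         # hidden activation (assumed; not specified in the patent)
    nn.Linear(8, 4),   # hidden layer -> 4 output frequencies
)

# Neural network 2: (U_dc, f_sw, L_f, L_CM) -> 4 spectral amplitudes M_f1..M_f4
net2 = nn.Sequential(
    nn.Linear(4, 11),  # input layer -> hidden layer with 11 neurons
    nn.Tanh(),
    nn.Linear(11, 4),  # hidden layer -> 4 output amplitudes
)

x1 = torch.tensor([[50e3, 90e-6, 120e-6]])            # example operating point
x2 = torch.tensor([[200.0, 50e3, 90e-6, 120e-6]])
print(net1(x1).shape, net2(x2).shape)                 # torch.Size([1, 4]) each
```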
step 1.3.4, grouping sample data;
Divide the sample data obtained in step 1.3.2 into a training subset, a verification subset and a test set, where the training subset contains K_1 groups of sample data, the verification subset contains K_2 groups of sample data, the test set contains K_3 groups of sample data, and K_1 + K_2 + K_3 = K;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
randomly extracting a group of input data from the training subset obtained in the step 1.3.4, and inputting the input data into the neural network 1 and the neural network 2 to obtain output corresponding to the input data; the parameters of the neural network 1 and the neural network 2 are updated by adopting an error back propagation gradient descent algorithm, and the updated neural network 1 and the updated neural network 2 are obtained;
Then input the K_2 groups of input data of the verification subset obtained in step 1.3.4 into the updated neural network 1 and neural network 2 respectively, obtaining the K_2 groups of outputs corresponding to the K_2 groups of input data, including the outputs of the neural network 1 (the predicted turning-point frequencies) and the outputs of the neural network 2 (the predicted spectral amplitudes);
Define the root mean square error δ1 between the neural network 1 outputs and the corresponding simulated output values on the verification subset, and the root mean square error δ2 between the neural network 2 outputs and the corresponding simulated output values on the verification subset;
Given a first target error e_1 and a second target error e_2, make the following judgment:
if δ1 < e_1 and δ2 < e_2, the neural network 1 model and the neural network 2 model are considered built and the method proceeds to step 1.3.6; otherwise, return to step 1.3.5;
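A minimal sketch of the training loop of step 1.3.5 is given below, assuming mean-squared-error training with error back propagation (as stated in the description) and an RMSE check on the verification subset; the optimizer, learning rate and data layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_until_valid(net, x_train, y_train, x_val, y_val, target_err, max_iter=20000):
    """Back-propagation training (step 1.3.5): update the network until the
    root-mean-square error on the verification subset drops below the target error."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)   # optimizer and lr are assumptions
    mse = nn.MSELoss()
    for _ in range(max_iter):
        idx = torch.randint(0, len(x_train), (1,))      # one randomly extracted training group
        loss = mse(net(x_train[idx]), y_train[idx])
        opt.zero_grad()
        loss.backward()                                  # error back-propagation
        opt.step()                                       # gradient-descent parameter update
        with torch.no_grad():
            rmse = torch.sqrt(mse(net(x_val), y_val))    # delta on the verification subset
        if rmse < target_err:                            # delta < e: the model is built
            break
    return net, float(rmse)
```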
Step 1.3.6, input the K_3 groups of input data of the test set obtained in step 1.3.4 into the neural network 1 model and the neural network 2 model constructed in step 1.3.5, obtaining the K_3 groups of outputs of the neural network 1 and the K_3 groups of outputs of the neural network 2 corresponding to the K_3 groups of input data, recorded as the neural network 1 predicted values and the neural network 2 predicted values respectively;
Step 1.3.7, randomly extract a group of actual values of the neural network 1 and a group of actual values of the neural network 2 from the test set;
Establish a plane coordinate system with frequency as the abscissa and spectral amplitude as the ordinate; draw the predicted envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the predicted values, and draw the actual envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the actual values;
Judge whether the predicted envelope and the actual envelope of the inverter common-mode conducted EMI spectrum match: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the procedure ends; if not, return to step 1.3.4;
Matching means that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum closely agree with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum;
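To illustrate how the predicted turning points can be turned into an envelope and checked against a limit, the sketch below linearly interpolates the four predicted points on a log-frequency axis and compares them with a standard noise-amplitude curve; the interpolation scheme, band edges and limit function are assumptions, not taken from the patent.

```python
import numpy as np

def envelope_dBuV(freqs_hz, f_pts, m_pts):
    """Piecewise-linear envelope (dBuV) through the four predicted turning points,
    interpolated on a log-frequency axis (assumption)."""
    return np.interp(np.log10(freqs_hz), np.log10(f_pts), m_pts)

def meets_standard(f_pts, m_pts, limit_fn, f_lo=150e3, f_hi=30e6, n=500):
    """True when the predicted envelope stays below the standard limit
    over the conducted-EMI band (band edges are illustrative)."""
    f = np.logspace(np.log10(f_lo), np.log10(f_hi), n)
    return bool(np.all(envelope_dBuV(f, f_pts, m_pts) < limit_fn(f)))

# Example: predicted turning points and a flat 79 dBuV limit (illustrative values only)
f_pred = np.array([2e5, 8e5, 5e6, 3e7])
m_pred = np.array([72.0, 65.0, 58.0, 50.0])
print(meets_standard(f_pred, m_pred, lambda f: np.full_like(f, 79.0)))
```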
Step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine a state set S, an action set A_0 and a reward function R;
Step 2.1, determine the state set S and the action set A_0;
Record the current time t of the system, t = 1, 2, 3 ... T, where T is the time of the system termination state, and record the state of the system at the current time t as state s_t, s_t = (U_dc, I)_t, where U_dc is the voltage value of the DC voltage source 10, denoted the DC voltage U_dc, and I is the effective value of the system output current, denoted the output current I;
The state set S is the set of the T states s_t, S = {s_1, s_2, ... s_t, ... s_T}, and S ∈ {(U_dc, I)};
The action taken by the system at time t is denoted action a_t, a_t = (f_sw)_t, where f_sw is the switching frequency of the high-frequency switching tubes, denoted the high-frequency switching frequency f_sw;
The action set A_0 is the set of the T actions a_t, A_0 = {a_1, a_2, ... a_t, ... a_T}, with f_sw_min ≤ f_sw ≤ f_sw_max, where f_sw_min is the lower limit of the high-frequency switching frequency f_sw and f_sw_max is its upper limit;
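The sketch below shows one way to represent such a state set and bounded action set in code; the numeric ranges are taken from the embodiment described later (U_dc 600 to 1200 V, I 100 to 120 A, f_sw 10 to 80 kHz), while the uniform sampling itself is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                        # number of states per episode (embodiment value)

# State set S: pairs (U_dc, I) sampled from the operating ranges
U_dc = rng.uniform(600.0, 1200.0, size=T)     # DC voltage in volts
I_out = rng.uniform(100.0, 120.0, size=T)     # output-current RMS in amperes
states = np.stack([U_dc, I_out], axis=1)      # s_t = (U_dc, I)_t

# Action set A_0: high-frequency switching frequency bounded to [f_sw_min, f_sw_max]
F_SW_MIN, F_SW_MAX = 10_000.0, 80_000.0

def clip_action(f_sw):
    """Keep an action a_t = f_sw inside the admissible range."""
    return float(np.clip(f_sw, F_SW_MIN, F_SW_MAX))

print(states[0], clip_action(95_000.0))       # e.g. [U_dc, I] and 80000.0
```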
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
the values of the efficiency optimization model and the power density optimization model of the system are not in the same magnitude, and normalization processing is carried out to ensure that the values of the two optimization models are between 0 and 1;
In the efficiency optimization model the total system loss P_loss is the optimization target f_1, and in the power density optimization model the system volume V is the optimization target f_2;
Introduce the optimization targets f_α, α = 1, 2, and normalize each optimization target f_α to obtain the normalized optimization target f_α*:
f_α* = (f_α − f_α,min) / (f_α,max − f_α,min)
where f_α,min is the minimum value of the optimization target and f_α,max is its maximum value;
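A minimal sketch of this min-max normalization is shown below; the loss and volume bounds are illustrative placeholders, not values from the patent.

```python
def normalize(value, v_min, v_max):
    """Min-max normalization of an optimization target to the range [0, 1]."""
    return (value - v_min) / (v_max - v_min)

# Illustrative bounds only (f_1 = total loss P_loss, f_2 = system volume V)
P_LOSS_MIN, P_LOSS_MAX = 500.0, 3000.0        # watts, placeholder range
V_MIN, V_MAX = 1.0e-3, 6.0e-3                 # cubic metres, placeholder range

f1_norm = normalize(1200.0, P_LOSS_MIN, P_LOSS_MAX)   # normalized loss
f2_norm = normalize(2.5e-3, V_MIN, V_MAX)             # normalized volume
print(round(f1_norm, 3), round(f2_norm, 3))
```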
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
The reward function R represents the weighted sum of the reward values generated by all actions of the system from the current state to the termination state, and can be written as
R = Σ_{k=t}^{T} γ^(k−t) · r_k
where r_t is the single-step reward value obtained when the system takes action a_t in state s_t at time t, and γ is the discount factor, which represents the degree to which the passage of time affects the reward value;
When the predicted envelope of the inverter common-mode conducted EMI spectrum lies entirely below the noise amplitude curve of the EMI standard, the single-step reward r_t rewards the weighted, normalized optimization targets together with the reward C for meeting the standard;
when part of the predicted envelope of the inverter common-mode conducted EMI spectrum lies above the noise amplitude curve of the EMI standard, the single-step reward r_t is penalized through the penalty coefficient;
where w_α, α = 1, 2, are the weight coefficients, 0 < w_α < 1 and w_1 + w_2 = 1, and C is the reward for EMI meeting the standard;
step 3, offline learning of a DDPG algorithm;
Arbitrarily extract D states s_t from the state set S to form a training data set for offline learning, D = 4T/5; according to the state set S, the action set A_0 and the reward function R obtained in step 2, perform offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y);
The DDPG algorithm comprises 4 neural networks: the neural network parameters of the online policy network are recorded as the first neural network parameters θ^μ, the neural network parameters of the target policy network as the second neural network parameters θ^μ′, the neural network parameters of the online evaluation network as the third neural network parameters θ^Q, and the neural network parameters of the target evaluation network as the fourth neural network parameters θ^Q′;
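For orientation, the sketch below sets up the four DDPG networks (online and target actor, online and target critic) for the two-dimensional state (U_dc, I) and one-dimensional action f_sw; the hidden-layer sizes and activations are assumptions.

```python
import copy
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])        # drop the final activation

# Online policy (actor) mu(s | theta_mu): state (U_dc, I) -> action f_sw
actor = mlp([2, 64, 64, 1])
# Online evaluation (critic) Q(s, a | theta_Q): (state, action) -> scalar score
critic = mlp([3, 64, 64, 1])
# Target networks start as copies, so theta_mu' = theta_mu and theta_Q' = theta_Q
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
```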
Given the training step number step and the maximum step number step_max, and given the training round number m and the maximum training round number M, step = 1, 2, 3 ... step_max and m = 1, 2, 3 ... M, i.e. each training round contains step_max training steps and M training rounds are carried out in total;
Define the average value of the reward function R in each training round as the average reward R_avg; during each training round m, the 4 neural networks of the DDPG algorithm are all updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y);
The expression of the optimal strategy π(s_y) is:
π(s_y) = a_y
where s_y is the state value input to the online policy network corresponding to the optimal strategy, s_y = (U_dc, I)_y, and (U_dc, I)_y is the DC voltage U_dc and output current I in the state set S corresponding to the optimal strategy; a_y is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action a_y, a_y = (f_sw)_y, and (f_sw)_y is the high-frequency switching frequency f_sw in the action set A_0 corresponding to the optimal strategy π(s_y);
Output the optimal action a_y.
Step 4, apply the optimal action a_y;
Step 4.1, first reformulate the states s_t in the state set S that were not selected for the training data set into an application data set, then randomly extract j_max states s_t from the application data set and redefine them as application states s_β, β = 1, 2, 3 ... j_max, with application state s_β = (U_dc, I)_β, i.e. the application state s_β is a set of states given by the DC voltage U_dc and the output current I;
Step 4.2, substitute the optimal action a_y output in step 3 into the j_max application states s_β to obtain the optimal application action output under each application state s_β, β = 1, 2, 3 ... j_max;
Step 4.3, substitute the application states s_β = (U_dc, I)_β and the corresponding optimal application actions into the efficiency optimization model, the power density optimization model and the EMI optimization model established in step 1, so that on the premise of meeting the EMI standard the optimal system efficiency and optimal system power density are achieved, β = 1, 2, 3 ... j_max, and for any state {(U_dc, I)} in the system state set S the efficiency and power density are maximized while the EMI meets the standard.
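The application stage can be pictured as the short sketch below: the trained online policy network maps each application state to a switching frequency, which is then evaluated with the efficiency, power-density and EMI models; the three model callables are placeholders standing in for the models of step 1.

```python
import torch

def apply_policy(actor, app_states, efficiency_model, density_model, emi_model):
    """Evaluate the learned strategy on application states s_beta = (U_dc, I)_beta.
    The three *_model callables are placeholders for the step-1 models."""
    results = []
    with torch.no_grad():
        for u_dc, i_out in app_states:
            s = torch.tensor([[u_dc, i_out]], dtype=torch.float32)
            f_sw = float(actor(s))                     # optimal application action
            if emi_model(u_dc, f_sw):                  # EMI standard must be met
                results.append((u_dc, i_out, f_sw,
                                efficiency_model(u_dc, i_out, f_sw),
                                density_model(f_sw)))
    return results
```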
Further, in step 3, the specific steps of performing offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y) are as follows:
Step 3.1, initialize the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′, and let θ^μ′ = θ^μ and θ^Q′ = θ^Q; initialize the capacity of the experience replay pool P as D; initialize the learning rate α_Q of the online evaluation network, the learning rate α_μ of the online policy network and the moving-average update parameter τ, with 0 < α_Q < 1, 0 < α_μ < 1 and 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|θ^μ), where a is the action value output by the online policy network, a corresponds to an individual in the action set A_0 and a = f_sw; s is the state value input to the online policy network, s corresponds to an individual in the state set S and s = (U_dc, I); μ is the strategy obtained by the online policy network from the first neural network parameters θ^μ and the input state value s;
Step 3.2, input the state s_t of the system at time t into the online policy network to obtain the online policy network output μ(s_t|θ^μ), and add noise δ_t to obtain the finally output action a_t:
a_t = μ(s_t|θ^μ) + δ_t
Step 3.3, the system executes action a_t in state s_t, transitions to the new state s_(t+1) and at the same time obtains the single-step reward value r_t of executing action a_t; (s_t, a_t, r_t, s_(t+1)) is called a state transition sequence and is stored in the experience replay pool P, and at the next moment t+1 the system enters state s_(t+1);
Circularly executing the steps 3.2 to 3.3, recording the number of state transition sequences in the experience playback pool P as N, entering the step 3.4 if N=D, otherwise returning to the step 3.2;
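A compact sketch of the interaction loop of steps 3.2 and 3.3, including the exploration noise and the experience replay pool, is given below; the environment step function, the Gaussian noise and the capacity value are assumptions.

```python
import random
import torch

replay_pool = []          # experience replay pool P
D = 16                    # pool capacity (D = 4T/5 in the description)

def interact(actor, env_step, states, noise_scale=1000.0):
    """Fill the replay pool with state transition sequences (s_t, a_t, r_t, s_t+1)."""
    t = 0
    while len(replay_pool) < D:
        s_t = states[t % len(states)]
        with torch.no_grad():
            a_t = float(actor(torch.tensor([s_t], dtype=torch.float32)))
        a_t += random.gauss(0.0, noise_scale)          # exploration noise delta_t (assumed Gaussian)
        r_t, s_next = env_step(s_t, a_t)               # single-step reward and new state
        replay_pool.append((s_t, a_t, r_t, s_next))
        t += 1
```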
Step 3.4, randomly extract n state transition sequences from the experience replay pool P, where n < D, and take these n state transition sequences as the mini-batch data for training the online policy network and the online evaluation network; the kth state transition sequence in the mini-batch is recorded as (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n;
Step 3.5, from the mini-batch data (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n, obtained in step 3.4, calculate the target cumulative reward y_k and the error function L(θ^Q):
y_k = r_k + γ·Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′)
L(θ^Q) = (1/n)·Σ_k (y_k − Q(s_k, a_k|θ^Q))²
where Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′) is the scoring value output by the target evaluation network, μ′(s_(k+1)|θ^μ′) is the action value output by the target policy network, and s_(k+1) is the state value input to the target evaluation network and the target policy network; Q(s_k, a_k|θ^Q) is the scoring value output by the online evaluation network, and s_k and a_k are the state value and action value input to the online evaluation network;
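The sketch below computes y_k and L(θ^Q) for a mini-batch using the networks defined in the earlier sketch; the discount factor and tensor shapes are illustrative, following standard DDPG.

```python
import torch
import torch.nn as nn

def critic_targets_and_loss(batch, actor_target, critic, critic_target, gamma=0.9):
    """y_k = r_k + gamma * Q'(s_{k+1}, mu'(s_{k+1})) and the mean-squared error L(theta_Q)."""
    s, a, r, s_next = batch                      # tensors of shape [n, 2], [n, 1], [n, 1], [n, 2]
    with torch.no_grad():
        a_next = actor_target(s_next)            # mu'(s_{k+1} | theta_mu')
        q_next = critic_target(torch.cat([s_next, a_next], dim=1))
        y = r + gamma * q_next                   # target cumulative reward y_k
    q = critic(torch.cat([s, a], dim=1))         # Q(s_k, a_k | theta_Q)
    return y, nn.functional.mse_loss(q, y)       # L(theta_Q)
```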
Step 3.6, the online evaluation network updates θ^Q by minimizing the error function L(θ^Q); the online policy network updates θ^μ through the deterministic policy gradient ∇_{θ^μ}J; and the target evaluation network and the target policy network update θ^Q′ and θ^μ′ by the moving-average method:
∇_{θ^μ}J ≈ (1/n)·Σ_k ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k}
θ^Q ← θ^Q − α_Q·∇_{θ^Q} L(θ^Q)
θ^μ ← θ^μ + α_μ·∇_{θ^μ}J
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where ∇ denotes the partial derivative: ∇_{θ^μ}J is the derivative of the policy objective J with respect to θ^μ; ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} is the derivative of the scoring value output by the online evaluation network with respect to the action value a when its input is s = s_k, a = μ(s_k); ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k} is the derivative of the action value output by the online policy network with respect to θ^μ when its input is s = s_k; ∇_{θ^Q} L(θ^Q) is the derivative of the error function L(θ^Q) with respect to θ^Q; and the left-hand sides of the last four expressions are the updated third, first, fourth and second neural network parameters respectively;
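In code, the three update rules of step 3.6 can be sketched as follows, reusing the networks and the critic_targets_and_loss helper from the earlier sketches; the optimizers and their hyperparameters are assumptions.

```python
import torch

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.9, tau=0.01):
    """One DDPG update: critic by minimizing L(theta_Q), actor by the
    deterministic policy gradient, targets by the moving-average (soft) update."""
    s, a, r, s_next = batch
    # Critic update: minimize L(theta_Q)
    _, critic_loss = critic_targets_and_loss(batch, actor_target, critic,
                                             critic_target, gamma)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update: ascend Q(s, mu(s)) (equivalently minimize its negative)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft target updates: theta' <- tau*theta + (1 - tau)*theta'
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```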
Step 3.7, each completion of steps 3.4 to 3.6 finishes the training process of one step; when step < step_max, repeat steps 3.4 to 3.6; when step = step_max, the training process of one round is finished, and when m < M the next round restarts from step 3.2 to step 3.6; when m = M, the training process of all M rounds is finished;
Step 3.8, the training algorithm ends and the optimal strategy π(s_y) = a_y is output; the average reward of a training round is recorded as R_avg;
During the M training rounds, the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′ are updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y).
Compared with the prior art, the invention has the beneficial effects that:
(1) In the multi-objective optimization of the inverter, the invention uses a neural network to model the EMI common-mode noise of the inverter; the neural network model of the EMI common-mode noise is built from a small amount of simulation data, which greatly improves the optimization efficiency, and the envelope of the EMI common-mode noise spectrum can be obtained quickly even when the operating condition changes.
(2) The invention uses the deep reinforcement learning algorithm DDPG to perform multi-objective optimization on the inverter, which can handle complex high-dimensional design variables, avoids design failure of the inverter, finds an optimal scheme that satisfies the optimization objectives, and fully improves the performance of the inverter.
Drawings
FIG. 1 is a topology of an inverter according to the present invention;
FIG. 2 is a block diagram of an inverter multi-objective optimization method in accordance with the present invention;
FIG. 3 is a flow chart of an inverter multi-objective optimization method according to the present invention;
FIG. 4 is a flow chart of the inverter of the present invention using a neural network to optimally model EMI common mode noise;
FIG. 5 is a block diagram of two neural networks of an EMI optimization model in the inverter multi-objective optimization method of the present invention;
FIG. 6 is a comparison diagram of an optimized modeling of EMI common mode noise in an embodiment of the present invention;
FIG. 7 is a graph showing the convergence effect of average rewards in an embodiment of the invention;
FIG. 8 is a training effect diagram of motion variables in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a topology diagram of a photovoltaic inverter in an embodiment of the present invention. As can be seen from fig. 1, the inverter comprises a dc voltage source 10, a three-phase three-level ANPC inverter circuit 20, a filter circuit 30 and a load 40;
the three-phase three-level ANPC inverter circuit 20 comprises two identical supporting capacitors and an inverter main circuit; the two supporting capacitors are denoted supporting capacitor Cap_1 and supporting capacitor Cap_2, Cap_1 and Cap_2 are connected in series between the DC positive bus P and the DC negative bus E of the DC voltage source 10, and the connection point of Cap_1 and Cap_2 is denoted the DC-bus midpoint O;
the inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm; each phase bridge arm contains 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit contains 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted S_ij and the 18 anti-parallel diodes are denoted D_ij, where i denotes the phase, i = a, b, c, and j denotes the serial number of the switching tube or diode, j = 1, 2, 3, 4, 5, 6; the A-phase, B-phase and C-phase bridge arms are connected in parallel between the DC positive bus P and the DC negative bus E; in each phase bridge arm, switching tubes S_i1, S_i2, S_i3 and S_i4 are connected in series in sequence: the input of S_i1 is connected to the DC positive bus P, the output of S_i1 is connected to the input of S_i2, the output of S_i2 is connected to the input of S_i3, the output of S_i3 is connected to the input of S_i4, and the output of S_i4 is connected to the DC negative bus E; the input of S_i5 is connected to the output of S_i1 and the output of S_i5 is connected to the DC-bus midpoint O; the input of S_i6 is connected to the DC-bus midpoint O and the output of S_i6 is connected to the output of S_i3; the connection point of S_i2 and S_i3 is denoted the inverter output point φ_i; switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with the same switching frequency, and switching tubes S_i2 and S_i3 are high-frequency switching tubes with the same switching frequency;
the filter circuit 30 comprises a three-phase filter inductance L_f and a three-phase filter capacitor C_0; one end of L_f is connected to the inverter output point φ_i and the other end is connected to the load 40, and C_0 is connected in parallel between the three-phase filter inductance L_f and the load 40;
In the present embodiment, the switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with a switching frequency of 50 Hz, the three-phase filter capacitor C_0 = 3 μF, and Cap_1 = Cap_2 = 20 μF.
Fig. 2 is a block diagram of the multi-objective optimization method of the photovoltaic inverter according to the present invention, and fig. 3 is a flowchart of the method. As can be seen from fig. 2 and fig. 3, the inverter multi-objective optimization method performs multi-objective optimization on the inverter based on deep reinforcement learning, and specifically comprises the following steps:
step 1, establishing an optimization target model;
The inverter is denoted as the system; the 18 switching tubes with anti-parallel diodes are decomposed into 18 switching tubes and 18 anti-parallel diodes, and the losses and volumes of the supporting capacitors Cap_1 and Cap_2 and the three-phase filter capacitor C_0 are assumed to be negligible;
step 1.1, establishing an efficiency optimization model
Taking the efficiency η of the system as the objective, an efficiency optimization model is established with the expression:
η = (P_w − P_loss) / P_w
where P_loss is the total loss of the system, P_loss = P_T + P_L; P_T is the total loss of the 18 switching tubes and 18 anti-parallel diodes; P_L is the loss of the three-phase filter inductance L_f; and P_w is the rated input power of the system;
step 1.2, establishing a power density optimization model;
Taking the power density σ of the system as the objective, a power density optimization model is established with the expression:
σ = P_w / V
where P_w is the rated input power of the system and V is the system volume, V = V_T + 3V_L; V_T is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and V_L is the magnetic-core volume of a single-phase filter inductor of the three-phase filter inductance L_f;
In the present embodiment, the rated input power of the system is P_w = 140×10^3 W and V_T = 3.98×10^-4 m^3.
Fig. 4 is a flowchart of the inverter of the present invention using a neural network to optimally model EMI common mode noise. As can be seen from fig. 4, the EMI optimization model is established by using two neural networks, and the specific steps are as follows:
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts an envelope of the EMI common mode noise spectrum with an artificial neural network to represent an actual common mode noise level, and compares the predicted spectrum envelope with a noise magnitude curve in the standard to determine whether the standard is satisfied.
Step 1.3.1, determining input variables and output variables of an artificial neural network;
The artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables: the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency f_1, frequency f_2, frequency f_3 and frequency f_4;
the neural network 2 has 4 input variables: the voltage value U_dc of the DC voltage source, the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequencies f_1, f_2, f_3 and f_4, denoted spectral amplitude M_f1, M_f2, M_f3 and M_f4;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data (f_swN, L_fN, L_CMN) and the neural network 1 simulated output values (f_1N, f_2N, f_3N, f_4N), where N is the group serial number, N = 1, 2, 3 ... K;
the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data (U_dcN, f_swN, L_fN, L_CMN) and the neural network 2 simulated output values (M_f1N, M_f2N, M_f3N, M_f4N), where N is the group serial number, N = 1, 2, 3 ... K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
fig. 5 shows a network configuration diagram of the neural network 1 model and the neural network 2 of the present embodiment.
Step 1.3.4, grouping sample data;
Divide the sample data obtained in step 1.3.2 into a training subset, a verification subset and a test set, where the training subset contains K_1 groups of sample data, the verification subset contains K_2 groups of sample data, the test set contains K_3 groups of sample data, and K_1 + K_2 + K_3 = K;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
randomly extracting a group of input data from the training subset obtained in the step 1.3.4, and inputting the input data into the neural network 1 and the neural network 2 to obtain output corresponding to the input data; the parameters of the neural network 1 and the neural network 2 are updated by adopting an error back propagation gradient descent algorithm, and the updated neural network 1 and the updated neural network 2 are obtained;
Then input the K_2 groups of input data of the verification subset obtained in step 1.3.4 into the updated neural network 1 and neural network 2 respectively, obtaining the K_2 groups of outputs corresponding to the K_2 groups of input data, including the outputs of the neural network 1 (the predicted turning-point frequencies) and the outputs of the neural network 2 (the predicted spectral amplitudes), for the group serial numbers N_2 = K_1+1, K_1+2, ... K_1+K_2;
Define the root mean square error δ1 between the neural network 1 outputs and the corresponding simulated output values on the verification subset, and the root mean square error δ2 between the neural network 2 outputs and the corresponding simulated output values on the verification subset;
Given a first target error e_1 and a second target error e_2, make the following judgment:
if δ1 < e_1 and δ2 < e_2, the neural network 1 model and the neural network 2 model are considered built and the method proceeds to step 1.3.6; otherwise, return to step 1.3.5;
Step 1.3.6, input the K_3 groups of input data of the test set obtained in step 1.3.4 into the neural network 1 model and the neural network 2 model constructed in step 1.3.5, obtaining the K_3 groups of outputs of the neural network 1 and the K_3 groups of outputs of the neural network 2 corresponding to the K_3 groups of input data, recorded as the neural network 1 predicted values and the neural network 2 predicted values respectively, for the group serial numbers N_3 = K_1+K_2+1, K_1+K_2+2, ... K;
Step 1.3.7, randomly extract a group of actual values of the neural network 1 and a group of actual values of the neural network 2 from the test set;
Establish a plane coordinate system with frequency as the abscissa and spectral amplitude as the ordinate; draw the predicted envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the predicted values, and draw the actual envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the actual values;
Judge whether the predicted envelope and the actual envelope of the inverter common-mode conducted EMI spectrum match: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the procedure ends; if not, return to step 1.3.4;
Matching means that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum closely agree with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum;
In the present embodiment, U_dc = 200 V, the high-frequency switching frequency f_sw = 50 kHz, the filter inductance L_f = 90 μH, and the common-mode inductance L_CM = 120 μH.
FIG. 6 is a comparison diagram of the optimized modeling of EMI common mode noise in an embodiment of the present invention; it shows that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum output by the neural networks are basically consistent with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum.
Step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine a state set S, an action set A_0 and a reward function R;
Step 2.1, determine the state set S and the action set A_0;
Record the current time t of the system, t = 1, 2, 3 ... T, where T is the time of the system termination state, and record the state of the system at the current time t as state s_t, s_t = (U_dc, I)_t, where U_dc is the voltage value of the DC voltage source 10, denoted the DC voltage U_dc, and I is the effective value of the system output current, denoted the output current I;
The state set S is the set of the T states s_t, S = {s_1, s_2, ... s_t, ... s_T}, and S ∈ {(U_dc, I)};
The action taken by the system at time t is denoted action a_t, a_t = (f_sw)_t, where f_sw is the switching frequency of the high-frequency switching tubes, denoted the high-frequency switching frequency f_sw;
The action set A_0 is the set of the T actions a_t, A_0 = {a_1, a_2, ... a_t, ... a_T}, with f_sw_min ≤ f_sw ≤ f_sw_max, where f_sw_min is the lower limit of the high-frequency switching frequency f_sw and f_sw_max is its upper limit;
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
the values of the efficiency optimization model and the power density optimization model of the system are not in the same magnitude, and normalization processing is carried out to ensure that the values of the two optimization models are between 0 and 1;
In the efficiency optimization model the total system loss P_loss is the optimization target f_1, and in the power density optimization model the system volume V is the optimization target f_2;
Introduce the optimization targets f_α, α = 1, 2, and normalize each optimization target f_α to obtain the normalized optimization target f_α*:
f_α* = (f_α − f_α,min) / (f_α,max − f_α,min)
where f_α,min is the minimum value of the optimization target and f_α,max is its maximum value;
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
The reward function R represents the weighted sum of the reward values generated by all actions of the system from the current state to the termination state, and can be written as
R = Σ_{k=t}^{T} γ^(k−t) · r_k
where r_t is the single-step reward value obtained when the system takes action a_t in state s_t at time t, and γ is the discount factor, which represents the degree to which the passage of time affects the reward value.
When the predicted envelope of the inverter common-mode conducted EMI spectrum lies entirely below the noise amplitude curve of the EMI standard, the single-step reward r_t rewards the weighted, normalized optimization targets together with the reward C for meeting the standard;
when part of the predicted envelope of the inverter common-mode conducted EMI spectrum lies above the noise amplitude curve of the EMI standard, the single-step reward r_t is penalized through the penalty coefficient;
where w_α, α = 1, 2, are the weight coefficients, 0 < w_α < 1 and w_1 + w_2 = 1, and C is the reward for EMI meeting the standard;
In the present embodiment, the value range of U_dc is 600-1200 V, the value range of I is 100-120 A, f_sw_min = 10000 Hz, f_sw_max = 80000 Hz, T = 20, w_1 = w_2 = 0.5, C = 500 and γ = 0.9.
Step 3, offline learning of a DDPG algorithm;
Arbitrarily extract D states s_t from the state set S to form a training data set for offline learning, D = 4T/5; according to the state set S, the action set A_0 and the reward function R obtained in step 2, perform offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y);
The DDPG algorithm comprises 4 neural networks: the neural network parameters of the online policy network are recorded as the first neural network parameters θ^μ, the neural network parameters of the target policy network as the second neural network parameters θ^μ′, the neural network parameters of the online evaluation network as the third neural network parameters θ^Q, and the neural network parameters of the target evaluation network as the fourth neural network parameters θ^Q′;
Given the training step number step and the maximum step number step_max, and given the training round number m and the maximum training round number M, step = 1, 2, 3 ... step_max and m = 1, 2, 3 ... M, i.e. each training round contains step_max training steps and M training rounds are carried out in total;
In the present embodiment, step_max = 100 and M = 2500.
Define the average value of the reward function R in each training round as the average reward R_avg; during each training round m, the 4 neural networks of the DDPG algorithm are all updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y);
The expression of the optimal strategy π(s_y) is:
π(s_y) = a_y
where s_y is the state value input to the online policy network corresponding to the optimal strategy, s_y = (U_dc, I)_y, and (U_dc, I)_y is the DC voltage U_dc and output current I in the state set S corresponding to the optimal strategy; a_y is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action a_y, a_y = (f_sw)_y, and (f_sw)_y is the high-frequency switching frequency f_sw in the action set A_0 corresponding to the optimal strategy π(s_y);
Output the optimal action a_y.
Step 4, according to the optimal action a y Performing application;
step 4.1, first, the states S selected from the state set S except the training data set t Reformulating an application data set and then randomly extracting j from the application data set max Individual states s t And redefined as application state s β ,β=1,2,3...j max Application state s β =(U dc ,I) β I.e. application state s β Is a direct current voltage U dc And a set of states at an output current I;
step 4.2, the optimal action a output in the step 3 is processed y Substitution j max Individual application states s β In (3) different application states s are obtained β Down-output optimal application actionsβ=1,2,3...j max
Step 4.3, applying state s β =(U dc ,I) β Optimal application actionsRespectively substituting the power density optimization model, the power density optimization model and the EMI optimization model established in the step 1 to achieve the optimal efficiency of the system on the premise of meeting the EMI standard >Optimal power density of the system->β=1,2,3...j max Any state { (U) in the system state set S is caused to be dc Maximizing efficiency, power density, and at the same time EMI meets the criteria.
In this embodiment, the specific steps of step 3, performing offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y), are as follows:
Step 3.1, initialize the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′, and let θ^μ′ = θ^μ and θ^Q′ = θ^Q; initialize the capacity of the experience replay pool P as D; initialize the learning rate α_Q of the online evaluation network, the learning rate α_μ of the online policy network and the moving-average update parameter τ, with 0 < α_Q < 1, 0 < α_μ < 1 and 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|θ^μ), where a is the action value output by the online policy network, a corresponds to an individual in the action set A_0 and a = f_sw; s is the state value input to the online policy network, s corresponds to an individual in the state set S and s = (U_dc, I); μ is the strategy obtained by the online policy network from the first neural network parameters θ^μ and the input state value s;
Step 3.2, input the state s_t of the system at time t into the online policy network to obtain the online policy network output μ(s_t|θ^μ), and add noise δ_t to obtain the finally output action a_t:
a_t = μ(s_t|θ^μ) + δ_t
In this embodiment, α_Q = 0.002, α_μ = 0.001, τ = 0.01, and the noise δ_t = 0.9995^m × 1000.
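The exploration noise of the embodiment therefore decays geometrically with the training round; a small sketch, assuming the noise is added directly to the switching-frequency action:

```python
def exploration_noise(m, base=1000.0, decay=0.9995):
    """Embodiment noise schedule delta_t = 0.9995**m * 1000 (m = training round number)."""
    return (decay ** m) * base

print(exploration_noise(0), round(exploration_noise(1000), 1), round(exploration_noise(2500), 1))
# 1000.0, ~606.5, ~286.4
```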
Step 3.3, the system executes action a_t in state s_t, transitions to the new state s_(t+1) and at the same time obtains the single-step reward value r_t of executing action a_t; (s_t, a_t, r_t, s_(t+1)) is called a state transition sequence and is stored in the experience replay pool P, and at the next moment t+1 the system enters state s_(t+1);
Circularly executing the steps 3.2 to 3.3, recording the number of state transition sequences in the experience playback pool P as N, entering the step 3.4 if N=D, otherwise returning to the step 3.2;
Step 3.4, randomly extract n state transition sequences from the experience replay pool P, where n < D, and take these n state transition sequences as the mini-batch data for training the online policy network and the online evaluation network; the kth state transition sequence in the mini-batch is recorded as (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n;
Step 3.5, from the mini-batch data (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n, obtained in step 3.4, calculate the target cumulative reward y_k and the error function L(θ^Q):
y_k = r_k + γ·Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′)
L(θ^Q) = (1/n)·Σ_k (y_k − Q(s_k, a_k|θ^Q))²
where Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′) is the scoring value output by the target evaluation network, μ′(s_(k+1)|θ^μ′) is the action value output by the target policy network, and s_(k+1) is the state value input to the target evaluation network and the target policy network; Q(s_k, a_k|θ^Q) is the scoring value output by the online evaluation network, and s_k and a_k are the state value and action value input to the online evaluation network;
Step 3.6, the online evaluation network updates θ^Q by minimizing the error function L(θ^Q); the online policy network updates θ^μ through the deterministic policy gradient ∇_{θ^μ}J; and the target evaluation network and the target policy network update θ^Q′ and θ^μ′ by the moving-average method:
∇_{θ^μ}J ≈ (1/n)·Σ_k ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k}
θ^Q ← θ^Q − α_Q·∇_{θ^Q} L(θ^Q)
θ^μ ← θ^μ + α_μ·∇_{θ^μ}J
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where ∇ denotes the partial derivative: ∇_{θ^μ}J is the derivative of the policy objective J with respect to θ^μ; ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} is the derivative of the scoring value output by the online evaluation network with respect to the action value a when its input is s = s_k, a = μ(s_k); ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k} is the derivative of the action value output by the online policy network with respect to θ^μ when its input is s = s_k; ∇_{θ^Q} L(θ^Q) is the derivative of the error function L(θ^Q) with respect to θ^Q; and the left-hand sides of the last four expressions are the updated third, first, fourth and second neural network parameters respectively;
Step 3.7, each completion of steps 3.4 to 3.6 finishes the training process of one step; when step < step_max, repeat steps 3.4 to 3.6; when step = step_max, the training process of one round is finished, and when m < M the next round restarts from step 3.2 to step 3.6; when m = M, the training process of all M rounds is finished;
Step 3.8, the training algorithm ends and the optimal strategy π(s_y) = a_y is output; the average reward of a training round is recorded as R_avg;
During the M training rounds, the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′ are updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y).
In order to prove the beneficial effects of the invention, the invention is simulated.
FIG. 7 is a chart showing the convergence of the average reward in the embodiment of the present invention; the abscissa in fig. 7 represents the training round number m, m = 1, 2, 3 ... 2500, and the ordinate represents the average reward. As can be seen from fig. 7, when the training round number is between 0 and 500 the average cumulative reward is small and oscillates severely, because in the early exploration phase the agent interacts with the environment randomly to collect experience data and the parameters of the policy network and the evaluation network are not yet updated, so the reward gain is small and fluctuates greatly. When the data in the experience replay pool reach the maximum capacity, i.e. from training round 500 onward, the network parameters begin to be updated, the agent gradually improves its action strategy and the average cumulative reward gradually increases, although the training process is still not very stable. When the training round number reaches m = 1100, the agent has learned the action strategy that minimizes power loss and volume while the EMI meets the standard; the average cumulative reward continues to increase and then stabilizes, the training effect reaches the optimum, and the four neural network parameters θ^μ, θ^μ′, θ^Q and θ^Q′ have been updated to obtain the optimal strategy π(s_y).
In the present embodiment, when $U_{dc}$ = 1200 volts and I = 120 amperes, training is performed for the action $a_t = (f_{sw})_t$ of the action set $A_0$. FIG. 8 shows the convergence of the action variable, the high-frequency switching frequency $f_{sw}$, in the embodiment of the present invention; the abscissa in FIG. 8 is the training round number m and the ordinate is the high-frequency switching frequency $f_{sw}$, m = 1, 2, 3 … 2500. As can be seen from FIG. 8, as the training round number m increases, the high-frequency switching frequency $f_{sw}$ first oscillates up and down, then gradually increases, and finally remains between 28000 Hz and 32000 Hz; when m = 2500 and step = 100, $f_{sw}$ = 33100 Hz is taken as the optimal action variable value, from which the efficiency η of the system reaches its maximum value of 0.9868 and the power density σ reaches 31.495 kW/L.
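For illustration only, the sketch below shows how efficiency and power density can be evaluated once the optimal switching frequency is fixed; `P_w`, `total_loss` and `system_volume` are hypothetical placeholders standing in for the analytic models of steps 1.1 and 1.2, and the numbers are not the patent's data.

```python
# Hedged sketch: evaluate efficiency eta and power density sigma for a switching frequency.
# The rated power and the loss/volume expressions below are illustrative placeholders only.
P_w = 100e3  # rated input power in W (assumed)

def total_loss(f_sw):
    """Placeholder for P_loss = P_T + P_L as a function of the switching frequency."""
    return 900.0 + 0.01 * f_sw            # W, illustrative

def system_volume(f_sw):
    """Placeholder for V = V_T + 3*V_L; the filter volume shrinks as f_sw rises."""
    return 2.0 + 4.0e4 / f_sw             # litres, illustrative

def efficiency(f_sw):
    return (P_w - total_loss(f_sw)) / P_w

def power_density(f_sw):
    return P_w / system_volume(f_sw) / 1e3  # kW per litre

f_sw_opt = 33100.0                          # optimal action found by the trained policy
print(efficiency(f_sw_opt), power_density(f_sw_opt))
```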
It will be readily appreciated by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention; any modifications, equivalents, improvements or alternatives made within the spirit and principles of the invention are intended to fall within the scope of the invention.

Claims (2)

1. An inverter multi-objective optimization method based on deep reinforcement learning, wherein the inverter comprises a direct-current voltage source (10), a three-phase three-level ANPC inverter circuit (20), a filter circuit (30) and a load (40);
the three-phase three-level ANPC inverter circuit (20) comprises two identical supporting capacitors and an inverter main circuit, the two supporting capacitors being denoted supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$; supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$ are connected in series between the direct-current positive bus P and the direct-current negative bus E of the direct-current voltage source (10), and the connection point of supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$ is denoted the DC bus midpoint O;
The inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm, each phase bridge arm comprising 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit comprises 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted switching tube $S_{ij}$ and the 18 anti-parallel diodes are denoted diode $D_{ij}$, where i denotes the three phases, i = a, b, c, and j denotes the serial number of the switching tubes and diodes, j = 1, 2, 3, 4, 5, 6; the A-phase bridge arm, the B-phase bridge arm and the C-phase bridge arm are connected in parallel with one another between the direct-current positive bus P and the direct-current negative bus E; in each of the three phase bridge arms, switching tube $S_{i1}$, switching tube $S_{i2}$, switching tube $S_{i3}$ and switching tube $S_{i4}$ are connected in series in sequence, the input end of switching tube $S_{i1}$ is connected to the direct-current positive bus P, the output end of switching tube $S_{i1}$ is connected to the input end of switching tube $S_{i2}$, the output end of switching tube $S_{i2}$ is connected to the input end of switching tube $S_{i3}$, the output end of switching tube $S_{i3}$ is connected to the input end of switching tube $S_{i4}$, and the output end of switching tube $S_{i4}$ is connected to the direct-current negative bus E; the input end of switching tube $S_{i5}$ is connected to the output end of switching tube $S_{i1}$, the output end of switching tube $S_{i5}$ is connected to the DC bus midpoint O, the input end of switching tube $S_{i6}$ is connected to the DC bus midpoint O, and the output end of switching tube $S_{i6}$ is connected to the output end of switching tube $S_{i3}$; the connection point of switching tube $S_{i2}$ and switching tube $S_{i3}$ is denoted the inverter output point $\varphi_i$; switching tube $S_{i1}$, switching tube $S_{i4}$, switching tube $S_{i5}$ and switching tube $S_{i6}$ are power-frequency switching tubes with the same switching frequency, and switching tube $S_{i2}$ and switching tube $S_{i3}$ are high-frequency switching tubes with the same switching frequency;
the filter circuit (30) comprises a three-phase filter inductance $L_f$ and a three-phase filter capacitor $C_0$; one end of the three-phase filter inductance $L_f$ is connected to the inverter output point $\varphi_i$ and the other end is connected to the load (40), and the three-phase filter capacitor $C_0$ is connected in parallel between the three-phase filter inductance $L_f$ and the load (40);
the method is characterized by comprising the following specific steps:
step 1, establishing an optimization target model;
The inverter is regarded as a system, the 18 switching tubes with anti-parallel diodes are treated as 18 switching tubes and 18 anti-parallel diodes, and the loss and volume of supporting capacitor $Cap_1$, supporting capacitor $Cap_2$ and three-phase filter capacitor $C_0$ are considered negligible;
step 1.1, establishing an efficiency optimization model;
taking the efficiency η of the system as the objective, an efficiency optimization model is established; its expression is:

$\eta = \dfrac{P_w - P_{loss}}{P_w}$

where $P_{loss}$ is the total loss of the system, $P_{loss} = P_T + P_L$, $P_T$ is the total loss of the 18 switching tubes and 18 anti-parallel diodes, $P_L$ is the loss of the three-phase filter inductance $L_f$, and $P_w$ is the rated input power of the system;
step 1.2, establishing a power density optimization model;
taking the power density σ of the system as the objective, a power density optimization model is established; its expression is:

$\sigma = \dfrac{P_w}{V}$

where $P_w$ is the rated input power of the system, V is the system volume, $V = V_T + 3V_L$, $V_T$ is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and $V_L$ is the magnetic-core volume of a single-phase filter inductor in the three-phase filter inductance $L_f$;
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts the envelope curve of the EMI common mode noise spectrum by using an artificial neural network to represent the actual common mode noise level, and compares the predicted spectrum envelope curve with a noise amplitude curve in a standard to judge whether the standard is met;
Step 1.3.1, determining input variables and output variables of an artificial neural network;
the artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables, namely: the high-frequency switching frequency $f_{sw}$, the filter inductance $L_f$ and the common-mode inductance $L_{CM}$; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency $f_1$, frequency $f_2$, frequency $f_3$ and frequency $f_4$;

the neural network 2 has 4 input variables, namely: the voltage value $U_{dc}$ of the direct-current voltage source, the high-frequency switching frequency $f_{sw}$, the filter inductance $L_f$ and the common-mode inductance $L_{CM}$; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequency $f_1$, frequency $f_2$, frequency $f_3$ and frequency $f_4$, denoted spectral amplitude $M_{f1}$, spectral amplitude $M_{f2}$, spectral amplitude $M_{f3}$ and spectral amplitude $M_{f4}$;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data $f_{swN}$, $L_{fN}$, $L_{CMN}$ and the corresponding neural network 1 simulated output values (the simulated values of frequency $f_1$ to frequency $f_4$ for group N), where N is the serial number of each group, N = 1, 2, 3 … K;

the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data $U_{dcN}$, $f_{swN}$, $L_{fN}$, $L_{CMN}$ and the corresponding neural network 2 simulated output values (the simulated values of spectral amplitude $M_{f1}$ to spectral amplitude $M_{f4}$ for group N), where N is the serial number of each group, N = 1, 2, 3 … K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
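As a point of reference for step 1.3.3, a minimal sketch of the two feed-forward networks with the stated layer sizes is given below in Python (PyTorch); the tanh hidden activation is an assumption, since the text fixes only the neuron counts.

```python
# Hedged sketch of the two MLPs of step 1.3.3 (layer sizes 3-8-4 and 4-11-4).
import torch.nn as nn

net1 = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 4))    # f_sw, L_f, L_CM -> f1..f4
net2 = nn.Sequential(nn.Linear(4, 11), nn.Tanh(), nn.Linear(11, 4))  # U_dc, f_sw, L_f, L_CM -> M_f1..M_f4
```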
step 1.3.4, grouping sample data;
the sample data obtained in step 1.3.2 are divided into a training subset, a validation subset and a test set, the training subset containing $K_1$ groups of sample data, the validation subset containing $K_2$ groups of sample data, and the test set containing $K_3$ groups of sample data, with $K_1 + K_2 + K_3 = K$;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
a group of input data is randomly drawn from the training subset obtained in step 1.3.4 and input into the neural network 1 and the neural network 2 to obtain the outputs corresponding to that input data; the parameters of the neural network 1 and the neural network 2 are updated using the error back-propagation gradient-descent algorithm, giving the updated neural network 1 and updated neural network 2;

the $K_2$ groups of input data of the validation subset obtained in step 1.3.4 are then input into the updated neural network 1 and updated neural network 2 respectively, obtaining the $K_2$ groups of outputs corresponding to the $K_2$ groups of input data, including the outputs of the neural network 1 and the outputs of the neural network 2, where the validation-subset group index $N_2$ runs over $N_2 = K_1+1, K_1+2, … K_1+K_2$;
the root-mean-square error δ1 of the neural network 1 and the root-mean-square error δ2 of the neural network 2 are defined as the root-mean-square errors between the validation-subset outputs of the updated networks and the corresponding simulated output values;
a first target error $e_1$ and a second target error $e_2$ are given, and the following judgment is made:

if δ1 < $e_1$ and δ2 < $e_2$, the neural network 1 model and the neural network 2 model are considered built and step 1.3.6 is entered; otherwise, return to step 1.3.5;
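The convergence test of step 1.3.5 can be expressed compactly as below; this numpy sketch assumes the root-mean-square error is taken over all validation samples and all four outputs, which is one reasonable reading of the unreproduced expressions for δ1 and δ2.

```python
# Hedged sketch of the delta1 < e1 and delta2 < e2 check of step 1.3.5.
# The exact normalization of the root-mean-square error is an assumption.
import numpy as np

def rmse(pred, target):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def networks_converged(pred1, sim1, pred2, sim2, e1, e2):
    """pred*/sim*: validation-subset predictions and simulated values of networks 1 and 2."""
    return rmse(pred1, sim1) < e1 and rmse(pred2, sim2) < e2
```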
step 1.3.6, the $K_3$ groups of input data of the test set obtained in step 1.3.4 are input into the neural network 1 model and neural network 2 model constructed in step 1.3.5, obtaining the $K_3$ groups of neural network 1 outputs and $K_3$ groups of neural network 2 outputs corresponding to the $K_3$ groups of input data, recorded respectively as the neural network 1 predicted values and the neural network 2 predicted values, where the test-set group index $N_3$ runs over $N_3 = K_1+K_2+1, K_1+K_2+2, … K$;
step 1.3.7, a group of neural network 1 actual values and neural network 2 actual values is randomly drawn;

a plane coordinate system is established with frequency as the abscissa and spectral amplitude as the ordinate; the predicted values are used to draw the inverter common-mode conducted EMI spectrum prediction envelope on this coordinate system, and the actual values are used to draw the inverter common-mode conducted EMI spectrum actual envelope on this coordinate system;

it is judged whether the inverter common-mode conducted EMI spectrum prediction envelope matches the inverter common-mode conducted EMI spectrum actual envelope: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the prediction ends; if not, return to step 1.3.4;

matching means that the four turning points on the inverter common-mode conducted EMI spectrum prediction envelope closely coincide with the four turning points on the inverter common-mode conducted EMI spectrum actual envelope;
step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine the state set S, the action set $A_0$ and the reward function R;

step 2.1, determine the state set S and the action set $A_0$;
the current time of the system is recorded as t, t = 1, 2, 3 … T, where T is the time of the system termination state; the state of the system at the current time t is recorded as state $s_t$, $s_t = (U_{dc}, I)_t$, where $U_{dc}$ is the voltage value of the direct-current voltage source (10), recorded as the direct-current voltage $U_{dc}$, and I is the effective value of the system output current, recorded as the output current I;

the state set S is the set of the T states $s_t$, S = {$s_1, s_2, … s_t, … s_T$}, and S ∈ {($U_{dc}$, I)};

the action taken by the system at time t is recorded as action $a_t$, $a_t = (f_{sw})_t$, where $f_{sw}$ is the switching frequency of the high-frequency switching tubes, recorded as the high-frequency switching frequency $f_{sw}$;

the action set $A_0$ is the set of the T actions $a_t$, $A_0$ = {$a_1, a_2, … a_t, … a_T$}, with $f_{sw\_min} \le f_{sw} \le f_{sw\_max}$, where $f_{sw\_min}$ is the lower limit of the high-frequency switching frequency $f_{sw}$ and $f_{sw\_max}$ is its upper limit;
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
since the values of the efficiency optimization model and the power density optimization model of the system are not of the same order of magnitude, normalization is carried out so that the values of the two optimization models lie between 0 and 1;

the total system loss $P_{loss}$ in the efficiency optimization model is the optimization objective $f_1$, and the system volume V in the power density optimization model is the optimization objective $f_2$;

the optimization objectives $f_\alpha$, α = 1, 2, are introduced; each optimization objective $f_\alpha$ is normalized to obtain the normalized optimization objectives $\hat{f}_1$ and $\hat{f}_2$, with the expression:

$\hat{f}_\alpha = \dfrac{f_\alpha - f_{\alpha,min}}{f_{\alpha,max} - f_{\alpha,min}}$

where $f_{\alpha,min}$ is the minimum value of the optimization objective and $f_{\alpha,max}$ is the maximum value of the optimization objective;
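A short sketch of the min-max normalization of step 2.2.1 follows; in practice the bounds would come from sweeping the loss and volume models over the design space, which is an assumption here, and the numbers are illustrative.

```python
# Hedged sketch of the normalization of step 2.2.1.
def normalize(f_value, f_min, f_max):
    """Map an optimization objective onto [0, 1]; f_min and f_max are assumed known bounds."""
    return (f_value - f_min) / (f_max - f_min)

# Illustrative use: normalized loss objective f1 and volume objective f2.
f1_hat = normalize(1200.0, f_min=800.0, f_max=2000.0)   # total loss P_loss in W
f2_hat = normalize(3.1, f_min=2.5, f_max=5.0)            # system volume V in litres
```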
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
the bonus function R represents a weighted sum of the bonus values generated by all actions of the system from the current state to the end state, expressed as follows:
Wherein r is t For the state s of the system at time t t Take action a t The obtained single-step rewarding value is gamma which is a discount factor, wherein the discount factor gamma represents the influence degree of the length of time on the rewarding value;
when the inverter common-mode conducted EMI spectrum prediction envelope lies entirely below the noise amplitude curve of the EMI standard, the single-step reward value $r_t$ takes the form used for compliant designs;

when any part of the inverter common-mode conducted EMI spectrum prediction envelope lies above the noise amplitude curve of the EMI standard, $r_t$ takes the penalized form;

in these expressions the penalty coefficient penalizes EMI violations, $w_\alpha$ is a weight coefficient with α = 1, 2, 0 < $w_\alpha$ < 1 and $w_1 + w_2 = 1$, and c is the prize awarded when EMI meets the standard;
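The exact single-step reward expressions appear as formulas in the original filing and are not reproduced here; the sketch below shows one plausible weighted-sum form consistent with the stated coefficients (weights w1, w2, prize c and a penalty when EMI exceeds the standard) and should be read as an assumption rather than the claimed formula.

```python
# Hedged sketch of a single-step reward consistent with step 2.2.2; the exact
# functional form is not reproduced from the patent, so this is an assumed variant.
def step_reward(f1_hat, f2_hat, emi_ok, w1=0.5, w2=0.5, c=1.0, penalty=10.0):
    """f1_hat, f2_hat: normalized loss and volume (smaller is better); emi_ok: envelope below the standard."""
    base = w1 * (1.0 - f1_hat) + w2 * (1.0 - f2_hat)
    return base + c if emi_ok else base - penalty
```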
step 3, offline learning of a DDPG algorithm;
D states $s_t$ are arbitrarily drawn from the state set S to form the training data set for offline learning, D = 4T/5; according to the state set S, the action set $A_0$ and the reward function R obtained in step 2, offline learning is performed using the deep-reinforcement-learning DDPG algorithm to obtain the optimal strategy $\pi(s_y)$;
The DDPG algorithm comprises 4 neural networks; the neural network parameters of the online policy network are recorded as the first neural network parameters $\theta^{\mu}$, those of the target policy network as the second neural network parameters $\theta^{\mu'}$, those of the online evaluation network as the third neural network parameters $\theta^{Q}$, and those of the target evaluation network as the fourth neural network parameters $\theta^{Q'}$;
a training step number step and a maximum step number $step_{max}$ are given, and a training round number m and a maximum training round number M are given, step = 1, 2, 3 … $step_{max}$, m = 1, 2, 3 … M, i.e. each training round contains $step_{max}$ training steps, and M training rounds are carried out in total;
the average value of the reward function R in each training round is defined and recorded as the average reward $\bar{R}$; during each training round m, the 4 neural networks contained in the DDPG algorithm are all updated in the direction that maximizes the average reward $\bar{R}$, yielding the optimal strategy $\pi(s_y)$;
the expression of the optimal strategy $\pi(s_y)$ is as follows:

$\pi(s_y) = a_y$

where $s_y$ is the state value input to the online policy network corresponding to the optimal strategy, $s_y = (U_{dc}, I)_y$, and $(U_{dc}, I)_y$ is the direct-current voltage $U_{dc}$ and output current I in the state set S corresponding to the optimal strategy; $a_y$ is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action $a_y$, with $a_y = (f_{sw})_y$, where $(f_{sw})_y$ is the high-frequency switching frequency $f_{sw}$ in the action set $A_0$ corresponding to the optimal strategy $\pi(s_y)$;

the optimal action $a_y$ is output;
Step 4, according to the optimal action a y Performing application;
step 4.1, first, the states S selected from the state set S except the training data set t Reformulating an application data set and then randomly extracting j from the application data set max Individual states s t And redefined as application state s β ,β=1,2,3…j max Application state s β =(U dc ,I) β I.e. application state s β Is a direct current voltage U dc And a set of states at an output current I;
step 4.2, the optimal action a output in the step 3 is processed y Substitution j max Individual application states s β In (3) different application states s are obtained β Down-output optimal application actions
Step 4.3, applying state s β =(U dc ,I) β Optimal application actionsRespectively substituting the power density optimization model, the power density optimization model and the EMI optimization model established in the step 1 to achieve the optimal efficiency of the system on the premise of meeting the EMI standard>Optimal power density of the system->Causing any state in the system state set S to be { (U) dc Maximizing efficiency, power density, and at the same time EMI meets the criteria.
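A compact sketch of the application stage of step 4 is given below; `policy` stands in for the trained online policy network and the entries of `models` are the same kind of hypothetical placeholders used in the earlier sketches.

```python
# Hedged sketch of step 4: apply the trained policy to application states s_beta = (Udc, I)_beta
# and evaluate the optimization models; 'policy' and 'models' are supplied by the caller.
def apply_policy(policy, application_states, models):
    """policy: (Udc, I) -> f_sw; models: dict with 'efficiency', 'power_density', 'emi_ok' callables."""
    results = []
    for (u_dc, i_out) in application_states:
        f_sw_opt = policy((u_dc, i_out))            # optimal application action for this state
        results.append({
            "state": (u_dc, i_out),
            "f_sw": f_sw_opt,
            "efficiency": models["efficiency"](f_sw_opt),
            "power_density": models["power_density"](f_sw_opt),
            "emi_ok": models["emi_ok"](u_dc, f_sw_opt),
        })
    return results
```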
2. The inverter multi-objective optimization method based on deep reinforcement learning according to claim 1, characterized in that in step 3, the offline learning using the deep-reinforcement-learning DDPG algorithm to obtain the optimal strategy $\pi(s_y)$ comprises the following specific steps:
step 3.1, initialize the first neural network parameters $\theta^{\mu}$, the second neural network parameters $\theta^{\mu'}$, the third neural network parameters $\theta^{Q}$ and the fourth neural network parameters $\theta^{Q'}$, and let $\theta^{\mu'} = \theta^{\mu}$, $\theta^{Q'} = \theta^{Q}$; initialize the capacity of the experience replay pool P as D; initialize the learning rate $\alpha_Q$ of the online evaluation network, the learning rate $\alpha_\mu$ of the online policy network and the moving-average update parameter τ, with 0 < $\alpha_Q$ < 1, 0 < $\alpha_\mu$ < 1, 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|$\theta^{\mu}$), where a is the action value output by the online policy network, a corresponds to an individual in the action set $A_0$, and a = $f_{sw}$; s is the state value input to the online policy network, s corresponds to an individual in the state set S, and s = ($U_{dc}$, I); μ is the strategy derived by the online policy network from the first neural network parameters $\theta^{\mu}$ and the input state value s;
step 3.2, the state $s_t$ of the system at time t is input to the online policy network to obtain the online policy network output $\mu(s_t|\theta^{\mu})$, and noise $\delta_t$ is added to obtain the finally output action $a_t$; the specific expression is $a_t = \mu(s_t|\theta^{\mu}) + \delta_t$;
step 3.3, based on the state $s_t$, the system executes the action $a_t$ and transitions to a new state $s_{t+1}$, obtaining at the same time the single-step reward value $r_t$ for executing the action $a_t$; $(s_t, a_t, r_t, s_{t+1})$ is called a state-transition sequence and is stored in the experience replay pool P, and at the next moment the system enters the state $s_{t+1}$;

steps 3.2 to 3.3 are executed cyclically; the number of state-transition sequences in the experience replay pool P is recorded as N; if N = D, go to step 3.4, otherwise return to step 3.2;

step 3.4, n state-transition sequences are randomly drawn from the experience replay pool P, with n < D; the n state-transition sequences are used as the mini-batch data for training the online policy network and the online evaluation network, and the k-th state-transition sequence in the mini-batch data is recorded as $(s_k, a_k, r_k, s_{k+1})$, k = 1, 2, 3 … n;
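For completeness, a minimal experience replay pool matching steps 3.3 and 3.4 could look like the following; the deque-based storage and uniform random sampling are implementation assumptions.

```python
# Hedged sketch of the experience replay pool P of steps 3.3-3.4.
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity):                  # capacity D
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):              # state-transition sequence (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((s, a, r, s_next))

    def full(self):                                # N == D: ready to start training
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, n):                           # mini-batch of n < D sequences
        return random.sample(list(self.buffer), n)
```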
step 3.5, based on the mini-batch data $(s_k, a_k, r_k, s_{k+1})$, k = 1, 2, 3 … n, obtained in step 3.4, the cumulative reward $y_k$ and the error function $L(\theta^{Q})$ are calculated; the specific expressions are as follows:

$y_k = r_k + Q'(s_{k+1}, \mu'(s_{k+1}|\theta^{\mu'})\,|\,\theta^{Q'})$

$L(\theta^{Q}) = \dfrac{1}{n}\sum_{k=1}^{n}\bigl(y_k - Q(s_k, a_k|\theta^{Q})\bigr)^{2}$

where $Q'(s_{k+1}, \mu'(s_{k+1}|\theta^{\mu'})\,|\,\theta^{Q'})$ is the score output by the target evaluation network, $\mu'(s_{k+1}|\theta^{\mu'})$ is the action value output by the target policy network, and $s_{k+1}$ is the state value input to the target evaluation network and the target policy network; $Q(s_k, a_k|\theta^{Q})$ is the score output by the online evaluation network, and $s_k$ and $a_k$ are the state value and action value input to the online evaluation network;
step 3.6, the online evaluation network updates $\theta^{Q}$ by minimizing the error function $L(\theta^{Q})$, the online policy network updates $\theta^{\mu}$ through the deterministic policy gradient $\nabla_{\theta^{\mu}}J$, and the target evaluation network and the target policy network update $\theta^{Q'}$ and $\theta^{\mu'}$ by the moving-average (soft update) method; the specific expressions are as follows:

$\nabla_{\theta^{\mu}} J = \dfrac{1}{n}\sum_{k=1}^{n} \left.\dfrac{\partial Q(s,a|\theta^{Q})}{\partial a}\right|_{s=s_k,\,a=\mu(s_k)} \cdot \left.\dfrac{\partial \mu(s|\theta^{\mu})}{\partial \theta^{\mu}}\right|_{s=s_k}$

$\theta^{Q} \leftarrow \theta^{Q} - \alpha_{Q}\,\dfrac{\partial L(\theta^{Q})}{\partial \theta^{Q}}, \qquad \theta^{\mu} \leftarrow \theta^{\mu} + \alpha_{\mu}\,\nabla_{\theta^{\mu}} J$

$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$

where $\partial$ denotes partial differentiation: $\nabla_{\theta^{\mu}} J$ is the derivative of the policy objective J with respect to $\theta^{\mu}$; $\left.\partial Q(s,a|\theta^{Q})/\partial a\right|_{s=s_k,\,a=\mu(s_k)}$ is the derivative of the score output by the online evaluation network with respect to the action value a when its inputs are s = $s_k$ and a = $\mu(s_k)$; $\left.\partial \mu(s|\theta^{\mu})/\partial \theta^{\mu}\right|_{s=s_k}$ is the derivative of the action value output by the online policy network with respect to $\theta^{\mu}$ when its input is s = $s_k$; $\partial L(\theta^{Q})/\partial \theta^{Q}$ is the derivative of the error function $L(\theta^{Q})$ with respect to $\theta^{Q}$; the left-hand sides of the update expressions are, respectively, the updated third, first, fourth and second neural network parameters;
step 3.7, each time steps 3.4 to 3.6 are completed, the training process of one step is completed; when step < $step_{max}$, steps 3.4 to 3.6 are repeated; when step = $step_{max}$, the training process of one round is completed and the next round starts again from step 3.2 to step 3.6; when m < M, steps 3.2 to 3.6 are executed repeatedly, and when m = M, the training process of the M rounds is completed and the learning process of the DDPG algorithm ends;
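The nesting of rounds and steps described in steps 3.2 to 3.7 can be summarized by the loop skeleton below; `env_reset`, `env_step`, `policy_action` and `ddpg_update` are hypothetical helpers in the spirit of the earlier sketches, not the claimed procedure.

```python
# Hedged skeleton of the training schedule of steps 3.2-3.7: M rounds of step_max steps each.
def train(env_reset, env_step, policy_action, ddpg_update, pool, M, step_max, batch_n):
    for m in range(M):                             # training rounds m = 1..M
        s = env_reset()                            # initial state s_t = (Udc, I)_t
        for step in range(step_max):               # steps within one round
            a = policy_action(s)                   # mu(s|theta_mu) plus exploration noise
            s_next, r = env_step(s, a)             # execute the action, observe the reward
            pool.store(s, a, r, s_next)            # fill the experience replay pool
            if pool.full():                        # update only once the pool holds D sequences
                ddpg_update(pool.sample(batch_n))
            s = s_next
```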
step 3.8, the training algorithm ends, yielding the optimal strategy $\pi(s_y) = a_y$; the average reward of one training round is recorded as $\bar{R}$;

in the M training rounds, the first neural network parameters $\theta^{\mu}$, the second neural network parameters $\theta^{\mu'}$, the third neural network parameters $\theta^{Q}$ and the fourth neural network parameters $\theta^{Q'}$ are updated in the direction that maximizes the average reward $\bar{R}$, yielding the optimal strategy $\pi(s_y)$.
CN202311003536.4A 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning Pending CN117057228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311003536.4A CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311003536.4A CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117057228A true CN117057228A (en) 2023-11-14

Family

ID=88665688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311003536.4A Pending CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117057228A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313560A (en) * 2023-11-30 2023-12-29 合肥工业大学 Multi-objective optimization method for IGBT module packaging based on machine learning
CN117313560B (en) * 2023-11-30 2024-02-09 合肥工业大学 Multi-objective optimization method for IGBT module packaging based on machine learning
CN117634320A (en) * 2024-01-24 2024-03-01 合肥工业大学 Multi-objective optimization design method for three-phase high-frequency transformer based on deep reinforcement learning
CN117634320B (en) * 2024-01-24 2024-04-09 合肥工业大学 Multi-objective optimization design method for three-phase high-frequency transformer based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination