CN117057228A - Inverter multi-objective optimization method based on deep reinforcement learning - Google Patents

Inverter multi-objective optimization method based on deep reinforcement learning

Info

Publication number
CN117057228A
CN117057228A (application CN202311003536.4A)
Authority
CN
China
Prior art keywords
neural network
output
network
inverter
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311003536.4A
Other languages
Chinese (zh)
Inventor
王佳宁
吴轶康
杨仁海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311003536.4A priority Critical patent/CN117057228A/en
Publication of CN117057228A publication Critical patent/CN117057228A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/06 Multi-objective optimisation, e.g. Pareto optimisation using simulated annealing [SA], ant colony algorithms or genetic algorithms [GA]

Abstract

The invention provides an inverter multi-objective optimization method based on deep reinforcement learning. Analytic formulas are used to establish an efficiency optimization model and a power density optimization model, and a neural network is used to establish an EMI common-mode noise optimization model; a state set, an action set and a normalized multi-objective reward function are then determined; offline learning is performed with the DDPG algorithm to obtain an optimal strategy, and the algorithm is then applied, so that under any state and any weight coefficients the system can optimize efficiency and power density while meeting the EMI standard according to the optimal strategy. Modeling the EMI common-mode noise with a neural network avoids a large number of circuit simulations and improves optimization efficiency; the DDPG algorithm can handle complex high-dimensional design variables, avoids the problems of strong parameter coupling and design failure in inverter design, and can quickly find an optimal scheme.

Description

Inverter multi-objective optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of power electronics, relates to multi-objective optimization of inverters, and provides an inverter multi-objective optimization method based on deep reinforcement learning.
Background
With the advance of the dual-carbon strategy, the utilization of renewable energy is increasingly important, and solar and wind power generation are important components of future clean-energy utilization. The inverter is the interface between wind and photovoltaic power generation systems and the power grid, and performs the core functions of electric energy conversion and control. The inverter is therefore one of the indispensable key links for guaranteeing the efficient, economical and stable operation of photovoltaic and wind power generation systems, and enabling the inverter to reach optimal efficiency and power density while meeting the EMI standard under any operating condition is of great significance to that operation.
Wide-bandgap power devices are widely applied in power electronic converters because of their excellent high-frequency, high-voltage and high-temperature characteristics. Increasing the switching frequency reduces the volume of the inverter and thereby greatly increases its power density, but it also increases the electromagnetic interference (Electromagnetic Interference, EMI) noise of the inverter. In addition, the reduction in inverter volume leads to a more compact component layout, which strengthens the parasitic-parameter and thermal coupling between inverter design objectives and reduces inverter efficiency and lifetime. Therefore, for the inverter system to achieve comprehensively excellent performance, multi-objective optimization design is of paramount importance, and many experts and scholars have proposed different solutions:
The invention patent CN112968474A discloses a multi-objective optimization method for a photovoltaic off-grid inverter system, which uses the genetic algorithm NSGA-III to perform multi-objective optimization on the system. This solution has the following drawback: because the NSGA-III algorithm is adopted, whenever the system state changes the complex and time-consuming optimization solving process must be carried out again, which consumes computing resources and cannot quickly give the action value after the state change; the optimization process is therefore limited and the application range is restricted.
Another invention patent discloses a multi-objective optimization method for a photovoltaic inverter based on the DDPG algorithm, which uses the deep reinforcement learning algorithm DDPG to perform multi-objective optimization on a photovoltaic off-grid inverter system and overcomes the limitation of genetic-algorithm optimization. However, this solution still has the following disadvantage: the optimization target models of the photovoltaic inverter system are built through analytic formulas, and a large number of circuit simulations are carried out during the tens of thousands of iterations of the multi-objective optimization, so the overall optimization efficiency is greatly reduced.
Disclosure of Invention
In the existing multi-objective optimization of inverters, the NSGA-III algorithm requires a complex and time-consuming training or solving process, cannot quickly give the action value after a state change and thus hardly meets practical application requirements, and optimization modeling with analytic formulas is inefficient. To address these problems in the prior art, the invention provides an inverter multi-objective optimization method based on deep reinforcement learning.
To achieve this aim, the invention provides an inverter multi-objective optimization method based on deep reinforcement learning, wherein the inverter comprises a direct-current voltage source 10, a three-phase three-level ANPC inverter circuit 20, a filter circuit 30 and a load 40;
the three-phase three-level ANPC inverter circuit 20 comprises two identical supporting capacitors and an inverter main circuit; the two supporting capacitors are denoted supporting capacitor Cap_1 and supporting capacitor Cap_2, Cap_1 and Cap_2 are connected in series between the DC positive bus P and the DC negative bus E of the DC voltage source 10, and the connection point of Cap_1 and Cap_2 is denoted the DC-bus midpoint O;
the inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm; each phase bridge arm contains 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit contains 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted S_ij and the 18 anti-parallel diodes are denoted D_ij, where i denotes the phase, i = a, b, c, and j denotes the serial number of the switching tube or diode, j = 1, 2, 3, 4, 5, 6; the A-phase, B-phase and C-phase bridge arms are connected in parallel between the DC positive bus P and the DC negative bus E; in each phase bridge arm, switching tubes S_i1, S_i2, S_i3 and S_i4 are connected in series in sequence: the input of S_i1 is connected to the DC positive bus P, the output of S_i1 is connected to the input of S_i2, the output of S_i2 is connected to the input of S_i3, the output of S_i3 is connected to the input of S_i4, and the output of S_i4 is connected to the DC negative bus E; the input of S_i5 is connected to the output of S_i1 and the output of S_i5 is connected to the DC-bus midpoint O; the input of S_i6 is connected to the DC-bus midpoint O and the output of S_i6 is connected to the output of S_i3; the connection point of S_i2 and S_i3 is denoted the inverter output point φ_i; switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with the same switching frequency, and switching tubes S_i2 and S_i3 are high-frequency switching tubes with the same switching frequency;
the filter circuit 30 comprises a three-phase filter inductance L_f and a three-phase filter capacitor C_0; one end of L_f is connected to the inverter output point φ_i and the other end is connected to the load 40, and C_0 is connected in parallel between the three-phase filter inductance L_f and the load 40;
the method comprises the following specific steps:
step 1, establishing an optimization target model;
The inverter is denoted as the system; the 18 switching tubes with anti-parallel diodes are decomposed into 18 switching tubes and 18 anti-parallel diodes, and the losses and volumes of the supporting capacitors Cap_1 and Cap_2 and the three-phase filter capacitor C_0 are assumed to be negligible;
step 1.1, establishing an efficiency optimization model;
Taking the efficiency η of the system as the objective, an efficiency optimization model is established with the expression:
η = (P_w − P_loss) / P_w
where P_loss is the total loss of the system, P_loss = P_T + P_L; P_T is the total loss of the 18 switching tubes and 18 anti-parallel diodes; P_L is the loss of the three-phase filter inductance L_f; and P_w is the rated input power of the system;
step 1.2, establishing a power density optimization model;
Taking the power density σ of the system as the objective, a power density optimization model is established with the expression:
σ = P_w / V
where P_w is the rated input power of the system and V is the system volume, V = V_T + 3V_L; V_T is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and V_L is the magnetic-core volume of a single-phase filter inductor of the three-phase filter inductance L_f;
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts the envelope curve of the EMI common mode noise spectrum by using an artificial neural network to represent the actual common mode noise level, and compares the predicted spectrum envelope curve with a noise amplitude curve in a standard to judge whether the standard is met;
Step 1.3.1, determining input variables and output variables of an artificial neural network;
the artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables: the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency f_1, frequency f_2, frequency f_3 and frequency f_4;
the neural network 2 has 4 input variables: the voltage value U_dc of the DC voltage source, the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequencies f_1, f_2, f_3 and f_4, denoted spectral amplitude M_f1, M_f2, M_f3 and M_f4;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data (f_swN, L_fN, L_CMN) and the neural network 1 simulated output values (f_1N, f_2N, f_3N, f_4N), where N is the group serial number, N = 1, 2, 3 ... K;
the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data (U_dcN, f_swN, L_fN, L_CMN) and the neural network 2 simulated output values (M_f1N, M_f2N, M_f3N, M_f4N), where N is the group serial number, N = 1, 2, 3 ... K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
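As an illustration of the two network structures described above, the following sketch builds the 3-8-4 and 4-11-4 fully connected networks; the layer sizes come from the description, while the hidden-layer activation (tanh) and the PyTorch framework are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Neural network 1: (f_sw, L_f, L_CM) -> 4 turning-point frequencies f1..f4
net1 = nn.Sequential(
    nn.Linear(3, 8),   # input layer -> hidden layer with 8 neurons
    nn.Tanh(),         # hidden activation (assumed; not specified in the patent)
    nn.Linear(8, 4),   # hidden layer -> 4 output frequencies
)

# Neural network 2: (U_dc, f_sw, L_f, L_CM) -> 4 spectral amplitudes M_f1..M_f4
net2 = nn.Sequential(
    nn.Linear(4, 11),  # input layer -> hidden layer with 11 neurons
    nn.Tanh(),
    nn.Linear(11, 4),  # hidden layer -> 4 output amplitudes
)

x1 = torch.tensor([[50e3, 90e-6, 120e-6]])            # example operating point
x2 = torch.tensor([[200.0, 50e3, 90e-6, 120e-6]])
print(net1(x1).shape, net2(x2).shape)                 # torch.Size([1, 4]) each
```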
step 1.3.4, grouping sample data;
Divide the sample data obtained in step 1.3.2 into a training subset, a verification subset and a test set, where the training subset contains K_1 groups of sample data, the verification subset contains K_2 groups of sample data, the test set contains K_3 groups of sample data, and K_1 + K_2 + K_3 = K;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
randomly extracting a group of input data from the training subset obtained in the step 1.3.4, and inputting the input data into the neural network 1 and the neural network 2 to obtain output corresponding to the input data; the parameters of the neural network 1 and the neural network 2 are updated by adopting an error back propagation gradient descent algorithm, and the updated neural network 1 and the updated neural network 2 are obtained;
Then input the K_2 groups of input data of the verification subset obtained in step 1.3.4 into the updated neural network 1 and neural network 2 respectively, obtaining the K_2 groups of outputs corresponding to the K_2 groups of input data, including the outputs of the neural network 1 (the predicted turning-point frequencies) and the outputs of the neural network 2 (the predicted spectral amplitudes);
Define the root mean square error δ1 between the neural network 1 outputs and the corresponding simulated output values on the verification subset, and the root mean square error δ2 between the neural network 2 outputs and the corresponding simulated output values on the verification subset;
Given a first target error e_1 and a second target error e_2, make the following judgment:
if δ1 < e_1 and δ2 < e_2, the neural network 1 model and the neural network 2 model are considered built and the method proceeds to step 1.3.6; otherwise, return to step 1.3.5;
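A minimal sketch of the training loop of step 1.3.5 is given below, assuming mean-squared-error training with error back propagation (as stated in the description) and an RMSE check on the verification subset; the optimizer, learning rate and data layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_until_valid(net, x_train, y_train, x_val, y_val, target_err, max_iter=20000):
    """Back-propagation training (step 1.3.5): update the network until the
    root-mean-square error on the verification subset drops below the target error."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)   # optimizer and lr are assumptions
    mse = nn.MSELoss()
    for _ in range(max_iter):
        idx = torch.randint(0, len(x_train), (1,))      # one randomly extracted training group
        loss = mse(net(x_train[idx]), y_train[idx])
        opt.zero_grad()
        loss.backward()                                  # error back-propagation
        opt.step()                                       # gradient-descent parameter update
        with torch.no_grad():
            rmse = torch.sqrt(mse(net(x_val), y_val))    # delta on the verification subset
        if rmse < target_err:                            # delta < e: the model is built
            break
    return net, float(rmse)
```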
Step 1.3.6, input the K_3 groups of input data of the test set obtained in step 1.3.4 into the neural network 1 model and the neural network 2 model constructed in step 1.3.5, obtaining the K_3 groups of outputs of the neural network 1 and the K_3 groups of outputs of the neural network 2 corresponding to the K_3 groups of input data, recorded as the neural network 1 predicted values and the neural network 2 predicted values respectively;
Step 1.3.7, randomly extract a group of actual values of the neural network 1 and a group of actual values of the neural network 2 from the test set;
Establish a plane coordinate system with frequency as the abscissa and spectral amplitude as the ordinate; draw the predicted envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the predicted values, and draw the actual envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the actual values;
Judge whether the predicted envelope and the actual envelope of the inverter common-mode conducted EMI spectrum match: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the procedure ends; if not, return to step 1.3.4;
Matching means that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum closely agree with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum;
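To illustrate how the predicted turning points can be turned into an envelope and checked against a limit, the sketch below linearly interpolates the four predicted points on a log-frequency axis and compares them with a standard noise-amplitude curve; the interpolation scheme, band edges and limit function are assumptions, not taken from the patent.

```python
import numpy as np

def envelope_dBuV(freqs_hz, f_pts, m_pts):
    """Piecewise-linear envelope (dBuV) through the four predicted turning points,
    interpolated on a log-frequency axis (assumption)."""
    return np.interp(np.log10(freqs_hz), np.log10(f_pts), m_pts)

def meets_standard(f_pts, m_pts, limit_fn, f_lo=150e3, f_hi=30e6, n=500):
    """True when the predicted envelope stays below the standard limit
    over the conducted-EMI band (band edges are illustrative)."""
    f = np.logspace(np.log10(f_lo), np.log10(f_hi), n)
    return bool(np.all(envelope_dBuV(f, f_pts, m_pts) < limit_fn(f)))

# Example: predicted turning points and a flat 79 dBuV limit (illustrative values only)
f_pred = np.array([2e5, 8e5, 5e6, 3e7])
m_pred = np.array([72.0, 65.0, 58.0, 50.0])
print(meets_standard(f_pred, m_pred, lambda f: np.full_like(f, 79.0)))
```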
Step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine a state set S, an action set A_0 and a reward function R;
Step 2.1, determine the state set S and the action set A_0;
Record the current time t of the system, t = 1, 2, 3 ... T, where T is the time of the system termination state, and record the state of the system at the current time t as state s_t, s_t = (U_dc, I)_t, where U_dc is the voltage value of the DC voltage source 10, denoted the DC voltage U_dc, and I is the effective value of the system output current, denoted the output current I;
The state set S is the set of the T states s_t, S = {s_1, s_2, ... s_t, ... s_T}, and S ∈ {(U_dc, I)};
The action taken by the system at time t is denoted action a_t, a_t = (f_sw)_t, where f_sw is the switching frequency of the high-frequency switching tubes, denoted the high-frequency switching frequency f_sw;
The action set A_0 is the set of the T actions a_t, A_0 = {a_1, a_2, ... a_t, ... a_T}, with f_sw_min ≤ f_sw ≤ f_sw_max, where f_sw_min is the lower limit of the high-frequency switching frequency f_sw and f_sw_max is its upper limit;
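The sketch below shows one way to represent such a state set and bounded action set in code; the numeric ranges are taken from the embodiment described later (U_dc 600 to 1200 V, I 100 to 120 A, f_sw 10 to 80 kHz), while the uniform sampling itself is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                        # number of states per episode (embodiment value)

# State set S: pairs (U_dc, I) sampled from the operating ranges
U_dc = rng.uniform(600.0, 1200.0, size=T)     # DC voltage in volts
I_out = rng.uniform(100.0, 120.0, size=T)     # output-current RMS in amperes
states = np.stack([U_dc, I_out], axis=1)      # s_t = (U_dc, I)_t

# Action set A_0: high-frequency switching frequency bounded to [f_sw_min, f_sw_max]
F_SW_MIN, F_SW_MAX = 10_000.0, 80_000.0

def clip_action(f_sw):
    """Keep an action a_t = f_sw inside the admissible range."""
    return float(np.clip(f_sw, F_SW_MIN, F_SW_MAX))

print(states[0], clip_action(95_000.0))       # e.g. [U_dc, I] and 80000.0
```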
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
the values of the efficiency optimization model and the power density optimization model of the system are not in the same magnitude, and normalization processing is carried out to ensure that the values of the two optimization models are between 0 and 1;
In the efficiency optimization model the total system loss P_loss is the optimization target f_1, and in the power density optimization model the system volume V is the optimization target f_2;
Introduce the optimization targets f_α, α = 1, 2, and normalize each optimization target f_α to obtain the normalized optimization target f_α*:
f_α* = (f_α − f_α,min) / (f_α,max − f_α,min)
where f_α,min is the minimum value of the optimization target and f_α,max is its maximum value;
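A minimal sketch of this min-max normalization is shown below; the loss and volume bounds are illustrative placeholders, not values from the patent.

```python
def normalize(value, v_min, v_max):
    """Min-max normalization of an optimization target to the range [0, 1]."""
    return (value - v_min) / (v_max - v_min)

# Illustrative bounds only (f_1 = total loss P_loss, f_2 = system volume V)
P_LOSS_MIN, P_LOSS_MAX = 500.0, 3000.0        # watts, placeholder range
V_MIN, V_MAX = 1.0e-3, 6.0e-3                 # cubic metres, placeholder range

f1_norm = normalize(1200.0, P_LOSS_MIN, P_LOSS_MAX)   # normalized loss
f2_norm = normalize(2.5e-3, V_MIN, V_MAX)             # normalized volume
print(round(f1_norm, 3), round(f2_norm, 3))
```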
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
The reward function R represents the weighted sum of the reward values generated by all actions of the system from the current state to the termination state, and can be written as
R = Σ_{k=t}^{T} γ^(k−t) · r_k
where r_t is the single-step reward value obtained when the system takes action a_t in state s_t at time t, and γ is the discount factor, which represents the degree to which the passage of time affects the reward value;
When the predicted envelope of the inverter common-mode conducted EMI spectrum lies entirely below the noise amplitude curve of the EMI standard, the single-step reward r_t rewards the weighted, normalized optimization targets together with the reward C for meeting the standard;
when part of the predicted envelope of the inverter common-mode conducted EMI spectrum lies above the noise amplitude curve of the EMI standard, the single-step reward r_t is penalized through the penalty coefficient;
where w_α, α = 1, 2, are the weight coefficients, 0 < w_α < 1 and w_1 + w_2 = 1, and C is the reward for EMI meeting the standard;
step 3, offline learning of a DDPG algorithm;
Arbitrarily extract D states s_t from the state set S to form a training data set for offline learning, D = 4T/5; according to the state set S, the action set A_0 and the reward function R obtained in step 2, perform offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y);
The DDPG algorithm comprises 4 neural networks: the neural network parameters of the online policy network are recorded as the first neural network parameters θ^μ, the neural network parameters of the target policy network as the second neural network parameters θ^μ′, the neural network parameters of the online evaluation network as the third neural network parameters θ^Q, and the neural network parameters of the target evaluation network as the fourth neural network parameters θ^Q′;
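For orientation, the sketch below sets up the four DDPG networks (online and target actor, online and target critic) for the two-dimensional state (U_dc, I) and one-dimensional action f_sw; the hidden-layer sizes and activations are assumptions.

```python
import copy
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])        # drop the final activation

# Online policy (actor) mu(s | theta_mu): state (U_dc, I) -> action f_sw
actor = mlp([2, 64, 64, 1])
# Online evaluation (critic) Q(s, a | theta_Q): (state, action) -> scalar score
critic = mlp([3, 64, 64, 1])
# Target networks start as copies, so theta_mu' = theta_mu and theta_Q' = theta_Q
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
```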
Given the training step number step and the maximum step number step_max, and given the training round number m and the maximum training round number M, step = 1, 2, 3 ... step_max and m = 1, 2, 3 ... M, i.e. each training round contains step_max training steps and M training rounds are carried out in total;
Define the average value of the reward function R in each training round as the average reward R_avg; during each training round m, the 4 neural networks of the DDPG algorithm are all updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y);
The expression of the optimal strategy π(s_y) is:
π(s_y) = a_y
where s_y is the state value input to the online policy network corresponding to the optimal strategy, s_y = (U_dc, I)_y, and (U_dc, I)_y is the DC voltage U_dc and output current I in the state set S corresponding to the optimal strategy; a_y is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action a_y, a_y = (f_sw)_y, and (f_sw)_y is the high-frequency switching frequency f_sw in the action set A_0 corresponding to the optimal strategy π(s_y);
Output the optimal action a_y.
Step 4, apply the optimal action a_y;
Step 4.1, first reformulate the states s_t in the state set S that were not selected for the training data set into an application data set, then randomly extract j_max states s_t from the application data set and redefine them as application states s_β, β = 1, 2, 3 ... j_max, with application state s_β = (U_dc, I)_β, i.e. the application state s_β is a set of states given by the DC voltage U_dc and the output current I;
Step 4.2, substitute the optimal action a_y output in step 3 into the j_max application states s_β to obtain the optimal application action output under each application state s_β, β = 1, 2, 3 ... j_max;
Step 4.3, substitute the application states s_β = (U_dc, I)_β and the corresponding optimal application actions into the efficiency optimization model, the power density optimization model and the EMI optimization model established in step 1, so that on the premise of meeting the EMI standard the optimal system efficiency and optimal system power density are achieved, β = 1, 2, 3 ... j_max, and for any state {(U_dc, I)} in the system state set S the efficiency and power density are maximized while the EMI meets the standard.
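The application stage can be pictured as the short sketch below: the trained online policy network maps each application state to a switching frequency, which is then evaluated with the efficiency, power-density and EMI models; the three model callables are placeholders standing in for the models of step 1.

```python
import torch

def apply_policy(actor, app_states, efficiency_model, density_model, emi_model):
    """Evaluate the learned strategy on application states s_beta = (U_dc, I)_beta.
    The three *_model callables are placeholders for the step-1 models."""
    results = []
    with torch.no_grad():
        for u_dc, i_out in app_states:
            s = torch.tensor([[u_dc, i_out]], dtype=torch.float32)
            f_sw = float(actor(s))                     # optimal application action
            if emi_model(u_dc, f_sw):                  # EMI standard must be met
                results.append((u_dc, i_out, f_sw,
                                efficiency_model(u_dc, i_out, f_sw),
                                density_model(f_sw)))
    return results
```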
Further, in step 3, the specific steps of performing offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y) are as follows:
Step 3.1, initialize the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′, and let θ^μ′ = θ^μ and θ^Q′ = θ^Q; initialize the capacity of the experience replay pool P as D; initialize the learning rate α_Q of the online evaluation network, the learning rate α_μ of the online policy network and the moving-average update parameter τ, with 0 < α_Q < 1, 0 < α_μ < 1 and 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|θ^μ), where a is the action value output by the online policy network, a corresponds to an individual in the action set A_0 and a = f_sw; s is the state value input to the online policy network, s corresponds to an individual in the state set S and s = (U_dc, I); μ is the strategy obtained by the online policy network from the first neural network parameters θ^μ and the input state value s;
Step 3.2, input the state s_t of the system at time t into the online policy network to obtain the online policy network output μ(s_t|θ^μ), and add noise δ_t to obtain the finally output action a_t:
a_t = μ(s_t|θ^μ) + δ_t
Step 3.3, the system executes action a_t in state s_t, transitions to the new state s_(t+1) and at the same time obtains the single-step reward value r_t of executing action a_t; (s_t, a_t, r_t, s_(t+1)) is called a state transition sequence and is stored in the experience replay pool P, and at the next moment t+1 the system enters state s_(t+1);
Circularly executing the steps 3.2 to 3.3, recording the number of state transition sequences in the experience playback pool P as N, entering the step 3.4 if N=D, otherwise returning to the step 3.2;
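A compact sketch of the interaction loop of steps 3.2 and 3.3, including the exploration noise and the experience replay pool, is given below; the environment step function, the Gaussian noise and the capacity value are assumptions.

```python
import random
import torch

replay_pool = []          # experience replay pool P
D = 16                    # pool capacity (D = 4T/5 in the description)

def interact(actor, env_step, states, noise_scale=1000.0):
    """Fill the replay pool with state transition sequences (s_t, a_t, r_t, s_t+1)."""
    t = 0
    while len(replay_pool) < D:
        s_t = states[t % len(states)]
        with torch.no_grad():
            a_t = float(actor(torch.tensor([s_t], dtype=torch.float32)))
        a_t += random.gauss(0.0, noise_scale)          # exploration noise delta_t (assumed Gaussian)
        r_t, s_next = env_step(s_t, a_t)               # single-step reward and new state
        replay_pool.append((s_t, a_t, r_t, s_next))
        t += 1
```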
Step 3.4, randomly extract n state transition sequences from the experience replay pool P, where n < D, and take these n state transition sequences as the mini-batch data for training the online policy network and the online evaluation network; the kth state transition sequence in the mini-batch is recorded as (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n;
Step 3.5, from the mini-batch data (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n, obtained in step 3.4, calculate the target cumulative reward y_k and the error function L(θ^Q):
y_k = r_k + γ·Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′)
L(θ^Q) = (1/n)·Σ_k (y_k − Q(s_k, a_k|θ^Q))²
where Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′) is the scoring value output by the target evaluation network, μ′(s_(k+1)|θ^μ′) is the action value output by the target policy network, and s_(k+1) is the state value input to the target evaluation network and the target policy network; Q(s_k, a_k|θ^Q) is the scoring value output by the online evaluation network, and s_k and a_k are the state value and action value input to the online evaluation network;
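The sketch below computes y_k and L(θ^Q) for a mini-batch using the networks defined in the earlier sketch; the discount factor and tensor shapes are illustrative, following standard DDPG.

```python
import torch
import torch.nn as nn

def critic_targets_and_loss(batch, actor_target, critic, critic_target, gamma=0.9):
    """y_k = r_k + gamma * Q'(s_{k+1}, mu'(s_{k+1})) and the mean-squared error L(theta_Q)."""
    s, a, r, s_next = batch                      # tensors of shape [n, 2], [n, 1], [n, 1], [n, 2]
    with torch.no_grad():
        a_next = actor_target(s_next)            # mu'(s_{k+1} | theta_mu')
        q_next = critic_target(torch.cat([s_next, a_next], dim=1))
        y = r + gamma * q_next                   # target cumulative reward y_k
    q = critic(torch.cat([s, a], dim=1))         # Q(s_k, a_k | theta_Q)
    return y, nn.functional.mse_loss(q, y)       # L(theta_Q)
```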
Step 3.6, the online evaluation network updates θ^Q by minimizing the error function L(θ^Q); the online policy network updates θ^μ through the deterministic policy gradient ∇_{θ^μ}J; and the target evaluation network and the target policy network update θ^Q′ and θ^μ′ by the moving-average method:
∇_{θ^μ}J ≈ (1/n)·Σ_k ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k}
θ^Q ← θ^Q − α_Q·∇_{θ^Q} L(θ^Q)
θ^μ ← θ^μ + α_μ·∇_{θ^μ}J
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where ∇ denotes the partial derivative: ∇_{θ^μ}J is the derivative of the policy objective J with respect to θ^μ; ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} is the derivative of the scoring value output by the online evaluation network with respect to the action value a when its input is s = s_k, a = μ(s_k); ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k} is the derivative of the action value output by the online policy network with respect to θ^μ when its input is s = s_k; ∇_{θ^Q} L(θ^Q) is the derivative of the error function L(θ^Q) with respect to θ^Q; and the left-hand sides of the last four expressions are the updated third, first, fourth and second neural network parameters respectively;
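In code, the three update rules of step 3.6 can be sketched as follows, reusing the networks and the critic_targets_and_loss helper from the earlier sketches; the optimizers and their hyperparameters are assumptions.

```python
import torch

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.9, tau=0.01):
    """One DDPG update: critic by minimizing L(theta_Q), actor by the
    deterministic policy gradient, targets by the moving-average (soft) update."""
    s, a, r, s_next = batch
    # Critic update: minimize L(theta_Q)
    _, critic_loss = critic_targets_and_loss(batch, actor_target, critic,
                                             critic_target, gamma)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor update: ascend Q(s, mu(s)) (equivalently minimize its negative)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft target updates: theta' <- tau*theta + (1 - tau)*theta'
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```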
Step 3.7, each completion of steps 3.4 to 3.6 finishes the training process of one step; when step < step_max, repeat steps 3.4 to 3.6; when step = step_max, the training process of one round is finished, and when m < M the next round restarts from step 3.2 to step 3.6; when m = M, the training process of all M rounds is finished;
Step 3.8, the training algorithm ends and the optimal strategy π(s_y) = a_y is output; the average reward of a training round is recorded as R_avg;
During the M training rounds, the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′ are updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y).
Compared with the prior art, the invention has the beneficial effects that:
(1) In the multi-objective optimization of the inverter, the invention uses a neural network to model the EMI common-mode noise of the inverter; the neural network model of the EMI common-mode noise is built from a small amount of simulation data, which greatly improves the optimization efficiency, and the envelope of the EMI common-mode noise spectrum can be obtained quickly even when the operating condition changes.
(2) The invention uses the deep reinforcement learning algorithm DDPG to perform multi-objective optimization on the inverter, which can handle complex high-dimensional design variables, avoids design failure of the inverter, finds an optimal scheme that satisfies the optimization objectives, and fully improves the performance of the inverter.
Drawings
FIG. 1 is a topology of an inverter according to the present invention;
FIG. 2 is a block diagram of an inverter multi-objective optimization method in accordance with the present invention;
FIG. 3 is a flow chart of an inverter multi-objective optimization method according to the present invention;
FIG. 4 is a flow chart of the inverter of the present invention using a neural network to optimally model EMI common mode noise;
FIG. 5 is a block diagram of two neural networks of an EMI optimization model in the inverter multi-objective optimization method of the present invention;
FIG. 6 is a comparison diagram of an optimized modeling of EMI common mode noise in an embodiment of the present invention;
FIG. 7 is a graph showing the convergence effect of average rewards in an embodiment of the invention;
FIG. 8 is a training effect diagram of motion variables in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a topology diagram of a photovoltaic inverter in an embodiment of the present invention. As can be seen from fig. 1, the inverter comprises a dc voltage source 10, a three-phase three-level ANPC inverter circuit 20, a filter circuit 30 and a load 40;
the three-phase three-level ANPC inverter circuit 20 comprises two identical supporting capacitors and an inverter main circuit; the two supporting capacitors are denoted supporting capacitor Cap_1 and supporting capacitor Cap_2, Cap_1 and Cap_2 are connected in series between the DC positive bus P and the DC negative bus E of the DC voltage source 10, and the connection point of Cap_1 and Cap_2 is denoted the DC-bus midpoint O;
the inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm; each phase bridge arm contains 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit contains 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted S_ij and the 18 anti-parallel diodes are denoted D_ij, where i denotes the phase, i = a, b, c, and j denotes the serial number of the switching tube or diode, j = 1, 2, 3, 4, 5, 6; the A-phase, B-phase and C-phase bridge arms are connected in parallel between the DC positive bus P and the DC negative bus E; in each phase bridge arm, switching tubes S_i1, S_i2, S_i3 and S_i4 are connected in series in sequence: the input of S_i1 is connected to the DC positive bus P, the output of S_i1 is connected to the input of S_i2, the output of S_i2 is connected to the input of S_i3, the output of S_i3 is connected to the input of S_i4, and the output of S_i4 is connected to the DC negative bus E; the input of S_i5 is connected to the output of S_i1 and the output of S_i5 is connected to the DC-bus midpoint O; the input of S_i6 is connected to the DC-bus midpoint O and the output of S_i6 is connected to the output of S_i3; the connection point of S_i2 and S_i3 is denoted the inverter output point φ_i; switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with the same switching frequency, and switching tubes S_i2 and S_i3 are high-frequency switching tubes with the same switching frequency;
the filter circuit 30 comprises a three-phase filter inductance L_f and a three-phase filter capacitor C_0; one end of L_f is connected to the inverter output point φ_i and the other end is connected to the load 40, and C_0 is connected in parallel between the three-phase filter inductance L_f and the load 40;
In the present embodiment, the switching tubes S_i1, S_i4, S_i5 and S_i6 are power-frequency switching tubes with a switching frequency of 50 Hz, the three-phase filter capacitor C_0 = 3 μF, and Cap_1 = Cap_2 = 20 μF.
Fig. 2 is a block diagram of the multi-objective optimization method of the photovoltaic inverter according to the present invention, and fig. 3 is a flowchart of the method. As can be seen from fig. 2 and fig. 3, the inverter multi-objective optimization method performs multi-objective optimization on the inverter based on deep reinforcement learning, and specifically comprises the following steps:
step 1, establishing an optimization target model;
The inverter is denoted as the system; the 18 switching tubes with anti-parallel diodes are decomposed into 18 switching tubes and 18 anti-parallel diodes, and the losses and volumes of the supporting capacitors Cap_1 and Cap_2 and the three-phase filter capacitor C_0 are assumed to be negligible;
step 1.1, establishing an efficiency optimization model
Taking the efficiency η of the system as the objective, an efficiency optimization model is established with the expression:
η = (P_w − P_loss) / P_w
where P_loss is the total loss of the system, P_loss = P_T + P_L; P_T is the total loss of the 18 switching tubes and 18 anti-parallel diodes; P_L is the loss of the three-phase filter inductance L_f; and P_w is the rated input power of the system;
step 1.2, establishing a power density optimization model;
Taking the power density σ of the system as the objective, a power density optimization model is established with the expression:
σ = P_w / V
where P_w is the rated input power of the system and V is the system volume, V = V_T + 3V_L; V_T is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and V_L is the magnetic-core volume of a single-phase filter inductor of the three-phase filter inductance L_f;
In the present embodiment, the rated input power of the system is P_w = 140×10^3 W and V_T = 3.98×10^-4 m^3.
Fig. 4 is a flowchart of the inverter of the present invention using a neural network to optimally model EMI common mode noise. As can be seen from fig. 4, the EMI optimization model is established by using two neural networks, and the specific steps are as follows:
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts an envelope of the EMI common mode noise spectrum with an artificial neural network to represent an actual common mode noise level, and compares the predicted spectrum envelope with a noise magnitude curve in the standard to determine whether the standard is satisfied.
Step 1.3.1, determining input variables and output variables of an artificial neural network;
The artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables: the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency f_1, frequency f_2, frequency f_3 and frequency f_4;
the neural network 2 has 4 input variables: the voltage value U_dc of the DC voltage source, the high-frequency switching frequency f_sw, the filter inductance L_f and the common-mode inductance L_CM; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequencies f_1, f_2, f_3 and f_4, denoted spectral amplitude M_f1, M_f2, M_f3 and M_f4;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data (f_swN, L_fN, L_CMN) and the neural network 1 simulated output values (f_1N, f_2N, f_3N, f_4N), where N is the group serial number, N = 1, 2, 3 ... K;
the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data (U_dcN, f_swN, L_fN, L_CMN) and the neural network 2 simulated output values (M_f1N, M_f2N, M_f3N, M_f4N), where N is the group serial number, N = 1, 2, 3 ... K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
fig. 5 shows a network configuration diagram of the neural network 1 model and the neural network 2 of the present embodiment.
Step 1.3.4, grouping sample data;
Divide the sample data obtained in step 1.3.2 into a training subset, a verification subset and a test set, where the training subset contains K_1 groups of sample data, the verification subset contains K_2 groups of sample data, the test set contains K_3 groups of sample data, and K_1 + K_2 + K_3 = K;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
randomly extracting a group of input data from the training subset obtained in the step 1.3.4, and inputting the input data into the neural network 1 and the neural network 2 to obtain output corresponding to the input data; the parameters of the neural network 1 and the neural network 2 are updated by adopting an error back propagation gradient descent algorithm, and the updated neural network 1 and the updated neural network 2 are obtained;
Then input the K_2 groups of input data of the verification subset obtained in step 1.3.4 into the updated neural network 1 and neural network 2 respectively, obtaining the K_2 groups of outputs corresponding to the K_2 groups of input data, including the outputs of the neural network 1 (the predicted turning-point frequencies) and the outputs of the neural network 2 (the predicted spectral amplitudes), for the group serial numbers N_2 = K_1+1, K_1+2, ... K_1+K_2;
Define the root mean square error δ1 between the neural network 1 outputs and the corresponding simulated output values on the verification subset, and the root mean square error δ2 between the neural network 2 outputs and the corresponding simulated output values on the verification subset;
Given a first target error e_1 and a second target error e_2, make the following judgment:
if δ1 < e_1 and δ2 < e_2, the neural network 1 model and the neural network 2 model are considered built and the method proceeds to step 1.3.6; otherwise, return to step 1.3.5;
Step 1.3.6, input the K_3 groups of input data of the test set obtained in step 1.3.4 into the neural network 1 model and the neural network 2 model constructed in step 1.3.5, obtaining the K_3 groups of outputs of the neural network 1 and the K_3 groups of outputs of the neural network 2 corresponding to the K_3 groups of input data, recorded as the neural network 1 predicted values and the neural network 2 predicted values respectively, for the group serial numbers N_3 = K_1+K_2+1, K_1+K_2+2, ... K;
Step 1.3.7, randomly extract a group of actual values of the neural network 1 and a group of actual values of the neural network 2 from the test set;
Establish a plane coordinate system with frequency as the abscissa and spectral amplitude as the ordinate; draw the predicted envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the predicted values, and draw the actual envelope of the inverter common-mode conducted EMI spectrum on this coordinate system from the actual values;
Judge whether the predicted envelope and the actual envelope of the inverter common-mode conducted EMI spectrum match: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the procedure ends; if not, return to step 1.3.4;
Matching means that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum closely agree with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum;
In the present embodiment, U_dc = 200 V, the high-frequency switching frequency f_sw = 50 kHz, the filter inductance L_f = 90 μH, and the common-mode inductance L_CM = 120 μH.
FIG. 6 is a comparison diagram of the optimized modeling of EMI common mode noise in an embodiment of the present invention; it shows that the four turning points on the predicted envelope of the inverter common-mode conducted EMI spectrum output by the neural networks are basically consistent with the four turning points on the actual envelope of the inverter common-mode conducted EMI spectrum.
Step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine a state set S, an action set A_0 and a reward function R;
Step 2.1, determine the state set S and the action set A_0;
Record the current time t of the system, t = 1, 2, 3 ... T, where T is the time of the system termination state, and record the state of the system at the current time t as state s_t, s_t = (U_dc, I)_t, where U_dc is the voltage value of the DC voltage source 10, denoted the DC voltage U_dc, and I is the effective value of the system output current, denoted the output current I;
The state set S is the set of the T states s_t, S = {s_1, s_2, ... s_t, ... s_T}, and S ∈ {(U_dc, I)};
The action taken by the system at time t is denoted action a_t, a_t = (f_sw)_t, where f_sw is the switching frequency of the high-frequency switching tubes, denoted the high-frequency switching frequency f_sw;
The action set A_0 is the set of the T actions a_t, A_0 = {a_1, a_2, ... a_t, ... a_T}, with f_sw_min ≤ f_sw ≤ f_sw_max, where f_sw_min is the lower limit of the high-frequency switching frequency f_sw and f_sw_max is its upper limit;
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
the values of the efficiency optimization model and the power density optimization model of the system are not in the same magnitude, and normalization processing is carried out to ensure that the values of the two optimization models are between 0 and 1;
In the efficiency optimization model the total system loss P_loss is the optimization target f_1, and in the power density optimization model the system volume V is the optimization target f_2;
Introduce the optimization targets f_α, α = 1, 2, and normalize each optimization target f_α to obtain the normalized optimization target f_α*:
f_α* = (f_α − f_α,min) / (f_α,max − f_α,min)
where f_α,min is the minimum value of the optimization target and f_α,max is its maximum value;
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
The reward function R represents the weighted sum of the reward values generated by all actions of the system from the current state to the termination state, and can be written as
R = Σ_{k=t}^{T} γ^(k−t) · r_k
where r_t is the single-step reward value obtained when the system takes action a_t in state s_t at time t, and γ is the discount factor, which represents the degree to which the passage of time affects the reward value.
When the predicted envelope of the inverter common-mode conducted EMI spectrum lies entirely below the noise amplitude curve of the EMI standard, the single-step reward r_t rewards the weighted, normalized optimization targets together with the reward C for meeting the standard;
when part of the predicted envelope of the inverter common-mode conducted EMI spectrum lies above the noise amplitude curve of the EMI standard, the single-step reward r_t is penalized through the penalty coefficient;
where w_α, α = 1, 2, are the weight coefficients, 0 < w_α < 1 and w_1 + w_2 = 1, and C is the reward for EMI meeting the standard;
In the present embodiment, the value range of U_dc is 600-1200 V, the value range of I is 100-120 A, f_sw_min = 10000 Hz, f_sw_max = 80000 Hz, T = 20, w_1 = w_2 = 0.5, C = 500 and γ = 0.9.
Step 3, offline learning of a DDPG algorithm;
Arbitrarily extract D states s_t from the state set S to form a training data set for offline learning, D = 4T/5; according to the state set S, the action set A_0 and the reward function R obtained in step 2, perform offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y);
The DDPG algorithm comprises 4 neural networks: the neural network parameters of the online policy network are recorded as the first neural network parameters θ^μ, the neural network parameters of the target policy network as the second neural network parameters θ^μ′, the neural network parameters of the online evaluation network as the third neural network parameters θ^Q, and the neural network parameters of the target evaluation network as the fourth neural network parameters θ^Q′;
Given the training step number step and the maximum step number step_max, and given the training round number m and the maximum training round number M, step = 1, 2, 3 ... step_max and m = 1, 2, 3 ... M, i.e. each training round contains step_max training steps and M training rounds are carried out in total;
In the present embodiment, step_max = 100 and M = 2500.
Define the average value of the reward function R in each training round as the average reward R_avg; during each training round m, the 4 neural networks of the DDPG algorithm are all updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y);
The expression of the optimal strategy π(s_y) is:
π(s_y) = a_y
where s_y is the state value input to the online policy network corresponding to the optimal strategy, s_y = (U_dc, I)_y, and (U_dc, I)_y is the DC voltage U_dc and output current I in the state set S corresponding to the optimal strategy; a_y is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action a_y, a_y = (f_sw)_y, and (f_sw)_y is the high-frequency switching frequency f_sw in the action set A_0 corresponding to the optimal strategy π(s_y);
Output the optimal action a_y.
Step 4, according to the optimal action a y Performing application;
step 4.1, first, the states S selected from the state set S except the training data set t Reformulating an application data set and then randomly extracting j from the application data set max Individual states s t And redefined as application state s β ,β=1,2,3...j max Application state s β =(U dc ,I) β I.e. application state s β Is a direct current voltage U dc And a set of states at an output current I;
step 4.2, the optimal action a output in the step 3 is processed y Substitution j max Individual application states s β In (3) different application states s are obtained β Down-output optimal application actionsβ=1,2,3...j max
Step 4.3, applying state s β =(U dc ,I) β Optimal application actionsRespectively substituting the power density optimization model, the power density optimization model and the EMI optimization model established in the step 1 to achieve the optimal efficiency of the system on the premise of meeting the EMI standard >Optimal power density of the system->β=1,2,3...j max Any state { (U) in the system state set S is caused to be dc Maximizing efficiency, power density, and at the same time EMI meets the criteria.
In this embodiment, the specific steps of step 3, performing offline learning with the deep reinforcement learning DDPG algorithm to obtain the optimal strategy π(s_y), are as follows:
Step 3.1, initialize the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′, and let θ^μ′ = θ^μ and θ^Q′ = θ^Q; initialize the capacity of the experience replay pool P as D; initialize the learning rate α_Q of the online evaluation network, the learning rate α_μ of the online policy network and the moving-average update parameter τ, with 0 < α_Q < 1, 0 < α_μ < 1 and 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|θ^μ), where a is the action value output by the online policy network, a corresponds to an individual in the action set A_0 and a = f_sw; s is the state value input to the online policy network, s corresponds to an individual in the state set S and s = (U_dc, I); μ is the strategy obtained by the online policy network from the first neural network parameters θ^μ and the input state value s;
Step 3.2, input the state s_t of the system at time t into the online policy network to obtain the online policy network output μ(s_t|θ^μ), and add noise δ_t to obtain the finally output action a_t:
a_t = μ(s_t|θ^μ) + δ_t
In this embodiment, α_Q = 0.002, α_μ = 0.001, τ = 0.01, and the noise δ_t = 0.9995^m × 1000.
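The exploration noise of the embodiment therefore decays geometrically with the training round; a small sketch, assuming the noise is added directly to the switching-frequency action:

```python
def exploration_noise(m, base=1000.0, decay=0.9995):
    """Embodiment noise schedule delta_t = 0.9995**m * 1000 (m = training round number)."""
    return (decay ** m) * base

print(exploration_noise(0), round(exploration_noise(1000), 1), round(exploration_noise(2500), 1))
# 1000.0, ~606.5, ~286.4
```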
Step 3.3, the system executes action a_t in state s_t, transitions to the new state s_(t+1) and at the same time obtains the single-step reward value r_t of executing action a_t; (s_t, a_t, r_t, s_(t+1)) is called a state transition sequence and is stored in the experience replay pool P, and at the next moment t+1 the system enters state s_(t+1);
Circularly executing the steps 3.2 to 3.3, recording the number of state transition sequences in the experience playback pool P as N, entering the step 3.4 if N=D, otherwise returning to the step 3.2;
Step 3.4, randomly extract n state transition sequences from the experience replay pool P, where n < D, and take these n state transition sequences as the mini-batch data for training the online policy network and the online evaluation network; the kth state transition sequence in the mini-batch is recorded as (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n;
Step 3.5, from the mini-batch data (s_k, a_k, r_k, s_(k+1)), k = 1, 2, 3 ... n, obtained in step 3.4, calculate the target cumulative reward y_k and the error function L(θ^Q):
y_k = r_k + γ·Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′)
L(θ^Q) = (1/n)·Σ_k (y_k − Q(s_k, a_k|θ^Q))²
where Q′(s_(k+1), μ′(s_(k+1)|θ^μ′)|θ^Q′) is the scoring value output by the target evaluation network, μ′(s_(k+1)|θ^μ′) is the action value output by the target policy network, and s_(k+1) is the state value input to the target evaluation network and the target policy network; Q(s_k, a_k|θ^Q) is the scoring value output by the online evaluation network, and s_k and a_k are the state value and action value input to the online evaluation network;
Step 3.6, the online evaluation network updates θ^Q by minimizing the error function L(θ^Q); the online policy network updates θ^μ through the deterministic policy gradient ∇_{θ^μ}J; and the target evaluation network and the target policy network update θ^Q′ and θ^μ′ by the moving-average method:
∇_{θ^μ}J ≈ (1/n)·Σ_k ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k}
θ^Q ← θ^Q − α_Q·∇_{θ^Q} L(θ^Q)
θ^μ ← θ^μ + α_μ·∇_{θ^μ}J
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where ∇ denotes the partial derivative: ∇_{θ^μ}J is the derivative of the policy objective J with respect to θ^μ; ∇_a Q(s, a|θ^Q)|_{s=s_k, a=μ(s_k)} is the derivative of the scoring value output by the online evaluation network with respect to the action value a when its input is s = s_k, a = μ(s_k); ∇_{θ^μ} μ(s|θ^μ)|_{s=s_k} is the derivative of the action value output by the online policy network with respect to θ^μ when its input is s = s_k; ∇_{θ^Q} L(θ^Q) is the derivative of the error function L(θ^Q) with respect to θ^Q; and the left-hand sides of the last four expressions are the updated third, first, fourth and second neural network parameters respectively;
Step 3.7, each completion of steps 3.4 to 3.6 finishes the training process of one step; when step < step_max, repeat steps 3.4 to 3.6; when step = step_max, the training process of one round is finished, and when m < M the next round restarts from step 3.2 to step 3.6; when m = M, the training process of all M rounds is finished;
Step 3.8, the training algorithm ends and the optimal strategy π(s_y) = a_y is output; the average reward of a training round is recorded as R_avg;
During the M training rounds, the first neural network parameters θ^μ, the second neural network parameters θ^μ′, the third neural network parameters θ^Q and the fourth neural network parameters θ^Q′ are updated in the direction that maximizes the average reward R_avg, yielding the optimal strategy π(s_y).
In order to prove the beneficial effects of the invention, the invention is simulated.
FIG. 7 is a chart showing the convergence of the average reward in the embodiment of the present invention; the abscissa in fig. 7 represents the training round number m, m = 1, 2, 3 ... 2500, and the ordinate represents the average reward. As can be seen from fig. 7, when the training round number is between 0 and 500 the average cumulative reward is small and oscillates severely, because in the early exploration phase the agent interacts with the environment randomly to collect experience data and the parameters of the policy network and the evaluation network are not yet updated, so the reward gain is small and fluctuates greatly. When the data in the experience replay pool reach the maximum capacity, i.e. from training round 500 onward, the network parameters begin to be updated, the agent gradually improves its action strategy and the average cumulative reward gradually increases, although the training process is still not very stable. When the training round number reaches m = 1100, the agent has learned the action strategy that minimizes power loss and volume while the EMI meets the standard; the average cumulative reward continues to increase and then stabilizes, the training effect reaches the optimum, and the four neural network parameters θ^μ, θ^μ′, θ^Q and θ^Q′ have been updated to obtain the optimal strategy π(s_y).
In the present embodiment, when $U_{dc}$ = 1200 volts and I = 120 amperes, training is performed for the action $a_t = (f_{sw})_t$ of the action set $A_0$. FIG. 8 shows the convergence of the action variable, the high-frequency switching frequency $f_{sw}$, in the embodiment of the present invention; the abscissa in FIG. 8 is the training round number m and the ordinate is the high-frequency switching frequency $f_{sw}$, m = 1, 2, 3 … 2500. As can be seen from FIG. 8, as the training round number m increases, the high-frequency switching frequency $f_{sw}$ first oscillates up and down, then gradually increases, and finally remains between 28000 Hz and 32000 Hz; when m = 2500 and step = 100, $f_{sw}$ = 33100 Hz is taken as the optimal action variable value, from which the efficiency η of the system reaches its maximum value of 0.9868 and the power density σ reaches 31.495 kW/L.
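For illustration only, the sketch below shows how efficiency and power density can be evaluated once the optimal switching frequency is fixed; `P_w`, `total_loss` and `system_volume` are hypothetical placeholders standing in for the analytic models of steps 1.1 and 1.2, and the numbers are not the patent's data.

```python
# Hedged sketch: evaluate efficiency eta and power density sigma for a switching frequency.
# The rated power and the loss/volume expressions below are illustrative placeholders only.
P_w = 100e3  # rated input power in W (assumed)

def total_loss(f_sw):
    """Placeholder for P_loss = P_T + P_L as a function of the switching frequency."""
    return 900.0 + 0.01 * f_sw            # W, illustrative

def system_volume(f_sw):
    """Placeholder for V = V_T + 3*V_L; the filter volume shrinks as f_sw rises."""
    return 2.0 + 4.0e4 / f_sw             # litres, illustrative

def efficiency(f_sw):
    return (P_w - total_loss(f_sw)) / P_w

def power_density(f_sw):
    return P_w / system_volume(f_sw) / 1e3  # kW per litre

f_sw_opt = 33100.0                          # optimal action found by the trained policy
print(efficiency(f_sw_opt), power_density(f_sw_opt))
```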
It will be readily appreciated by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention; any modifications, equivalents, improvements or alternatives made within the spirit and principles of the invention are intended to fall within the scope of the invention.

Claims (2)

1. An inverter multi-objective optimization method based on deep reinforcement learning, wherein the inverter comprises a direct-current voltage source (10), a three-phase three-level ANPC inverter circuit (20), a filter circuit (30) and a load (40);
the three-phase three-level ANPC inverter circuit (20) comprises two identical supporting capacitors and an inverter main circuit, the two supporting capacitors being denoted supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$; supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$ are connected in series between the direct-current positive bus P and the direct-current negative bus E of the direct-current voltage source (10), and the connection point of supporting capacitor $Cap_1$ and supporting capacitor $Cap_2$ is denoted the DC bus midpoint O;
The inverter main circuit comprises an A-phase bridge arm, a B-phase bridge arm and a C-phase bridge arm, each phase bridge arm comprising 6 switching tubes with anti-parallel diodes, i.e. the inverter main circuit comprises 18 switching tubes with anti-parallel diodes; the 18 switching tubes are denoted switching tube $S_{ij}$ and the 18 anti-parallel diodes are denoted diode $D_{ij}$, where i denotes the three phases, i = a, b, c, and j denotes the serial number of the switching tubes and diodes, j = 1, 2, 3, 4, 5, 6; the A-phase bridge arm, the B-phase bridge arm and the C-phase bridge arm are connected in parallel with one another between the direct-current positive bus P and the direct-current negative bus E; in each of the three phase bridge arms, switching tube $S_{i1}$, switching tube $S_{i2}$, switching tube $S_{i3}$ and switching tube $S_{i4}$ are connected in series in sequence, the input end of switching tube $S_{i1}$ is connected to the direct-current positive bus P, the output end of switching tube $S_{i1}$ is connected to the input end of switching tube $S_{i2}$, the output end of switching tube $S_{i2}$ is connected to the input end of switching tube $S_{i3}$, the output end of switching tube $S_{i3}$ is connected to the input end of switching tube $S_{i4}$, and the output end of switching tube $S_{i4}$ is connected to the direct-current negative bus E; the input end of switching tube $S_{i5}$ is connected to the output end of switching tube $S_{i1}$, the output end of switching tube $S_{i5}$ is connected to the DC bus midpoint O, the input end of switching tube $S_{i6}$ is connected to the DC bus midpoint O, and the output end of switching tube $S_{i6}$ is connected to the output end of switching tube $S_{i3}$; the connection point of switching tube $S_{i2}$ and switching tube $S_{i3}$ is denoted the inverter output point $\varphi_i$; switching tube $S_{i1}$, switching tube $S_{i4}$, switching tube $S_{i5}$ and switching tube $S_{i6}$ are power-frequency switching tubes with the same switching frequency, and switching tube $S_{i2}$ and switching tube $S_{i3}$ are high-frequency switching tubes with the same switching frequency;
the filter circuit (30) comprises a three-phase filter inductance $L_f$ and a three-phase filter capacitor $C_0$; one end of the three-phase filter inductance $L_f$ is connected to the inverter output point $\varphi_i$ and the other end is connected to the load (40), and the three-phase filter capacitor $C_0$ is connected in parallel between the three-phase filter inductance $L_f$ and the load (40);
the method is characterized by comprising the following specific steps:
step 1, establishing an optimization target model;
The inverter is regarded as a system, the 18 switching tubes with anti-parallel diodes are treated as 18 switching tubes and 18 anti-parallel diodes, and the loss and volume of supporting capacitor $Cap_1$, supporting capacitor $Cap_2$ and three-phase filter capacitor $C_0$ are considered negligible;
step 1.1, establishing an efficiency optimization model;
taking the efficiency η of the system as the objective, an efficiency optimization model is established; its expression is:

$\eta = \dfrac{P_w - P_{loss}}{P_w}$

where $P_{loss}$ is the total loss of the system, $P_{loss} = P_T + P_L$, $P_T$ is the total loss of the 18 switching tubes and 18 anti-parallel diodes, $P_L$ is the loss of the three-phase filter inductance $L_f$, and $P_w$ is the rated input power of the system;
step 1.2, establishing a power density optimization model;
taking the power density σ of the system as the objective, a power density optimization model is established; its expression is:

$\sigma = \dfrac{P_w}{V}$

where $P_w$ is the rated input power of the system, V is the system volume, $V = V_T + 3V_L$, $V_T$ is the total volume of the 18 switching tubes and 18 anti-parallel diodes, and $V_L$ is the magnetic-core volume of a single-phase filter inductor in the three-phase filter inductance $L_f$;
step 1.3, an EMI optimization model is established;
the EMI optimization model predicts the envelope curve of the EMI common mode noise spectrum by using an artificial neural network to represent the actual common mode noise level, and compares the predicted spectrum envelope curve with a noise amplitude curve in a standard to judge whether the standard is met;
Step 1.3.1, determining input variables and output variables of an artificial neural network;
the artificial neural network comprises a neural network 1 and a neural network 2, wherein:
the neural network 1 has 3 input variables, namely: the high-frequency switching frequency $f_{sw}$, the filter inductance $L_f$ and the common-mode inductance $L_{CM}$; the output variables of the neural network 1 are the frequencies of the 4 turning points of the inverter common-mode conducted EMI spectrum, denoted frequency $f_1$, frequency $f_2$, frequency $f_3$ and frequency $f_4$;

the neural network 2 has 4 input variables, namely: the voltage value $U_{dc}$ of the direct-current voltage source, the high-frequency switching frequency $f_{sw}$, the filter inductance $L_f$ and the common-mode inductance $L_{CM}$; the output variables of the neural network 2 are the spectral amplitudes of the inverter common-mode conducted EMI spectrum at frequency $f_1$, frequency $f_2$, frequency $f_3$ and frequency $f_4$, denoted spectral amplitude $M_{f1}$, spectral amplitude $M_{f2}$, spectral amplitude $M_{f3}$ and spectral amplitude $M_{f4}$;
Step 1.3.2, obtaining sample data required for constructing a neural network 1 model and a neural network 2 model by using computer simulation software, wherein:
the sample data required for constructing the neural network 1 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 1 input data $f_{swN}$, $L_{fN}$, $L_{CMN}$ and the corresponding neural network 1 simulated output values (the simulated values of frequency $f_1$ to frequency $f_4$ for group N), where N is the serial number of each group, N = 1, 2, 3 … K;

the sample data required for constructing the neural network 2 model comprise K groups of input data and the corresponding K groups of simulated output values, namely the neural network 2 input data $U_{dcN}$, $f_{swN}$, $L_{fN}$, $L_{CMN}$ and the corresponding neural network 2 simulated output values (the simulated values of spectral amplitude $M_{f1}$ to spectral amplitude $M_{f4}$ for group N), where N is the serial number of each group, N = 1, 2, 3 … K;
step 1.3.3, determining a neural network 1 model and a network structure of a neural network 2;
in the neural network 1 structure, the input layer contains 3 neurons, the hidden layer contains 8 neurons, and the output layer contains 4 neurons;
in the neural network 2 structure, the input layer contains 4 neurons, the hidden layer contains 11 neurons, and the output layer contains 4 neurons;
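As a point of reference for step 1.3.3, a minimal sketch of the two feed-forward networks with the stated layer sizes is given below in Python (PyTorch); the tanh hidden activation is an assumption, since the text fixes only the neuron counts.

```python
# Hedged sketch of the two MLPs of step 1.3.3 (layer sizes 3-8-4 and 4-11-4).
import torch.nn as nn

net1 = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 4))    # f_sw, L_f, L_CM -> f1..f4
net2 = nn.Sequential(nn.Linear(4, 11), nn.Tanh(), nn.Linear(11, 4))  # U_dc, f_sw, L_f, L_CM -> M_f1..M_f4
```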
step 1.3.4, grouping sample data;
the sample data obtained in step 1.3.2 are divided into a training subset, a validation subset and a test set, the training subset containing $K_1$ groups of sample data, the validation subset containing $K_2$ groups of sample data, and the test set containing $K_3$ groups of sample data, with $K_1 + K_2 + K_3 = K$;
Step 1.3.5, constructing a neural network 1 model and a neural network 2 model;
a group of input data is randomly drawn from the training subset obtained in step 1.3.4 and input into the neural network 1 and the neural network 2 to obtain the outputs corresponding to that input data; the parameters of the neural network 1 and the neural network 2 are updated using the error back-propagation gradient-descent algorithm, giving the updated neural network 1 and updated neural network 2;

the $K_2$ groups of input data of the validation subset obtained in step 1.3.4 are then input into the updated neural network 1 and updated neural network 2 respectively, obtaining the $K_2$ groups of outputs corresponding to the $K_2$ groups of input data, including the outputs of the neural network 1 and the outputs of the neural network 2, where the validation-subset group index $N_2$ runs over $N_2 = K_1+1, K_1+2, … K_1+K_2$;
the root-mean-square error δ1 of the neural network 1 and the root-mean-square error δ2 of the neural network 2 are defined as the root-mean-square errors between the validation-subset outputs of the updated networks and the corresponding simulated output values;
a first target error $e_1$ and a second target error $e_2$ are given, and the following judgment is made:

if δ1 < $e_1$ and δ2 < $e_2$, the neural network 1 model and the neural network 2 model are considered built and step 1.3.6 is entered; otherwise, return to step 1.3.5;
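The convergence test of step 1.3.5 can be expressed compactly as below; this numpy sketch assumes the root-mean-square error is taken over all validation samples and all four outputs, which is one reasonable reading of the unreproduced expressions for δ1 and δ2.

```python
# Hedged sketch of the delta1 < e1 and delta2 < e2 check of step 1.3.5.
# The exact normalization of the root-mean-square error is an assumption.
import numpy as np

def rmse(pred, target):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def networks_converged(pred1, sim1, pred2, sim2, e1, e2):
    """pred*/sim*: validation-subset predictions and simulated values of networks 1 and 2."""
    return rmse(pred1, sim1) < e1 and rmse(pred2, sim2) < e2
```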
step 1.3.6, the $K_3$ groups of input data of the test set obtained in step 1.3.4 are input into the neural network 1 model and neural network 2 model constructed in step 1.3.5, obtaining the $K_3$ groups of neural network 1 outputs and $K_3$ groups of neural network 2 outputs corresponding to the $K_3$ groups of input data, recorded respectively as the neural network 1 predicted values and the neural network 2 predicted values, where the test-set group index $N_3$ runs over $N_3 = K_1+K_2+1, K_1+K_2+2, … K$;
step 1.3.7, a group of neural network 1 actual values and neural network 2 actual values is randomly drawn;

a plane coordinate system is established with frequency as the abscissa and spectral amplitude as the ordinate; the predicted values are used to draw the inverter common-mode conducted EMI spectrum prediction envelope on this coordinate system, and the actual values are used to draw the inverter common-mode conducted EMI spectrum actual envelope on this coordinate system;

it is judged whether the inverter common-mode conducted EMI spectrum prediction envelope matches the inverter common-mode conducted EMI spectrum actual envelope: if so, the prediction of the inverter common-mode conducted EMI spectrum envelope is achieved and the prediction ends; if not, return to step 1.3.4;

matching means that the four turning points on the inverter common-mode conducted EMI spectrum prediction envelope closely coincide with the four turning points on the inverter common-mode conducted EMI spectrum actual envelope;
step 2, according to the efficiency optimization model, the power density optimization model and the EMI optimization model obtained in step 1, determine the state set S, the action set $A_0$ and the reward function R;

step 2.1, determine the state set S and the action set $A_0$;
the current time of the system is recorded as t, t = 1, 2, 3 … T, where T is the time of the system termination state; the state of the system at the current time t is recorded as state $s_t$, $s_t = (U_{dc}, I)_t$, where $U_{dc}$ is the voltage value of the direct-current voltage source (10), recorded as the direct-current voltage $U_{dc}$, and I is the effective value of the system output current, recorded as the output current I;

the state set S is the set of the T states $s_t$, S = {$s_1, s_2, … s_t, … s_T$}, and S ∈ {($U_{dc}$, I)};

the action taken by the system at time t is recorded as action $a_t$, $a_t = (f_{sw})_t$, where $f_{sw}$ is the switching frequency of the high-frequency switching tubes, recorded as the high-frequency switching frequency $f_{sw}$;

the action set $A_0$ is the set of the T actions $a_t$, $A_0$ = {$a_1, a_2, … a_t, … a_T$}, with $f_{sw\_min} \le f_{sw} \le f_{sw\_max}$, where $f_{sw\_min}$ is the lower limit of the high-frequency switching frequency $f_{sw}$ and $f_{sw\_max}$ is its upper limit;
step 2.2, determining a reward function R;
step 2.2.1, normalizing the efficiency optimization model and the power density optimization model;
since the values of the efficiency optimization model and the power density optimization model of the system are not of the same order of magnitude, normalization is carried out so that the values of the two optimization models lie between 0 and 1;

the total system loss $P_{loss}$ in the efficiency optimization model is the optimization objective $f_1$, and the system volume V in the power density optimization model is the optimization objective $f_2$;

the optimization objectives $f_\alpha$, α = 1, 2, are introduced; each optimization objective $f_\alpha$ is normalized to obtain the normalized optimization objectives $\hat{f}_1$ and $\hat{f}_2$, with the expression:

$\hat{f}_\alpha = \dfrac{f_\alpha - f_{\alpha,min}}{f_{\alpha,max} - f_{\alpha,min}}$

where $f_{\alpha,min}$ is the minimum value of the optimization objective and $f_{\alpha,max}$ is the maximum value of the optimization objective;
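A short sketch of the min-max normalization of step 2.2.1 follows; in practice the bounds would come from sweeping the loss and volume models over the design space, which is an assumption here, and the numbers are illustrative.

```python
# Hedged sketch of the normalization of step 2.2.1.
def normalize(f_value, f_min, f_max):
    """Map an optimization objective onto [0, 1]; f_min and f_max are assumed known bounds."""
    return (f_value - f_min) / (f_max - f_min)

# Illustrative use: normalized loss objective f1 and volume objective f2.
f1_hat = normalize(1200.0, f_min=800.0, f_max=2000.0)   # total loss P_loss in W
f2_hat = normalize(3.1, f_min=2.5, f_max=5.0)            # system volume V in litres
```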
step 2.2.2, giving weight to efficiency, power density and EMI, and setting a reward function R;
the bonus function R represents a weighted sum of the bonus values generated by all actions of the system from the current state to the end state, expressed as follows:
Wherein r is t For the state s of the system at time t t Take action a t The obtained single-step rewarding value is gamma which is a discount factor, wherein the discount factor gamma represents the influence degree of the length of time on the rewarding value;
when the inverter common-mode conducted EMI spectrum prediction envelope lies entirely below the noise amplitude curve of the EMI standard, the single-step reward value $r_t$ takes the form used for compliant designs;

when any part of the inverter common-mode conducted EMI spectrum prediction envelope lies above the noise amplitude curve of the EMI standard, $r_t$ takes the penalized form;

in these expressions the penalty coefficient penalizes EMI violations, $w_\alpha$ is a weight coefficient with α = 1, 2, 0 < $w_\alpha$ < 1 and $w_1 + w_2 = 1$, and c is the prize awarded when EMI meets the standard;
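The exact single-step reward expressions appear as formulas in the original filing and are not reproduced here; the sketch below shows one plausible weighted-sum form consistent with the stated coefficients (weights w1, w2, prize c and a penalty when EMI exceeds the standard) and should be read as an assumption rather than the claimed formula.

```python
# Hedged sketch of a single-step reward consistent with step 2.2.2; the exact
# functional form is not reproduced from the patent, so this is an assumed variant.
def step_reward(f1_hat, f2_hat, emi_ok, w1=0.5, w2=0.5, c=1.0, penalty=10.0):
    """f1_hat, f2_hat: normalized loss and volume (smaller is better); emi_ok: envelope below the standard."""
    base = w1 * (1.0 - f1_hat) + w2 * (1.0 - f2_hat)
    return base + c if emi_ok else base - penalty
```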
step 3, offline learning of a DDPG algorithm;
D states $s_t$ are arbitrarily drawn from the state set S to form the training data set for offline learning, D = 4T/5; according to the state set S, the action set $A_0$ and the reward function R obtained in step 2, offline learning is performed using the deep-reinforcement-learning DDPG algorithm to obtain the optimal strategy $\pi(s_y)$;
The DDPG algorithm comprises 4 neural networks; the neural network parameters of the online policy network are recorded as the first neural network parameters $\theta^{\mu}$, those of the target policy network as the second neural network parameters $\theta^{\mu'}$, those of the online evaluation network as the third neural network parameters $\theta^{Q}$, and those of the target evaluation network as the fourth neural network parameters $\theta^{Q'}$;
a training step number step and a maximum step number $step_{max}$ are given, and a training round number m and a maximum training round number M are given, step = 1, 2, 3 … $step_{max}$, m = 1, 2, 3 … M, i.e. each training round contains $step_{max}$ training steps, and M training rounds are carried out in total;
the average value of the reward function R in each training round is defined and recorded as the average reward $\bar{R}$; during each training round m, the 4 neural networks contained in the DDPG algorithm are all updated in the direction that maximizes the average reward $\bar{R}$, yielding the optimal strategy $\pi(s_y)$;
the expression of the optimal strategy $\pi(s_y)$ is as follows:

$\pi(s_y) = a_y$

where $s_y$ is the state value input to the online policy network corresponding to the optimal strategy, $s_y = (U_{dc}, I)_y$, and $(U_{dc}, I)_y$ is the direct-current voltage $U_{dc}$ and output current I in the state set S corresponding to the optimal strategy; $a_y$ is the action value output by the online policy network corresponding to the optimal strategy, recorded as the optimal action $a_y$, with $a_y = (f_{sw})_y$, where $(f_{sw})_y$ is the high-frequency switching frequency $f_{sw}$ in the action set $A_0$ corresponding to the optimal strategy $\pi(s_y)$;

the optimal action $a_y$ is output;
Step 4, according to the optimal action a y Performing application;
step 4.1, first, the states S selected from the state set S except the training data set t Reformulating an application data set and then randomly extracting j from the application data set max Individual states s t And redefined as application state s β ,β=1,2,3…j max Application state s β =(U dc ,I) β I.e. application state s β Is a direct current voltage U dc And a set of states at an output current I;
step 4.2, the optimal action a output in the step 3 is processed y Substitution j max Individual application states s β In (3) different application states s are obtained β Down-output optimal application actions
Step 4.3, applying state s β =(U dc ,I) β Optimal application actionsRespectively substituting the power density optimization model, the power density optimization model and the EMI optimization model established in the step 1 to achieve the optimal efficiency of the system on the premise of meeting the EMI standard>Optimal power density of the system->Causing any state in the system state set S to be { (U) dc Maximizing efficiency, power density, and at the same time EMI meets the criteria.
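A compact sketch of the application stage of step 4 is given below; `policy` stands in for the trained online policy network and the entries of `models` are the same kind of hypothetical placeholders used in the earlier sketches.

```python
# Hedged sketch of step 4: apply the trained policy to application states s_beta = (Udc, I)_beta
# and evaluate the optimization models; 'policy' and 'models' are supplied by the caller.
def apply_policy(policy, application_states, models):
    """policy: (Udc, I) -> f_sw; models: dict with 'efficiency', 'power_density', 'emi_ok' callables."""
    results = []
    for (u_dc, i_out) in application_states:
        f_sw_opt = policy((u_dc, i_out))            # optimal application action for this state
        results.append({
            "state": (u_dc, i_out),
            "f_sw": f_sw_opt,
            "efficiency": models["efficiency"](f_sw_opt),
            "power_density": models["power_density"](f_sw_opt),
            "emi_ok": models["emi_ok"](u_dc, f_sw_opt),
        })
    return results
```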
2. The inverter multi-objective optimization method based on deep reinforcement learning according to claim 1, characterized in that in step 3, the offline learning using the deep-reinforcement-learning DDPG algorithm to obtain the optimal strategy $\pi(s_y)$ comprises the following specific steps:
step 3.1, initialize the first neural network parameters $\theta^{\mu}$, the second neural network parameters $\theta^{\mu'}$, the third neural network parameters $\theta^{Q}$ and the fourth neural network parameters $\theta^{Q'}$, and let $\theta^{\mu'} = \theta^{\mu}$, $\theta^{Q'} = \theta^{Q}$; initialize the capacity of the experience replay pool P as D; initialize the learning rate $\alpha_Q$ of the online evaluation network, the learning rate $\alpha_\mu$ of the online policy network and the moving-average update parameter τ, with 0 < $\alpha_Q$ < 1, 0 < $\alpha_\mu$ < 1, 0 < τ < 1; the output of the online policy network is recorded as a, a = μ(s|$\theta^{\mu}$), where a is the action value output by the online policy network, a corresponds to an individual in the action set $A_0$, and a = $f_{sw}$; s is the state value input to the online policy network, s corresponds to an individual in the state set S, and s = ($U_{dc}$, I); μ is the strategy derived by the online policy network from the first neural network parameters $\theta^{\mu}$ and the input state value s;
step 3.2, the state $s_t$ of the system at time t is input to the online policy network to obtain the online policy network output $\mu(s_t|\theta^{\mu})$, and noise $\delta_t$ is added to obtain the finally output action $a_t$; the specific expression is $a_t = \mu(s_t|\theta^{\mu}) + \delta_t$;
step 3.3, based on the state $s_t$, the system executes the action $a_t$ and transitions to a new state $s_{t+1}$, obtaining at the same time the single-step reward value $r_t$ for executing the action $a_t$; $(s_t, a_t, r_t, s_{t+1})$ is called a state-transition sequence and is stored in the experience replay pool P, and at the next moment the system enters the state $s_{t+1}$;

steps 3.2 to 3.3 are executed cyclically; the number of state-transition sequences in the experience replay pool P is recorded as N; if N = D, go to step 3.4, otherwise return to step 3.2;

step 3.4, n state-transition sequences are randomly drawn from the experience replay pool P, with n < D; the n state-transition sequences are used as the mini-batch data for training the online policy network and the online evaluation network, and the k-th state-transition sequence in the mini-batch data is recorded as $(s_k, a_k, r_k, s_{k+1})$, k = 1, 2, 3 … n;
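For completeness, a minimal experience replay pool matching steps 3.3 and 3.4 could look like the following; the deque-based storage and uniform random sampling are implementation assumptions.

```python
# Hedged sketch of the experience replay pool P of steps 3.3-3.4.
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity):                  # capacity D
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):              # state-transition sequence (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((s, a, r, s_next))

    def full(self):                                # N == D: ready to start training
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, n):                           # mini-batch of n < D sequences
        return random.sample(list(self.buffer), n)
```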
step 3.5, based on the mini-batch data $(s_k, a_k, r_k, s_{k+1})$, k = 1, 2, 3 … n, obtained in step 3.4, the cumulative reward $y_k$ and the error function $L(\theta^{Q})$ are calculated; the specific expressions are as follows:

$y_k = r_k + Q'(s_{k+1}, \mu'(s_{k+1}|\theta^{\mu'})\,|\,\theta^{Q'})$

$L(\theta^{Q}) = \dfrac{1}{n}\sum_{k=1}^{n}\bigl(y_k - Q(s_k, a_k|\theta^{Q})\bigr)^{2}$

where $Q'(s_{k+1}, \mu'(s_{k+1}|\theta^{\mu'})\,|\,\theta^{Q'})$ is the score output by the target evaluation network, $\mu'(s_{k+1}|\theta^{\mu'})$ is the action value output by the target policy network, and $s_{k+1}$ is the state value input to the target evaluation network and the target policy network; $Q(s_k, a_k|\theta^{Q})$ is the score output by the online evaluation network, and $s_k$ and $a_k$ are the state value and action value input to the online evaluation network;
step 3.6, the online evaluation network updates $\theta^{Q}$ by minimizing the error function $L(\theta^{Q})$, the online policy network updates $\theta^{\mu}$ through the deterministic policy gradient $\nabla_{\theta^{\mu}}J$, and the target evaluation network and the target policy network update $\theta^{Q'}$ and $\theta^{\mu'}$ by the moving-average (soft update) method; the specific expressions are as follows:

$\nabla_{\theta^{\mu}} J = \dfrac{1}{n}\sum_{k=1}^{n} \left.\dfrac{\partial Q(s,a|\theta^{Q})}{\partial a}\right|_{s=s_k,\,a=\mu(s_k)} \cdot \left.\dfrac{\partial \mu(s|\theta^{\mu})}{\partial \theta^{\mu}}\right|_{s=s_k}$

$\theta^{Q} \leftarrow \theta^{Q} - \alpha_{Q}\,\dfrac{\partial L(\theta^{Q})}{\partial \theta^{Q}}, \qquad \theta^{\mu} \leftarrow \theta^{\mu} + \alpha_{\mu}\,\nabla_{\theta^{\mu}} J$

$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$

where $\partial$ denotes partial differentiation: $\nabla_{\theta^{\mu}} J$ is the derivative of the policy objective J with respect to $\theta^{\mu}$; $\left.\partial Q(s,a|\theta^{Q})/\partial a\right|_{s=s_k,\,a=\mu(s_k)}$ is the derivative of the score output by the online evaluation network with respect to the action value a when its inputs are s = $s_k$ and a = $\mu(s_k)$; $\left.\partial \mu(s|\theta^{\mu})/\partial \theta^{\mu}\right|_{s=s_k}$ is the derivative of the action value output by the online policy network with respect to $\theta^{\mu}$ when its input is s = $s_k$; $\partial L(\theta^{Q})/\partial \theta^{Q}$ is the derivative of the error function $L(\theta^{Q})$ with respect to $\theta^{Q}$; the left-hand sides of the update expressions are, respectively, the updated third, first, fourth and second neural network parameters;
step 3.7, each time steps 3.4 to 3.6 are completed, the training process of one step is completed; when step < $step_{max}$, steps 3.4 to 3.6 are repeated; when step = $step_{max}$, the training process of one round is completed and the next round starts again from step 3.2 to step 3.6; when m < M, steps 3.2 to 3.6 are executed repeatedly, and when m = M, the training process of the M rounds is completed and the learning process of the DDPG algorithm ends;
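The nesting of rounds and steps described in steps 3.2 to 3.7 can be summarized by the loop skeleton below; `env_reset`, `env_step`, `policy_action` and `ddpg_update` are hypothetical helpers in the spirit of the earlier sketches, not the claimed procedure.

```python
# Hedged skeleton of the training schedule of steps 3.2-3.7: M rounds of step_max steps each.
def train(env_reset, env_step, policy_action, ddpg_update, pool, M, step_max, batch_n):
    for m in range(M):                             # training rounds m = 1..M
        s = env_reset()                            # initial state s_t = (Udc, I)_t
        for step in range(step_max):               # steps within one round
            a = policy_action(s)                   # mu(s|theta_mu) plus exploration noise
            s_next, r = env_step(s, a)             # execute the action, observe the reward
            pool.store(s, a, r, s_next)            # fill the experience replay pool
            if pool.full():                        # update only once the pool holds D sequences
                ddpg_update(pool.sample(batch_n))
            s = s_next
```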
step 3.8, the training algorithm ends, yielding the optimal strategy $\pi(s_y) = a_y$; the average reward of one training round is recorded as $\bar{R}$;

in the M training rounds, the first neural network parameters $\theta^{\mu}$, the second neural network parameters $\theta^{\mu'}$, the third neural network parameters $\theta^{Q}$ and the fourth neural network parameters $\theta^{Q'}$ are updated in the direction that maximizes the average reward $\bar{R}$, yielding the optimal strategy $\pi(s_y)$.
CN202311003536.4A 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning Pending CN117057228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311003536.4A CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311003536.4A CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117057228A true CN117057228A (en) 2023-11-14

Family

ID=88665688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311003536.4A Pending CN117057228A (en) 2023-08-10 2023-08-10 Inverter multi-objective optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117057228A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313560A (en) * 2023-11-30 2023-12-29 合肥工业大学 Multi-objective optimization method for IGBT module packaging based on machine learning
CN117313560B (en) * 2023-11-30 2024-02-09 合肥工业大学 Multi-objective optimization method for IGBT module packaging based on machine learning
CN117634320A (en) * 2024-01-24 2024-03-01 合肥工业大学 Multi-objective optimization design method for three-phase high-frequency transformer based on deep reinforcement learning
CN117634320B (en) * 2024-01-24 2024-04-09 合肥工业大学 Multi-objective optimization design method for three-phase high-frequency transformer based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination