CN113885328A - Nuclear power tracking control method based on integral reinforcement learning - Google Patents


Info

Publication number
CN113885328A
Authority
CN
China
Prior art keywords
evaluation network
nuclear power
tracking error
strategy
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212559.7A
Other languages
Chinese (zh)
Inventor
仲伟峰
王蒙轩
赵晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202111212559.7A priority Critical patent/CN113885328A/en
Publication of CN113885328A publication Critical patent/CN113885328A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B13/042 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Monitoring And Testing Of Nuclear Reactors (AREA)

Abstract

The invention discloses a nuclear power tracking control method based on integral reinforcement learning, comprising: selecting an initial strategy, initializing the relevant parameters, and selecting the initial and desired power points; starting the global iteration and the local iteration, training the evaluation network with a policy-iteration integral reinforcement learning algorithm and correcting the network weights, where the evaluation network approximates the tracking error performance index function and its weights are used to evaluate the performance of the current tracking error control system, and the optimal control strategy is selected through the execution process to minimize the total cost of one global iteration; judging whether the current local iteration is complete and, if not, returning to the local iteration, otherwise updating the iterative performance index function and the tracking control law to obtain the optimal tracking control strategy; and, once the global policy iteration is complete, obtaining the optimal tracking control strategy, tracking to the desired power point, and computing the total cost. The invention can thus continuously learn and adjust the current strategy to track the desired power point.

Figure 202111212559

Description

A nuclear power tracking control method based on integral reinforcement learning

Technical Field

The embodiments of the present invention relate to the technical field of nuclear power unit power control, and in particular to a nuclear power tracking control method based on integral reinforcement learning.

Background

In recent years, the greenhouse effect and air pollution caused by coal-fired power generation have grown increasingly serious, and coal reserves are decreasing year by year. As a clean energy source, nuclear energy has the advantages of no pollution and low transportation cost; it has attracted wide attention worldwide and is being applied throughout the power generation industry. The safety of nuclear power systems has also long been a public concern, so the regulation of reactor power has become a focal issue. A stable, safe, and efficient power control method for nuclear power units is particularly important to the entire nuclear power industry.

In view of this, the present invention is proposed.

Summary of the Invention

In view of the above problems, the present invention is proposed to provide a nuclear power tracking control method based on integral reinforcement learning that at least partially solves the above problems.

In order to achieve the above object, according to one aspect of the present invention, the following technical solution is provided:

A nuclear power tracking control method based on integral reinforcement learning, the method comprising:

S1: Select an initial strategy, initialize the relevant parameters, and select the initial power point and the desired power point;

S2: Perform the global iteration, updating the iterative tracking error performance index function according to the iterative control sequence to obtain the optimal tracking error performance index function;

S3: Perform the local iteration, training the evaluation network with the integral reinforcement learning algorithm, correcting the weights of the evaluation network, and obtaining the optimal error control strategy from the optimal tracking error performance index function;

S4: Determine whether the current local iteration is complete; if not, return to the local iteration step; otherwise, update the iterative tracking error performance index function and the control law to obtain the optimal tracking error performance index function;

S5: Once the global policy iteration is complete, obtain the optimal tracking control strategy, track to the desired power point, and compute the total cost.
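The S1–S5 flow can be sketched as nested loops: a global policy iteration whose body runs a fixed-length local (critic-training) iteration, followed by a convergence check on the total cost. Everything below is an illustrative stand-in, not the patent's implementation — the toy update rule, cost, and all names are assumptions:

```python
# Structural sketch of steps S1-S5: a global iteration wrapping a local
# critic-training loop. The toy policy update converging to 1.0 and the
# quadratic cost are illustrative assumptions only.

def run_tracking_control(max_global_iters=10, max_local_iters=30, tol=1e-2):
    # S1: initial strategy, parameter initialization, power-point selection
    policy = 0.0              # stand-in for the initial stable control law
    cost_prev = float("inf")
    history = []
    for g in range(max_global_iters):          # S2: global iteration
        for k in range(max_local_iters):       # S3: local iteration trains the critic
            policy = 0.5 * (policy + 1.0)      # toy update that converges to 1.0
        # S4: local iteration finished -> update performance index / control law
        cost = (policy - 1.0) ** 2             # toy tracking-error cost
        history.append(cost)
        if abs(cost_prev - cost) < tol:        # S5: global iteration has converged
            break
        cost_prev = cost
    return policy, history

policy, history = run_tracking_control()
```

The two loop levels mirror the patent's distinction between the global strategy iteration (S2/S4/S5) and the local critic-training iteration (S3).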

Compared with the prior art, the above technical solution has at least the following beneficial effects:

The self-learning power tracking controller of the embodiments of the present invention, built from neural networks and based on an adaptive dynamic programming algorithm, can continuously learn, adjust, and adapt to different nuclear power states through real-time operation, and can track the operating points of different nuclear power units.

Brief Description of the Drawings

The accompanying drawings, which form a part of the present invention, provide a further understanding of the invention; the exemplary embodiments and their descriptions explain the invention and do not unduly limit it. Obviously, the drawings described below show only some embodiments, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a schematic diagram of the nuclear power system model according to an exemplary embodiment;

FIG. 2 is a schematic flowchart of the nuclear power unit power tracking control method based on integral reinforcement learning according to an exemplary embodiment.

Detailed Description

In order to state the objectives, technical solutions, and advantages of the present invention more clearly, the invention is described in further detail below with reference to specific examples and the accompanying drawings.

Adaptive dynamic programming (ADP) has developed rapidly since it was proposed by Paul J. Werbos in the 1980s. It is mainly used to address the "curse of dimensionality" in dynamic programming, which it resolves through repeated iterative optimization. In recent years, ADP algorithms have shown great advantages in solving optimal control problems. ADP methods generally use an actor-critic structure with neural networks to approximate the tracking error performance index function and the control strategy, approaching the analytical solution of the equation step by step through iteration and finally converging to the optimal tracking error performance index function and the optimal tracking control strategy.

The adaptive dynamic programming method uses a function-approximation structure (e.g., a neural network) to approximate the tracking error performance index function and the control strategy in the dynamic programming equation so as to satisfy the optimality principle, thereby obtaining the system's optimal error control and optimal tracking error performance index function. The adaptive dynamic programming structure mainly comprises the dynamic system, the actor network, and the evaluation (critic) network. The evaluation network approximates the optimal cost function and provides an evaluation that guides the actor network in producing the optimal control. After the actor network's output acts on the dynamic system, the rewards/penalties generated at different stages of the dynamic system influence the evaluation network, until the actor network updates the control strategy so that the overall cost (i.e., the sum of rewards/penalties) is optimal.

The integral reinforcement learning adaptive dynamic programming method does not rely on a system model; instead, it adjusts the weights of the actor and critic neural networks based on the system states generated in real time and the corresponding control actions. As a result, the method can run online, with the actor and critic networks iteratively converging to the optimal control strategy and the optimal tracking error performance index function. It is especially suitable for solving optimal control problems online for linear or nonlinear continuous systems.

FIG. 1 is a schematic diagram of the nuclear power system to which an embodiment of the present invention is applied; it schematically shows the reaction heat-transfer model of the nuclear power system. The nuclear power system consists of one reactor and two cooling loops. Q represents only heat transfer and carries no further meaning in the nuclear power system model. The system has five states: Power percentage is the percentage of the system's generated power (full-load generation is 2500 MW); Delayed neutron concentration is the relative concentration of delayed neutrons inside the reactor vessel; Reactor core temperature is the average temperature of the reactor core (also denoted T_f); Coolant outlet temperature is the average temperature of the coolant inside the system; Reactor coefficient is the reactivity change caused by moving the control rod up and down. The system is controlled solely by the control rod speed: when the rod moves up or down at a given speed, the reaction inside the reactor core changes accordingly. The faster the rod moves upward, the more intense the reaction becomes; when the rod moves downward, the opposite holds.

As shown in FIG. 2, an embodiment of the present invention provides a nuclear power system power tracking control method based on integral reinforcement learning; the method may include steps S1 to S5.

S1: The initialization parameters include: nuclear power system parameters, evaluation network parameters, global iteration duration, integration time constant, local iteration duration, convergence accuracy, and target parameters. The nuclear power system parameters are the parameters of the nuclear power plant's power model, which includes five system input/output states.

The nuclear power system model mainly comprises the neutron kinetics equations inside the core, two temperature feedback models of the reactor, and the reactivity equation of the control rod. In reactor characteristic studies, control-rod control is most commonly used, because control rods have a strong neutron absorption capacity, their moving speed is easy to control, operation is convenient, and reactivity control is accurate. The influence of the control rod on reactivity can be expressed in two ways: changes in position and changes in speed.

In addition, the initial power operating point and the desired power operating point must be selected, and an initial stable control strategy determined. The following parameters are also initialized: the global training time step, the local iteration time step, the neural network structure (such as the number of input nodes, hidden-layer nodes, and output-layer nodes), and the neural network weights.

For example, the structure of the evaluation network is set to 5-15-1, where 5 is the number of input nodes, 15 the number of hidden-layer nodes, and 1 the number of output-layer nodes of the evaluation network; the number of hidden-layer nodes can be adjusted empirically for the best approximation, and the convergence accuracy is defined as 1.0×10^{-2}.
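A minimal sketch of such a 5-15-1 evaluation (critic) network follows. The tanh hidden activation, linear output, and random initialization are assumptions — the patent specifies only the layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# 5-15-1 critic: 5 inputs (tracking-error states), 15 hidden nodes,
# 1 output (the tracking error performance index J_e).
W1 = rng.normal(scale=0.1, size=(15, 5))   # input -> hidden weights
b1 = np.zeros(15)
W2 = rng.normal(scale=0.1, size=(1, 15))   # hidden -> output weights
b2 = np.zeros(1)

def critic(x_e):
    """Approximate J_e for an error state x_e of shape (5,)."""
    h = np.tanh(W1 @ x_e + b1)             # hidden layer (tanh is an assumption)
    return (W2 @ h + b2)[0]                # scalar J_e estimate

x_e = np.array([0.05, 0.0, 0.1, -0.02, 0.0])   # hypothetical tracking-error state
J = critic(x_e)
```

Changing the hidden-layer size to tune approximation quality, as the text suggests, only changes the `15` in the two weight shapes.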

In the execution stage, the embodiments of the present invention use simplified finite-dimensional control variables, i.e., a finite, fixed set of nuclear power operating points is set for tracking.

In practical applications, the initial and desired operating points can be chosen according to actual needs, and the power model and parameter settings of the nuclear power unit must likewise be physically meaningful.

S2: During global training, update the iterative tracking error performance index function according to the iterative control sequence to obtain the optimal tracking error performance index function.

Specifically, the integral reinforcement learning method of the controller requires weight-initialization training of the evaluation network.

Train the evaluation network with the integral reinforcement learning algorithm. The inputs of the evaluation network are: the five states x(t) of the nuclear power unit's operating point, the five states x_d(t) of the desired operating point, and the tracking error control strategy u_e(t); the output is the tracking error performance index function J_e(t), abbreviated as the J function. The optimal tracking error control strategy u_e(t) is obtained approximately from the tracking error performance index function given by the evaluation network.

The weight initialization of the evaluation network is performed within the global iteration. Preferably, the weights can be re-initialized at the start of each global iteration, which better guarantees the convergence of the evaluation network while preserving its stability and convergence speed, so that the optimal tracking control strategy for the nuclear power system can be found as quickly as possible.

In the execution stage, the input data of the evaluation network are the difference x_e(t) between the five state outputs x(t) of the nuclear power unit and the desired power point x_d(t), together with the optimal tracking error control strategy u_e(t) obtained from the trained evaluation network. The output of the evaluation network is the tracking error performance index function J_e(t).

According to the Bellman equation, the output J_e(t+T) of the evaluation network at the next moment and the utility function U(t) are used to compute the output data J_e(t) at the current moment:

J_e(t) = ∫_t^{t+T} U(τ) dτ + J_e(t+T)

The global iterative J_e function is updated using the global iterative error control law

u_e(t) = -(1/2) R^{-1} g^T(x_e) ∂J_e/∂x_e
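The integral Bellman relation J_e(t) = ∫_t^{t+T} U(τ) dτ + J_e(t+T) can be checked numerically. The decaying error trajectory, the proportional control, and the value of J_e(t+T) below are all illustrative assumptions; only the quadratic utility form follows the surrounding text:

```python
import numpy as np

T = 1.0                                     # integration interval
tau = np.linspace(0.0, T, 1001)             # grid over [t, t+T]

# Quadratic utility U = alpha*x_e^2 + beta*u_e^2 on an assumed trajectory
alpha, beta = 1.0, 0.5
x_e = np.exp(-tau)                          # decaying tracking error (assumed)
u_e = -0.5 * x_e                            # proportional error control (assumed)
U = alpha * x_e**2 + beta * u_e**2

J_next = 2.0                                # critic output J_e(t+T) (assumed value)
dt = tau[1] - tau[0]
J_now = np.sum((U[1:] + U[:-1]) / 2) * dt + J_next   # trapezoidal integral Bellman step
```

For this trajectory U(τ) = (alpha + 0.25·beta)·e^{-2τ}, so the integral has the closed form (alpha + 0.25·beta)(1 − e^{-2T})/2, which the trapezoidal sum matches closely.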

The process of obtaining the optimal tracking error performance index function is illustrated in detail below.

At time t, let x(t) denote the five input/output states of the nuclear power unit, x_d(t) the desired power point, x_e(t) the system tracking error, and u_e(t) the tracking error control strategy. The error control system can be defined as:

x_e(t+1) = f(x(t) - x_d(t), u_e(t), t)

where f can be derived from the nuclear power unit power model. The utility function is defined as follows:

U(t) = α[x_e(t)]^2 + β[u_e(t)]^2

where α and β are constants, and u_e(t) is the difference between the nuclear power unit's control law at the current time and the desired operating control law. The utility function U(t) represents, at time t, the combined utility of the deviation between the current and desired operating points and of the control rod control law.

We now give the utility function a more general form:

U(x_e, u_e) = x_e^T Q x_e + u_e^T R u_e

where Q and R are positive definite matrices. The global tracking error performance index function can then be defined as:

J_e(x_e(t)) = ∫_t^∞ U(x_e(τ), u_e(τ)) dτ

The Hamiltonian can be derived as follows:

H(x_e, u_e, ∇J_e) = U(x_e, u_e) + (∇J_e)^T ẋ_e

Then there exists an optimal tracking error performance index function J_e*(x_e) satisfying the Hamilton–Jacobi–Bellman equation

min_{u_e} H(x_e, u_e, ∇J_e*) = 0

Its optimal tracking error control law can be expressed as:

u_e*(x_e) = -(1/2) R^{-1} g^T(x_e) ∇J_e*(x_e)

Define an initial error control law u_e^0. For the iterative value function J_e^{i+1} and control law u_e^i, we have the policy evaluation step

J_e^{i+1}(x_e(t)) = ∫_t^{t+T} U(x_e(τ), u_e^i(τ)) dτ + J_e^{i+1}(x_e(t+T))

where i = 0, 1, 2, …; the error tracking control law is then obtained from the policy improvement step

u_e^{i+1}(x_e) = -(1/2) R^{-1} g^T(x_e) ∇J_e^{i+1}(x_e)

As i → ∞, J_e^i converges to the optimal value.

S3: Perform the local iteration, training the evaluation network with the integral reinforcement learning algorithm, correcting the weights of the evaluation network, and obtaining the optimal error control strategy from the optimal tracking error performance index function.

The goal of the local training iterations is to obtain the optimal value J_e*.

Given an initial stable control strategy, let its control law be u_e^0. Set the integration interval T = 1 and the local training iteration length to 30 steps.

The tracking error performance index function update rule is:

J_e^{i+1}(x_e(t)) = ∫_t^{t+T} U(x_e(τ), u_e^i(τ)) dτ + J_e^{i+1}(x_e(t+T))

The optimal error control law update rule is:

u_e^{i+1}(x_e) = -(1/2) R^{-1} g^T(x_e) ∇J_e^{i+1}(x_e)

As i → ∞, J_e^i converges to the optimal value J_e*.

Then, the weights of the evaluation network are updated to approximate the optimal tracking error performance index function.

The update rule is as follows: the evaluation network weights are computed by batch least squares,

W_CL = -(X^T X)^{-1}(X^T Y)

where ΔW_C is the weight-vector deviation of the evaluation network, X is the inner-product difference of the evaluation network's weight vectors, Y is the utility function value approximated by the evaluation network, and W_CL is the weight vector of the evaluation network.
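The batch least-squares step W_CL = −(XᵀX)⁻¹(XᵀY) can be sketched directly. The synthetic regression data below are an assumption, used only to show that the formula exactly recovers the weights that fit Y = −X·W:

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_weights = 200, 15               # e.g. 15 hidden-layer features of the critic
X = rng.normal(size=(n_samples, n_weights))  # stand-in for the feature-difference matrix
W_true = rng.normal(size=n_weights)          # weights the synthetic targets are built from
Y = -X @ W_true                              # target values (sign chosen to match the formula)

# W_CL = -(X^T X)^{-1} (X^T Y): solve the normal equations rather than
# forming the inverse explicitly, which is the numerically preferred route.
W_CL = -np.linalg.solve(X.T @ X, X.T @ Y)
```

With real critic data, X would hold differences of hidden-layer feature vectors between times t and t+T and Y the integrated utilities, so the same one-line solve fits the integral Bellman equation in batch.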

Since the error control strategy and the tracking error performance index function change with the weights of the actor and critic neural networks, adjusting those weights amounts to updating the error control strategy and the tracking error performance index function. In the execution stage, the finite set of control variables is substituted into the optimal tracking error performance index function J_e* approximated by the evaluation network.

The optimal error control strategy is obtained approximately from the tracking error performance index function given by the evaluation network: the control variable that minimizes the optimal tracking error performance index function is selected as the optimal tracking error control strategy:

u_e*(t) = arg min_{u_e} J_e*(x_e(t), u_e)

The evaluation network is used to approximate the optimal tracking error performance index function, and its weights are used to evaluate the performance of the current nuclear power control rod system; the optimal tracking control strategy is selected through the execution process to minimize the total tracking error cost of the global training.
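Because the execution stage restricts the control to a finite set, strategy selection reduces to an argmin over critic evaluations. The candidate control-rod speeds and the quadratic stand-in for the critic below are illustrative assumptions:

```python
import numpy as np

# Finite set of admissible control-rod speeds (illustrative values)
u_candidates = np.linspace(-1.0, 1.0, 21)

def J_e(x_e, u_e):
    """Stand-in for the critic's tracking error performance index (assumed quadratic)."""
    return (x_e + 0.5 * u_e) ** 2 + 0.1 * u_e ** 2

x_e = 0.2                                      # current tracking error (assumed)
costs = [J_e(x_e, u) for u in u_candidates]
u_star = u_candidates[int(np.argmin(costs))]   # selected tracking error control strategy
```

This is the discrete counterpart of u_e* = arg min J_e: no gradient of the critic is needed, which suits the simplified finite-dimensional control variables the text describes.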

S4: Determine whether the current local iteration is complete; if not, return to the local iteration step; otherwise, update the iterative tracking error performance index function and the error control law to obtain the optimal tracking error performance index function.

Specifically, after completing a local iteration, determine whether the current iteration count has reached the iteration threshold; if so, update the iterative tracking error performance index function and the error control law to obtain the optimal tracking error performance index function and the optimal error control strategy.

If not yet complete, execute step S3; otherwise, execute step S5.

S5: Once the global policy iteration is complete, obtain the optimal tracking error control strategy, track to the desired power point, and compute the total cost (the tracking error plus the control rod control cost).

Computing the total cost requires substituting the optimal tracking error control strategy u_e* into the actual model. Since the definition of the utility function U(x_e, u_e) depends on the actual model, the total cost can be approximated by the finally obtained optimal tracking error performance index function J_e*.

Although the steps of this embodiment are described in the above order, those skilled in the art will understand that, to achieve the effects of this embodiment, different steps need not be performed in this order; they may be performed simultaneously (in parallel) or in reverse order, and such simple variations fall within the protection scope of the present invention. The technical solutions provided by the embodiments of the present invention have been described in detail above. Although specific examples are used herein to explain the principles and implementations of the invention, the description of the above embodiments serves only to aid understanding of those principles; meanwhile, those skilled in the art may make changes to the specific implementation and scope of application in accordance with the embodiments of the invention.

It should be noted that the flowcharts involved herein are not limited to the forms shown and may be divided and/or combined.

It should be noted that the reference signs and text in the drawings serve only to illustrate the invention more clearly and are not to be regarded as unduly limiting its protection scope.

The present invention is not limited to the above embodiments; any modification, improvement, or substitution conceivable by those of ordinary skill in the art without departing from the essence of the invention falls within its protection scope.

Claims (8)

1. A power tracking control method for a nuclear power system based on integral reinforcement learning, characterized in that the method comprises:
S1: selecting an initial strategy, initializing the related parameters, and selecting the initial power point and the desired power point;
S2: performing a global iteration, updating the iterative tracking error performance index function according to the iterative control sequence to obtain the optimal tracking error performance index function;
S3: performing a local iteration, training an evaluation network with an integral reinforcement learning algorithm, correcting the weights of the evaluation network, and obtaining the optimal tracking control strategy from the optimal tracking error performance index function;
S4: judging whether the current local iteration is completed; if not, returning to the local iteration step; otherwise, updating the iterative tracking error performance index function and the tracking control law to obtain the optimal tracking error performance index function;
S5: completing the global strategy iteration, obtaining the optimal tracking control strategy, tracking to the desired power point, and calculating the total cost.

2. The method according to claim 1, wherein in step S1 the initialization parameters comprise: nuclear power system parameters, evaluation network parameters, global iteration duration, integration time constant, local iteration duration, convergence accuracy and target parameters; wherein the nuclear power system parameters are the parameters of the nuclear plant's power model, the model comprising five system input/output states.

3. The method according to claim 2, wherein the structure of the evaluation network is set to 5-15-1 and the convergence accuracy is defined as 1.0×10^-2, wherein 5 is the number of input nodes of the evaluation network, 15 is the number of hidden-layer nodes, and 1 is the number of output-layer nodes.

4. The method according to claim 1, wherein step S1 further comprises selecting an initial control strategy; this error control strategy can be obtained from a conventional PID or MPC strategy, the purpose being to obtain an initially stabilizing control law.

5. The method according to claim 1, wherein in step S3 the input data of the evaluation network comprise the tracking error value x_e(t) between the five working states x(t) of the nuclear power unit and the working state point x_d(t) of the desired power, and the tracking control strategy u_e(t) of the nuclear power control rods; the output data of the evaluation network comprise the tracking error performance index function J_e(t);
according to the Bellman equation, the output J_e(t+T) of the evaluation network at the next integration instant and the utility function U(t) are used, and the output data J_e(t) at the current instant is calculated by the following formula:

J_e(t) = ∫_t^{t+T} U(τ) dτ + J_e(t+T)

wherein x_e(t) is the tracking error between the five working states x(t) of the nuclear power unit and the working state point x_d(t) of the desired power; the utility function U(t) represents the combined utility at time t of the tracking error value x_e(t) and the tracking control strategy u_e(t) of the nuclear power control rods.
6. The method according to claim 5, wherein the utility function U(t) is calculated as:

U(t) = α[x_e(t)]^2 + β[u_e(t)]^2

wherein α and β are constants, and u_e(t) is the difference between the control law of the nuclear power unit at the current time and the desired operating control law.

7. The method according to claim 1, wherein in step S3 the input data of the execution stage of the evaluation network comprise the relative power coefficient of the controlled nuclear power unit, the relative delayed-neutron concentration, the average reactor core temperature, the average coolant temperature and the reactivity of the control rods; the output data of the execution stage of the evaluation network comprise the optimal tracking control strategy, wherein the optimal tracking control strategy is approximated from the tracking error performance index function obtained by the evaluation network.

8. The method according to claim 1, wherein in step S3 the update rule of the evaluation network is as follows:
over each integration interval the Bellman residual of the evaluation network is formed as

e(t) = Ŵ_C^T [φ(x_e(t)) − φ(x_e(t+T))] + ∫_t^{t+T} U(τ) dτ

where φ(·) is the activation vector of the evaluation network and Ŵ_C its current weight vector; the activation differences of the sampled intervals are stacked into the matrix X and the corresponding integrated utility values into the vector Y, and the squared residual is minimized in the batch least-squares sense, yielding

W_CL = -(X^T X)^{-1}(X^T Y)

wherein the residual e(t) reflects the weight-vector deviation of the evaluation network, X is the inner-product difference term of the evaluation network, Y is the utility-function value approximated by the evaluation network, and W_CL is the weight vector of the evaluation network.
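Claims 5, 6 and 8 together describe one batch critic update: integrate the utility over each interval [t, t+T], pair it with the activation difference of the evaluation network, and solve for the weights in least squares. The following Python sketch illustrates that pipeline on synthetic data; the feature map phi, the centers C, and the values of alpha, beta, N and T are all assumptions made for the sketch, not details taken from the patent:

```python
import numpy as np

# Illustrative sketch of the evaluation-network (critic) update of claims
# 5, 6 and 8, run on synthetic data rather than a reactor model.

rng = np.random.default_rng(0)
alpha, beta, T = 1.0, 0.5, 0.1        # utility weights (claim 6), interval length

C = rng.normal(size=(15, 5))          # assumed centers: 15 hidden nodes, 5-dim state (claim 3: 5-15-1)

def phi(x):
    # assumed Gaussian hidden-layer activations over the 5-dim tracking error
    return np.exp(-np.sum((x - C) ** 2, axis=1) / 5.0)

def utility(xe, ue):
    # U(t) = alpha*[x_e(t)]^2 + beta*[u_e(t)]^2 (claim 6), x_e summed over states
    return alpha * float(xe @ xe) + beta * ue ** 2

N = 40
X = np.zeros((N, 15))                 # rows: phi(x_e(t)) - phi(x_e(t+T))
Y = np.zeros(N)                       # rows: integral of U over [t, t+T] (claim 5)
for i in range(N):
    xe_t = rng.normal(size=5)         # synthetic tracking error at time t
    xe_tT = 0.9 * xe_t                # synthetic tracking error at time t+T
    ue = float(rng.normal())          # synthetic control deviation on the interval
    X[i] = phi(xe_t) - phi(xe_tT)
    # trapezoidal quadrature of the running utility over the interval
    Y[i] = 0.5 * T * (utility(xe_t, ue) + utility(xe_tT, ue))

# batch least-squares solution of the Bellman residuals W^T sigma_i + y_i ≈ 0,
# i.e. W_CL = -(X^T X)^{-1} (X^T Y) as stated in claim 8
W_CL = -np.linalg.solve(X.T @ X, X.T @ Y)
print(W_CL.shape)                     # → (15,)
```

The normal-equation form mirrors the W_CL formula of claim 8; in practice `np.linalg.lstsq(X, -Y)` is the numerically safer equivalent when X^T X is ill-conditioned.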
CN202111212559.7A 2021-10-18 2021-10-18 Nuclear power tracking control method based on integral reinforcement learning Pending CN113885328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212559.7A CN113885328A (en) 2021-10-18 2021-10-18 Nuclear power tracking control method based on integral reinforcement learning


Publications (1)

Publication Number Publication Date
CN113885328A true CN113885328A (en) 2022-01-04

Family

ID=79003527

Country Status (1)

Country Link
CN (1) CN113885328A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103217899A (en) * 2013-01-30 2013-07-24 中国科学院自动化研究所 Q-function self-adaptation dynamic planning method based on data
CN104022503A (en) * 2014-06-18 2014-09-03 中国科学院自动化研究所 Electric-energy optimal control method for intelligent micro-grid with energy storage device
CN105843037A (en) * 2016-04-11 2016-08-10 中国科学院自动化研究所 Q-learning based control method for temperatures of smart buildings
US20190384237A1 (en) * 2018-06-13 2019-12-19 Mitsubishi Electric Research Laboratories, Inc. System and Method for Data-Driven Output Feedback Control
CN111650830A (en) * 2020-05-20 2020-09-11 天津大学 A robust tracking control method for quadrotor aircraft based on iterative learning
CN111679577A (en) * 2020-05-27 2020-09-18 北京交通大学 A speed tracking control method and automatic driving control system for a high-speed train


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114967445A (en) * 2022-04-29 2022-08-30 中国科学院自动化研究所 Nuclear power system control method and device
CN114880942A (en) * 2022-05-23 2022-08-09 西安交通大学 A Reinforcement Learning Decoupling Control Method for Nuclear Reactor Power and Axial Power Distribution
CN114880942B (en) * 2022-05-23 2024-03-12 西安交通大学 Nuclear reactor power and axial power distribution reinforcement learning decoupling control method
CN117075588A (en) * 2023-10-18 2023-11-17 北京网藤科技有限公司 Safety prediction fitting method and system for industrial automation control behaviors
CN117075588B (en) * 2023-10-18 2024-01-23 北京网藤科技有限公司 Safety prediction fitting method and system for industrial automation control behaviors
CN118494790A (en) * 2024-07-15 2024-08-16 北京易动宇航科技有限公司 Ammonia working medium thruster thrust stability control method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220104