WO2021071304A1 - Stabilized nonlinear optimal control method - Google Patents

Stabilized nonlinear optimal control method Download PDF

Info

Publication number
WO2021071304A1
WO2021071304A1 (PCT/KR2020/013781)
Authority
WO
WIPO (PCT)
Prior art keywords
policy
algorithm
stabilized
control method
optimal control
Prior art date
Application number
PCT/KR2020/013781
Other languages
French (fr)
Korean (ko)
Inventor
이종민
임산하
김연수
이병준
배신영
Original Assignee
서울대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교산학협력단
Publication of WO2021071304A1

Links

Images

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04 Adaptive control systems, electric, involving the use of models or simulators
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 17/00 Systems involving the use of models or simulators of said systems
    • G05B 17/02 Systems involving the use of models or simulators of said systems, electric
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems

Definitions

  • the present invention relates to a stabilized nonlinear optimal control method.
  • the present invention provides a nonlinear optimal control method that ensures stability.
  • the stabilized nonlinear optimal control method includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  • the policy iteration algorithm may be the exact policy iteration algorithm described below.
  • in the exact policy iteration algorithm, the policy evaluation part solves the Lyapunov equation to compute a control Lyapunov function that evaluates the cost incurred under the current stabilizing control input, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the policy iteration algorithm may be the first approximate policy iteration algorithm described below.
  • in the first approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input and updates the weights of a value function approximated by a linear artificial neural network in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the policy iteration algorithm may be the second approximate policy iteration algorithm described below.
  • in the second approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input and updates the weights of a value function approximated by a deep neural network in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the stabilized nonlinear optimal control method guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula.
  • the policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability.
  • the policy iteration algorithm exploits the relationship between stabilizing controllers and the optimal controller to guarantee stability while learning the optimal controller.
  • FIGS. 1 and 2 show results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • FIGS. 3 and 4 show results of applying the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • the stabilized nonlinear optimal control method includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  • a control Lyapunov function is a continuously differentiable function V_c(x) that satisfies a negativity condition for a control-affine system, as stated in the description.
  • if V_c has the same level sets as the optimal value function, the Sontag controller coincides with the optimal controller; based on this, in the policy iteration algorithm the policy evaluation part restricts the approximate value function to control Lyapunov functions while updating the weights of the approximation in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed without introducing an additional actor network or actor update.
  • the exact policy iteration algorithm consists of the following two main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network.
  • the controller obtained by applying Sontag's formula always guarantees stability.
  • by contrast, stabilizing the system with the conventional LgV-type optimal formula requires the value function to satisfy additional conditions beyond the control Lyapunov function condition, so Sontag's formula is preferable for ease of guaranteeing stability.
  • the stabilized nonlinear optimal control method according to embodiments of the present invention may use approximate policy iteration algorithms that guarantee stability for two classes of approximate value functions.
  • the first approximate policy iteration algorithm linearly approximates the value function using exact basis functions and consists of the following three main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
  • the second approximate policy iteration algorithm approximates the value function with a deep artificial neural network and consists of the following three main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
  • Application Example 1 applies the approximate policy iteration algorithm that uses exact basis functions as the approximation function.
  • the optimal value function is expressed as a linear combination of the chosen basis functions.
  • the optimal value function and the optimal input are given by known closed-form expressions for this example.
  • FIGS. 1 and 2 show results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • FIG. 1 shows the result of Case 1 and FIG. 2 shows the result of Case 2; the two cases differ in the initial weights of the approximation function.
  • based on the previously trained deep-neural-network approximate value function, the Sontag-formula control input was tested over 100 episodes, with the initial state randomly sampled from the domain D.
  • the cost of test episode i, starting from the initial state x_i and evolving under the controller u(x), is denoted J_i.
  • J_i can be regarded as an infinite-horizon cost because the state is stabilized to the origin within 5 dimensionless time units (50 steps).
  • the stabilized nonlinear optimal control method guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula.
  • the policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability.
  • the policy iteration algorithm exploits the relationship between stabilizing controllers and the optimal controller to guarantee stability while learning the optimal controller.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Feedback Control In General (AREA)

Abstract

A stabilized nonlinear optimal control method is provided. The stabilized nonlinear optimal control method comprises the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.

Description

Stabilized nonlinear optimal control method
The present invention relates to a stabilized nonlinear optimal control method.
Recently, research on reinforcement learning, which learns optimal policies based on artificial intelligence techniques, has been actively conducted in the field of computer science. In game domains such as AlphaGo, where these algorithms are widely used, there is little concern about stability, so algorithm development has focused mainly on optimality. For real systems such as chemical plants or robots, however, stability must be guaranteed before optimality. Existing studies have tried to guarantee stability by introducing an additional actor network in addition to the critic network. Most of these algorithms, however, only design actor-network update rules for single-layer neural networks and are difficult to apply to real systems.
In order to solve the above problems, the present invention provides a nonlinear optimal control method that guarantees stability.
Other objects of the present invention will become apparent from the following detailed description and the accompanying drawings.
A stabilized nonlinear optimal control method according to embodiments of the present invention includes the step of performing a policy iteration algorithm (Policy Iteration Algorithm) using a control Lyapunov function (CLF) and Sontag's formula.
The policy iteration algorithm may be the following exact policy iteration algorithm.
[Exact Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000001
In the exact policy iteration algorithm, the policy evaluation part solves the Lyapunov equation to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000003) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000002), and the policy update part uses Sontag's formula to guarantee stability during and after learning.
The policy iteration algorithm may be the following first approximate policy iteration algorithm.
[First Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000004
In the first approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000005) and updates the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000006) in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula to guarantee stability during and after learning.
The policy iteration algorithm may be the following second approximate policy iteration algorithm.
[Second Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000007
In the second approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000008) and updates the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000009) in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula to guarantee stability during and after learning.
A stabilized nonlinear optimal control method according to embodiments of the present invention guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula. The policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability. By exploiting the relationship between stabilizing controllers and the optimal controller, the policy iteration algorithm guarantees stability while learning the optimal controller.
FIGS. 1 and 2 show the results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
FIGS. 3 and 4 show the results of applying the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
FIG. 5 shows the results of applying the algorithm to 100 test episodes.
Hereinafter, the present invention will be described in detail through embodiments. The objects, features, and advantages of the present invention will be readily understood from the following embodiments. The present invention is not limited to the embodiments described herein and may be embodied in other forms. The embodiments introduced here are provided so that the disclosure will be thorough and complete and will fully convey the spirit of the invention to those of ordinary skill in the art. Therefore, the present invention should not be limited by the following embodiments.
A stabilized nonlinear optimal control method according to embodiments of the present invention includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
[Control Lyapunov Function and Sontag's Formula]
A control Lyapunov function is a continuously differentiable function V_c(x) that, for a control-affine system, satisfies the following condition:
L_f V_c(x) < 0 for all x such that L_g V_c(x) = 0.
When a control Lyapunov function is known, the control input u can be designed with the following modified Sontag's formula (Figure PCTKR2020013781-appb-I000010):
Figure PCTKR2020013781-appb-I000011
Such a controller stabilizes the system. If V_c has the same level sets as the optimal value function (Figure PCTKR2020013781-appb-I000012), the Sontag controller (Figure PCTKR2020013781-appb-I000013) coincides with the optimal controller (Figure PCTKR2020013781-appb-I000014). Based on this, when carrying out the policy iteration algorithm, the policy evaluation part restricts the approximate value function to control Lyapunov functions while updating the weights of the approximation in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula, so that stability is guaranteed without introducing an additional actor network or actor update.
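For reference only, since the equation images (Figure PCTKR2020013781-appb-I000010 and Figure PCTKR2020013781-appb-I000011) are not reproduced in this text: the standard, unmodified form of Sontag's universal formula for a control-affine system dx/dt = f(x) + g(x)u with control Lyapunov function V_c is the expression below, written in LaTeX. The invention uses a modified variant of it, whose exact form is the one shown in the figure.
[Reference: standard Sontag's formula (not the modified form used in the invention)]
% a(x) = L_f V_c(x),  b(x) = (L_g V_c(x))^T
u_s(x) =
\begin{cases}
-\dfrac{a(x) + \sqrt{a(x)^{2} + \lVert b(x)\rVert^{4}}}{\lVert b(x)\rVert^{2}}\, b(x), & b(x) \neq 0, \\
0, & b(x) = 0.
\end{cases}
With this choice, whenever b(x) is nonzero the derivative of V_c along the closed loop equals -sqrt(a(x)^2 + ||b(x)||^4) < 0, and when b(x) = 0 the CLF condition itself gives a(x) < 0, which is why such a controller stabilizes the system.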
[Exact Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000015
The exact policy iteration algorithm consists of the following two main elements.
1) In the policy evaluation part, the Lyapunov equation is solved to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000017) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000016).
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network.
When the value function is restricted to control Lyapunov functions, the controller obtained by applying Sontag's formula always guarantees stability. By contrast, stabilizing the controller with the conventional LgV-type optimal formula requires the value function not only to satisfy the control Lyapunov function condition but also to meet additional conditions. Sontag's formula is therefore preferable for ease of guaranteeing stability.
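As a concrete illustration (not part of the patent text), a minimal Python sketch of the policy-update step, computing the control input from a candidate control Lyapunov function via the standard Sontag formula, could look as follows. The example system, the quadratic CLF candidate, and all function names are assumptions made for illustration only; the invention itself uses a modified Sontag formula whose exact form is given in the equation figures.
[Illustrative sketch in Python, assuming the standard Sontag formula]
import numpy as np

def sontag_control(x, f, g, grad_Vc, eps=1e-8):
    """Standard Sontag-formula controller for dx/dt = f(x) + g(x) u,
    given the gradient of a control Lyapunov function V_c."""
    dV = grad_Vc(x)            # gradient of V_c at x, shape (n,)
    a = dV @ f(x)              # L_f V_c(x), scalar
    b = g(x).T @ dV            # (L_g V_c(x))^T, shape (m,)
    b2 = b @ b                 # ||b||^2
    if b2 < eps:               # L_g V_c(x) = 0: the CLF condition already gives a < 0, so u = 0 suffices
        return np.zeros_like(b)
    return -(a + np.sqrt(a**2 + b2**2)) / b2 * b

# Toy example: scalar system dx/dt = x^3 + u with quadratic CLF candidate V_c(x) = 0.5 x^2
f = lambda x: np.array([x[0]**3])
g = lambda x: np.array([[1.0]])
grad_Vc = lambda x: np.array([x[0]])
u = sontag_control(np.array([1.5]), f, g, grad_Vc)   # a stabilizing input at x = 1.5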
A stabilized nonlinear optimal control method according to embodiments of the present invention may use approximate policy iteration algorithms that guarantee stability for two classes of approximate value functions.
[First Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000018
The first approximate policy iteration algorithm linearly approximates the value function using exact basis functions and consists of the following three main elements.
1) In the policy evaluation part, the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000019) are collected and the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000020) are updated in the direction that reduces the Bellman error; a numerical sketch of this evaluation step is given after this list. If the updated value function does not satisfy the control Lyapunov function condition, the weight update is repeated until the condition is satisfied.
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
3) The value function linearly approximated with the exact basis functions converges to the optimal value function when the algorithm is applied.
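As an illustration only, the policy-evaluation step for a linearly parameterized value function V(x) = w^T phi(x) can be written as a linear least-squares problem on the differential Bellman residual over the collected states. The quadratic stage cost q(x) + u^T R u and all names below are assumptions made for the sketch, and the CLF-condition re-check performed after each update is omitted here.
[Illustrative sketch in Python: least-squares Bellman-residual update for a linear value function]
import numpy as np

def evaluate_policy_lstsq(states, f, g, policy, basis_grad, q, R):
    """One policy-evaluation step for V(x) = w^T phi(x): choose w to minimize the
    differential Bellman residual  (dphi/dx w)^T (f(x) + g(x) u) + q(x) + u^T R u
    over the collected states, via linear least squares."""
    A, b = [], []
    for x in states:
        u = policy(x)                       # current stabilizing (Sontag) policy
        xdot = f(x) + g(x) @ u
        A.append(basis_grad(x) @ xdot)      # row of the regression matrix, shape (p,)
        b.append(-(q(x) + u @ R @ u))       # target: minus the stage cost
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w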
[Second Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000021
The second approximate policy iteration algorithm approximates the value function with a deep artificial neural network and consists of the following three main elements.
1) In the policy evaluation part, the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000022) are collected and the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000023) are updated in the direction that reduces the Bellman error; a sketch of this residual is given after this list. If the updated value function does not satisfy the control Lyapunov function condition, the weight update is repeated until the condition is satisfied.
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
3) When a deep artificial neural network is used, the algorithm learns a function with the same level sets as the optimal value function; since a function with the same level sets as the optimal value function yields the optimal control when used in Sontag's formula, the resulting controller approximates the optimal controller.
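As an illustration only, a value network with two tanh hidden layers of 9 and 10 nodes (the sizes mentioned in Application Example 2) and the corresponding differential Bellman residual could be sketched in Python/PyTorch as below. The stage cost q(x) + u^T R u, the batch shapes, and all names are assumptions, and the step that re-checks the control Lyapunov function condition after each weight update is omitted.
[Illustrative sketch in Python: deep value network and Bellman residual]
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small value network V_theta(x) with tanh hidden layers."""
    def __init__(self, n_state, hidden=(9, 10)):
        super().__init__()
        layers, d = [], n_state
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        layers += [nn.Linear(d, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def bellman_residual(V, x, f, g, policy, q, R):
    """Differential Bellman residual  dV/dx . (f(x) + g(x) u) + q(x) + u^T R u
    evaluated on a batch of collected states x with shape (B, n)."""
    x = x.detach().requires_grad_(True)
    v = V(x)
    dVdx = torch.autograd.grad(v.sum(), x, create_graph=True)[0]   # (B, n)
    u = policy(x)                                                  # (B, m)
    xdot = f(x) + torch.einsum('bnm,bm->bn', g(x), u)              # (B, n)
    stage = q(x) + torch.einsum('bm,mk,bk->b', u, R, u)            # (B,)
    return torch.einsum('bn,bn->b', dVdx, xdot) + stage

# A weight update is then a gradient step on bellman_residual(...).pow(2).mean().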
[Application Example 1]
Application Example 1 applies the approximate policy iteration algorithm that uses exact basis functions as the approximation function. The optimal value function (Figure PCTKR2020013781-appb-I000024) is expressed as Figure PCTKR2020013781-appb-I000026 using the basis functions Figure PCTKR2020013781-appb-I000025.
Figure PCTKR2020013781-appb-I000027
In the above, Figure PCTKR2020013781-appb-I000028 holds. The optimal value function is Figure PCTKR2020013781-appb-I000029 and the optimal input is Figure PCTKR2020013781-appb-I000030. The level sets are observed over the domain D = [-2,2]×[-2,2], and the basis functions are Figure PCTKR2020013781-appb-I000031. For value-function training, the initial state was randomly sampled from the domain D and the learning rate was set to a_lr = 0.03. The total time per episode is 10 dimensionless time units at 0.01-step intervals (M_s = 1000). The initial weights of the approximation function were set differently in Cases 1 and 2; 100 episodes (M_e = 100) were trained for Case 1 and 150 episodes (M_e = 150) for Case 2.
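As an illustration of the training setup described above (not part of the patent text), the episode loop with the stated hyperparameters could be organized as follows; the dynamics integrator, the policy, and the weight-update routine are left as assumed callables.
[Illustrative sketch in Python: episode loop with the hyperparameters of Application Example 1]
import numpy as np

dt, M_s = 0.01, 1000        # 0.01 step interval, 10 dimensionless time units per episode
M_e, a_lr = 100, 0.03       # number of episodes (Case 1; 150 for Case 2) and learning rate
rng = np.random.default_rng(0)

def run_training(step_dynamics, policy, update_weights, w0):
    """Roll out the current stabilizing policy from a random initial state in
    D = [-2, 2] x [-2, 2], collect the visited states, and update the
    value-function weights after every episode."""
    w = w0
    for _ in range(M_e):
        x = rng.uniform(-2.0, 2.0, size=2)            # random initial state in D
        states = []
        for _ in range(M_s):
            states.append(x.copy())
            x = step_dynamics(x, policy(x, w), dt)    # e.g. one explicit Euler step
        w = update_weights(w, states, a_lr)
    return w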
FIGS. 1 and 2 show the results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention. FIG. 1 shows the result of Case 1, for which the initial weights are Figure PCTKR2020013781-appb-I000032, and FIG. 2 shows the result of Case 2, for which the initial weights are Figure PCTKR2020013781-appb-I000033.
[Application Example 2]
Application Example 2 applies the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function; the problem to which the algorithm was applied is the following.
Figure PCTKR2020013781-appb-I000034
In the above, Figure PCTKR2020013781-appb-I000035 holds.
1) Training results
A Lyapunov artificial neural network with two layers of 9 and 10 nodes was constructed, using the hyperbolic tangent activation function. Information on the training episodes and hyper-parameters is given in Table 1 below.
[Table 1]
Figure PCTKR2020013781-appb-I000036
The states and input values during training are shown in FIGS. 3 and 4.
2) Test results
Based on the previously trained deep-neural-network approximate value function (Figure PCTKR2020013781-appb-I000037), a test was run for 100 episodes using the Sontag-formula control input (Figure PCTKR2020013781-appb-I000038), with the initial state randomly sampled from the domain D. The cost of test episode i, starting from the initial state x_i and evolving under the controller u(x), is denoted Figure PCTKR2020013781-appb-I000039.
Figure PCTKR2020013781-appb-I000040
Under the controllers Figure PCTKR2020013781-appb-I000041 and Figure PCTKR2020013781-appb-I000042, the state is stabilized to the origin within 5 dimensionless time units (50 steps), so J_i can be regarded as an infinite-horizon cost.
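As an illustration only (the exact cost definition used in the test is the one shown in Figure PCTKR2020013781-appb-I000039 and Figure PCTKR2020013781-appb-I000040, which is not reproduced here), an episode cost of this kind is typically accumulated as a Riemann sum over the rollout; the quadratic stage cost below is an assumption.
[Illustrative sketch in Python: Riemann-sum approximation of an episode cost]
import numpy as np

def episode_cost(xs, us, dt, Q, R):
    """Discrete approximation of J = integral of (x^T Q x + u^T R u) dt
    along one test episode, given the visited states xs and inputs us."""
    return dt * sum(x @ Q @ x + u @ R @ u for x, u in zip(xs, us))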
FIG. 5 shows the result of applying the algorithm to 100 test episodes, and the resulting values are listed in Table 2. Referring to FIG. 5 and Table 2, it can be confirmed that the controller Figure PCTKR2020013781-appb-I000043 performs nearly as well as the optimal controller.
[Table 2]
Figure PCTKR2020013781-appb-I000044
Specific embodiments of the present invention have been described above. Those of ordinary skill in the art will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope equivalent thereto should be construed as being included in the present invention.
The stabilized nonlinear optimal control method according to embodiments of the present invention guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula. The policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability. By exploiting the relationship between stabilizing controllers and the optimal controller, the policy iteration algorithm guarantees stability while learning the optimal controller.

Claims (7)

  1. A stabilized nonlinear optimal control method comprising the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  2. The method of claim 1, wherein the policy iteration algorithm is the following exact policy iteration algorithm.
    [Exact Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000045
  3. The method of claim 2, wherein, in the exact policy iteration algorithm,
    the policy evaluation part solves the Lyapunov equation to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000047) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000046), and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
  4. The method of claim 1, wherein the policy iteration algorithm is the following first approximate policy iteration algorithm.
    [First Approximate Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000048
  5. The method of claim 4, wherein, in the first approximate policy iteration algorithm,
    the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000049) and updates the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000050) in the direction that reduces the Bellman error, and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
  6. The method of claim 1, wherein the policy iteration algorithm is the following second approximate policy iteration algorithm.
    [Second Approximate Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000051
  7. The method of claim 6, wherein, in the second approximate policy iteration algorithm,
    the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000052) and updates the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000053) in the direction that reduces the Bellman error, and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
PCT/KR2020/013781 2019-10-11 2020-10-08 Stabilized nonlinear optimal control method WO2021071304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190125773A KR102231799B1 (en) 2019-10-11 2019-10-11 Stabilized method for nonlinear optimal control
KR10-2019-0125773 2019-10-11

Publications (1)

Publication Number Publication Date
WO2021071304A1 (en) 2021-04-15

Family

ID=75223651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/013781 WO2021071304A1 (en) 2019-10-11 2020-10-08 Stabilized nonlinear optimal control method

Country Status (2)

Country Link
KR (1) KR102231799B1 (en)
WO (1) WO2021071304A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154963A1 (en) 2022-02-14 2023-08-17 Banner Engineering Corp. Selectable-signal three-dimensional fill monitoring sensor

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102560482B1 (en) * 2021-11-17 2023-07-26 광운대학교 산학협력단 Method for nonlinear optimal control

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095621A1 (en) * 2009-03-26 2012-04-19 Ohio University Trajectory tracking flight controller
US20160147203A1 (en) * 2014-11-25 2016-05-26 Mitsubishi Electric Research Laboratories, Inc. Model Predictive Control with Uncertainties
US20190026644A1 (en) * 2015-08-14 2019-01-24 King Abdullah University Of Science And Technology Robust lyapunov controller for uncertain systems
JP2018084899A (en) * 2016-11-22 2018-05-31 学校法人立命館 Autonomous travel vehicle, controller, computer program, control method of autonomous travel vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PRIMBS JAMES A., NEVISTIĆ VESNA, DOYLE JOHN C.: "NONLINEAR OPTIMAL CONTROL: A CONTROL LYAPUNOV FUNCTION AND RECEDING HORIZON PERSPECTIVE", ASIAN JOURNAL OF CONTROL, CHINESE AUTOMATIC CONTROL SOCIETY, vol. 1, no. 1, 1 March 1999 (1999-03-01), pages 14 - 24, XP055802392, ISSN: 1561-8625, DOI: 10.1111/j.1934-6093.1999.tb00002.x *


Also Published As

Publication number Publication date
KR102231799B1 (en) 2021-03-23

Similar Documents

Publication Publication Date Title
WO2021071304A1 (en) Stabilized nonlinear optimal control method
Zhao et al. Control of nonlinear systems under dynamic constraints: A unified barrier function-based approach
WO2017209548A1 (en) Device and method for generating artificial neural network-based prediction model
WO2019194465A1 (en) Neural network processor
WO2020230977A1 (en) Metacognition-based high-speed environmental search method and device
WO2020159016A1 (en) Method for optimizing neural network parameter appropriate for hardware implementation, neural network operation method, and apparatus therefor
WO2021157863A1 (en) Autoencoder-based graph construction for semi-supervised learning
WO2023105392A1 (en) Method for generating artificial intelligence model for process control, process control system based on artificial intelligence model, and reactor comprising same
WO2022108287A1 (en) System comprising robust optimal disturbance observer for high-precision position control performed by electronic device, and control method therefor
WO2022164299A1 (en) Framework for causal learning of neural networks
WO2024111866A1 (en) Reinforcement learning system for self-development
WO2022114368A1 (en) Method and device for completing knowledge through neuro-symbolic-based relation embedding
CN109634118A (en) A kind of robust adaptive nerve method for handover control of mechanical arm
WO2019151606A1 (en) Optimization calculation device and method
WO2022163996A1 (en) Device for predicting drug-target interaction by using self-attention-based deep neural network model, and method therefor
WO2019198900A1 (en) Electronic apparatus and control method thereof
WO2022191513A1 (en) Data augmentation-based knowledge tracking model training device and system, and operation method thereof
WO2018131749A1 (en) Deep learning-based self-adaptive learning engine module
WO2022107955A1 (en) Semantic role labeling-based method and apparatus for neural network calculation
Sepulchre et al. Interlaced systems and recursive designs for global stabilization
WO2023090749A1 (en) Nonlinear optimal control method
WO2015141981A1 (en) Harmful material oxidizing apparatus for removing harmful materials from ship by using catalyst
WO2021080151A1 (en) Method and system for optimizing reinforcement-learning-based autonomous driving according to user preferences
WO2011087308A2 (en) Apparatus and method for controlling a crane
WO2021002523A1 (en) Neuromorphic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20875202

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20875202

Country of ref document: EP

Kind code of ref document: A1