WO2021071304A1 - Stabilized nonlinear optimal control method - Google Patents

Stabilized nonlinear optimal control method Download PDF

Info

Publication number
WO2021071304A1
WO2021071304A1 (PCT/KR2020/013781)
Authority
WO
WIPO (PCT)
Prior art keywords
policy
algorithm
stabilized
control method
optimal control
Prior art date
Application number
PCT/KR2020/013781
Other languages
French (fr)
Korean (ko)
Inventor
이종민
임산하
김연수
이병준
배신영
Original Assignee
서울대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교산학협력단
Publication of WO2021071304A1

Links

Images

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04 Adaptive control systems, electric, involving the use of models or simulators
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 17/00 Systems involving the use of models or simulators of said systems
    • G05B 17/02 Systems involving the use of models or simulators of said systems, electric
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems

Definitions

  • the present invention relates to a stabilized nonlinear optimal control method.
  • the present invention provides a nonlinear optimal control method that ensures stability.
  • the stabilized nonlinear optimal control method includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  • the policy iteration algorithm may be the exact policy iteration algorithm described below.
  • in the exact policy iteration algorithm, the policy evaluation part solves the Lyapunov equation to compute a control Lyapunov function that evaluates the cost incurred under the current stabilizing control input, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the policy iteration algorithm may be the first approximate policy iteration algorithm described below.
  • in the first approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input and updates the weights of a value function approximated by a linear artificial neural network in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the policy iteration algorithm may be the second approximate policy iteration algorithm described below.
  • in the second approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input and updates the weights of a value function approximated by a deep neural network in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed during and after learning.
  • the stabilized nonlinear optimal control method guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula.
  • the policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability.
  • the policy iteration algorithm exploits the relationship between stabilizing controllers and the optimal controller to guarantee stability while learning the optimal controller.
  • FIGS. 1 and 2 show results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • FIGS. 3 and 4 show results of applying the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • the stabilized nonlinear optimal control method includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  • a control Lyapunov function is a continuously differentiable function V_c(x) that satisfies a negativity condition for a control-affine system, as stated in the description.
  • if V_c has the same level sets as the optimal value function, the Sontag controller coincides with the optimal controller; based on this, in the policy iteration algorithm the policy evaluation part restricts the approximate value function to control Lyapunov functions while updating the weights of the approximation in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula so that stability is guaranteed without introducing an additional actor network or actor update.
  • the exact policy iteration algorithm consists of the following two main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network.
  • the controller obtained by applying Sontag's formula always guarantees stability.
  • by contrast, stabilizing the system with the conventional LgV-type optimal formula requires the value function to satisfy additional conditions beyond the control Lyapunov function condition, so Sontag's formula is preferable for ease of guaranteeing stability.
  • the stabilized nonlinear optimal control method according to embodiments of the present invention may use approximate policy iteration algorithms that guarantee stability for two classes of approximate value functions.
  • the first approximate policy iteration algorithm linearly approximates the value function using exact basis functions and consists of the following three main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
  • the second approximate policy iteration algorithm approximates the value function with a deep artificial neural network and consists of the following three main elements.
  • Sontag's formula is used in the policy update to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
  • Application Example 1 applies the approximate policy iteration algorithm that uses exact basis functions as the approximation function.
  • the optimal value function is expressed as a linear combination of the chosen basis functions.
  • the optimal value function and the optimal input are given by known closed-form expressions for this example.
  • FIGS. 1 and 2 show results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
  • FIG. 1 shows the result of Case 1 and FIG. 2 shows the result of Case 2; the two cases differ in the initial weights of the approximation function.
  • based on the previously trained deep-neural-network approximate value function, the Sontag-formula control input was tested over 100 episodes, with the initial state randomly sampled from the domain D.
  • the cost of test episode i, starting from the initial state x_i and evolving under the controller u(x), is denoted J_i.
  • J_i can be regarded as an infinite-horizon cost because the state is stabilized to the origin within 5 dimensionless time units (50 steps).
  • the stabilized nonlinear optimal control method guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula.
  • the policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability.
  • the policy iteration algorithm exploits the relationship between stabilizing controllers and the optimal controller to guarantee stability while learning the optimal controller.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Feedback Control In General (AREA)

Abstract

A stabilized nonlinear optimal control method is provided. The stabilized nonlinear optimal control method comprises the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.

Description

Stabilized nonlinear optimal control method
The present invention relates to a stabilized nonlinear optimal control method.
Recently, research on reinforcement learning, which learns optimal policies based on artificial intelligence techniques, has been actively conducted in the field of computer science. In game domains such as AlphaGo, where these algorithms are widely used, there is little concern about stability, so algorithm development has focused mainly on optimality. For real systems such as chemical plants or robots, however, stability must be guaranteed before optimality. Existing studies have tried to guarantee stability by introducing an additional actor network in addition to the critic network. Most of these algorithms, however, only design actor-network update rules for single-layer neural networks and are difficult to apply to real systems.
In order to solve the above problems, the present invention provides a nonlinear optimal control method that guarantees stability.
Other objects of the present invention will become apparent from the following detailed description and the accompanying drawings.
A stabilized nonlinear optimal control method according to embodiments of the present invention includes the step of performing a policy iteration algorithm (Policy Iteration Algorithm) using a control Lyapunov function (CLF) and Sontag's formula.
The policy iteration algorithm may be the following exact policy iteration algorithm.
[Exact Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000001
In the exact policy iteration algorithm, the policy evaluation part solves the Lyapunov equation to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000003) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000002), and the policy update part uses Sontag's formula to guarantee stability during and after learning.
The policy iteration algorithm may be the following first approximate policy iteration algorithm.
[First Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000004
In the first approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000005) and updates the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000006) in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula to guarantee stability during and after learning.
The policy iteration algorithm may be the following second approximate policy iteration algorithm.
[Second Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000007
In the second approximate policy iteration algorithm, the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000008) and updates the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000009) in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula to guarantee stability during and after learning.
A stabilized nonlinear optimal control method according to embodiments of the present invention guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula. The policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability. By exploiting the relationship between stabilizing controllers and the optimal controller, the policy iteration algorithm guarantees stability while learning the optimal controller.
FIGS. 1 and 2 show the results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
FIGS. 3 and 4 show the results of applying the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention.
FIG. 5 shows the results of applying the algorithm to 100 test episodes.
Hereinafter, the present invention will be described in detail through embodiments. The objects, features, and advantages of the present invention will be readily understood from the following embodiments. The present invention is not limited to the embodiments described herein and may be embodied in other forms. The embodiments introduced here are provided so that the disclosure will be thorough and complete and will fully convey the spirit of the invention to those of ordinary skill in the art. Therefore, the present invention should not be limited by the following embodiments.
A stabilized nonlinear optimal control method according to embodiments of the present invention includes the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
[Control Lyapunov Function and Sontag's Formula]
A control Lyapunov function is a continuously differentiable function V_c(x) that, for a control-affine system, satisfies the following condition:
L_f V_c(x) < 0 for all x such that L_g V_c(x) = 0.
When a control Lyapunov function is known, the control input u can be designed with the following modified Sontag's formula (Figure PCTKR2020013781-appb-I000010):
Figure PCTKR2020013781-appb-I000011
Such a controller stabilizes the system. If V_c has the same level sets as the optimal value function (Figure PCTKR2020013781-appb-I000012), the Sontag controller (Figure PCTKR2020013781-appb-I000013) coincides with the optimal controller (Figure PCTKR2020013781-appb-I000014). Based on this, when carrying out the policy iteration algorithm, the policy evaluation part restricts the approximate value function to control Lyapunov functions while updating the weights of the approximation in the direction that reduces the Bellman error, and the policy update part uses Sontag's formula, so that stability is guaranteed without introducing an additional actor network or actor update.
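For reference only, since the equation images (Figure PCTKR2020013781-appb-I000010 and Figure PCTKR2020013781-appb-I000011) are not reproduced in this text: the standard, unmodified form of Sontag's universal formula for a control-affine system dx/dt = f(x) + g(x)u with control Lyapunov function V_c is the expression below, written in LaTeX. The invention uses a modified variant of it, whose exact form is the one shown in the figure.
[Reference: standard Sontag's formula (not the modified form used in the invention)]
% a(x) = L_f V_c(x),  b(x) = (L_g V_c(x))^T
u_s(x) =
\begin{cases}
-\dfrac{a(x) + \sqrt{a(x)^{2} + \lVert b(x)\rVert^{4}}}{\lVert b(x)\rVert^{2}}\, b(x), & b(x) \neq 0, \\
0, & b(x) = 0.
\end{cases}
With this choice, whenever b(x) is nonzero the derivative of V_c along the closed loop equals -sqrt(a(x)^2 + ||b(x)||^4) < 0, and when b(x) = 0 the CLF condition itself gives a(x) < 0, which is why such a controller stabilizes the system.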
[Exact Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000015
The exact policy iteration algorithm consists of the following two main elements.
1) In the policy evaluation part, the Lyapunov equation is solved to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000017) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000016).
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network.
When the value function is restricted to control Lyapunov functions, the controller obtained by applying Sontag's formula always guarantees stability. By contrast, stabilizing the controller with the conventional LgV-type optimal formula requires the value function not only to satisfy the control Lyapunov function condition but also to meet additional conditions. Sontag's formula is therefore preferable for ease of guaranteeing stability.
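As a concrete illustration (not part of the patent text), a minimal Python sketch of the policy-update step, computing the control input from a candidate control Lyapunov function via the standard Sontag formula, could look as follows. The example system, the quadratic CLF candidate, and all function names are assumptions made for illustration only; the invention itself uses a modified Sontag formula whose exact form is given in the equation figures.
[Illustrative sketch in Python, assuming the standard Sontag formula]
import numpy as np

def sontag_control(x, f, g, grad_Vc, eps=1e-8):
    """Standard Sontag-formula controller for dx/dt = f(x) + g(x) u,
    given the gradient of a control Lyapunov function V_c."""
    dV = grad_Vc(x)            # gradient of V_c at x, shape (n,)
    a = dV @ f(x)              # L_f V_c(x), scalar
    b = g(x).T @ dV            # (L_g V_c(x))^T, shape (m,)
    b2 = b @ b                 # ||b||^2
    if b2 < eps:               # L_g V_c(x) = 0: the CLF condition already gives a < 0, so u = 0 suffices
        return np.zeros_like(b)
    return -(a + np.sqrt(a**2 + b2**2)) / b2 * b

# Toy example: scalar system dx/dt = x^3 + u with quadratic CLF candidate V_c(x) = 0.5 x^2
f = lambda x: np.array([x[0]**3])
g = lambda x: np.array([[1.0]])
grad_Vc = lambda x: np.array([x[0]])
u = sontag_control(np.array([1.5]), f, g, grad_Vc)   # a stabilizing input at x = 1.5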
A stabilized nonlinear optimal control method according to embodiments of the present invention may use approximate policy iteration algorithms that guarantee stability for two classes of approximate value functions.
[First Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000018
The first approximate policy iteration algorithm linearly approximates the value function using exact basis functions and consists of the following three main elements.
1) In the policy evaluation part, the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000019) are collected and the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000020) are updated in the direction that reduces the Bellman error; a numerical sketch of this evaluation step is given after this list. If the updated value function does not satisfy the control Lyapunov function condition, the weight update is repeated until the condition is satisfied.
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
3) The value function linearly approximated with the exact basis functions converges to the optimal value function when the algorithm is applied.
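As an illustration only, the policy-evaluation step for a linearly parameterized value function V(x) = w^T phi(x) can be written as a linear least-squares problem on the differential Bellman residual over the collected states. The quadratic stage cost q(x) + u^T R u and all names below are assumptions made for the sketch, and the CLF-condition re-check performed after each update is omitted here.
[Illustrative sketch in Python: least-squares Bellman-residual update for a linear value function]
import numpy as np

def evaluate_policy_lstsq(states, f, g, policy, basis_grad, q, R):
    """One policy-evaluation step for V(x) = w^T phi(x): choose w to minimize the
    differential Bellman residual  (dphi/dx w)^T (f(x) + g(x) u) + q(x) + u^T R u
    over the collected states, via linear least squares."""
    A, b = [], []
    for x in states:
        u = policy(x)                       # current stabilizing (Sontag) policy
        xdot = f(x) + g(x) @ u
        A.append(basis_grad(x) @ xdot)      # row of the regression matrix, shape (p,)
        b.append(-(q(x) + u @ R @ u))       # target: minus the stage cost
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w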
[Second Approximate Policy Iteration Algorithm]
Figure PCTKR2020013781-appb-I000021
The second approximate policy iteration algorithm approximates the value function with a deep artificial neural network and consists of the following three main elements.
1) In the policy evaluation part, the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000022) are collected and the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000023) are updated in the direction that reduces the Bellman error; a sketch of this residual is given after this list. If the updated value function does not satisfy the control Lyapunov function condition, the weight update is repeated until the condition is satisfied.
2) In the policy update part, Sontag's formula is used to guarantee stability during and after learning without introducing an additional actor network or any non-standard update rule.
3) When a deep artificial neural network is used, the algorithm learns a function with the same level sets as the optimal value function; since a function with the same level sets as the optimal value function yields the optimal control when used in Sontag's formula, the resulting controller approximates the optimal controller.
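As an illustration only, a value network with two tanh hidden layers of 9 and 10 nodes (the sizes mentioned in Application Example 2) and the corresponding differential Bellman residual could be sketched in Python/PyTorch as below. The stage cost q(x) + u^T R u, the batch shapes, and all names are assumptions, and the step that re-checks the control Lyapunov function condition after each weight update is omitted.
[Illustrative sketch in Python: deep value network and Bellman residual]
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small value network V_theta(x) with tanh hidden layers."""
    def __init__(self, n_state, hidden=(9, 10)):
        super().__init__()
        layers, d = [], n_state
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        layers += [nn.Linear(d, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def bellman_residual(V, x, f, g, policy, q, R):
    """Differential Bellman residual  dV/dx . (f(x) + g(x) u) + q(x) + u^T R u
    evaluated on a batch of collected states x with shape (B, n)."""
    x = x.detach().requires_grad_(True)
    v = V(x)
    dVdx = torch.autograd.grad(v.sum(), x, create_graph=True)[0]   # (B, n)
    u = policy(x)                                                  # (B, m)
    xdot = f(x) + torch.einsum('bnm,bm->bn', g(x), u)              # (B, n)
    stage = q(x) + torch.einsum('bm,mk,bk->b', u, R, u)            # (B,)
    return torch.einsum('bn,bn->b', dVdx, xdot) + stage

# A weight update is then a gradient step on bellman_residual(...).pow(2).mean().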
[Application Example 1]
Application Example 1 applies the approximate policy iteration algorithm that uses exact basis functions as the approximation function. The optimal value function (Figure PCTKR2020013781-appb-I000024) is expressed as Figure PCTKR2020013781-appb-I000026 using the basis functions Figure PCTKR2020013781-appb-I000025.
Figure PCTKR2020013781-appb-I000027
In the above, Figure PCTKR2020013781-appb-I000028 holds. The optimal value function is Figure PCTKR2020013781-appb-I000029 and the optimal input is Figure PCTKR2020013781-appb-I000030. The level sets are observed over the domain D = [-2,2]×[-2,2], and the basis functions are Figure PCTKR2020013781-appb-I000031. For value-function training, the initial state was randomly sampled from the domain D and the learning rate was set to a_lr = 0.03. The total time per episode is 10 dimensionless time units at 0.01-step intervals (M_s = 1000). The initial weights of the approximation function were set differently in Cases 1 and 2; 100 episodes (M_e = 100) were trained for Case 1 and 150 episodes (M_e = 150) for Case 2.
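As an illustration of the training setup described above (not part of the patent text), the episode loop with the stated hyperparameters could be organized as follows; the dynamics integrator, the policy, and the weight-update routine are left as assumed callables.
[Illustrative sketch in Python: episode loop with the hyperparameters of Application Example 1]
import numpy as np

dt, M_s = 0.01, 1000        # 0.01 step interval, 10 dimensionless time units per episode
M_e, a_lr = 100, 0.03       # number of episodes (Case 1; 150 for Case 2) and learning rate
rng = np.random.default_rng(0)

def run_training(step_dynamics, policy, update_weights, w0):
    """Roll out the current stabilizing policy from a random initial state in
    D = [-2, 2] x [-2, 2], collect the visited states, and update the
    value-function weights after every episode."""
    w = w0
    for _ in range(M_e):
        x = rng.uniform(-2.0, 2.0, size=2)            # random initial state in D
        states = []
        for _ in range(M_s):
            states.append(x.copy())
            x = step_dynamics(x, policy(x, w), dt)    # e.g. one explicit Euler step
        w = update_weights(w, states, a_lr)
    return w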
FIGS. 1 and 2 show the results of applying the approximate policy iteration algorithm that uses exact basis functions as the approximation function, according to a stabilized nonlinear optimal control method of an embodiment of the present invention. FIG. 1 shows the result of Case 1, for which the initial weights are Figure PCTKR2020013781-appb-I000032, and FIG. 2 shows the result of Case 2, for which the initial weights are Figure PCTKR2020013781-appb-I000033.
[Application Example 2]
Application Example 2 applies the approximate policy iteration algorithm that uses a deep artificial neural network as the approximation function; the problem to which the algorithm was applied is the following.
Figure PCTKR2020013781-appb-I000034
In the above, Figure PCTKR2020013781-appb-I000035 holds.
1) Training results
A Lyapunov artificial neural network with two layers of 9 and 10 nodes was constructed, using the hyperbolic tangent activation function. Information on the training episodes and hyper-parameters is given in Table 1 below.
[Table 1]
Figure PCTKR2020013781-appb-I000036
The states and input values during training are shown in FIGS. 3 and 4.
2) Test results
Based on the previously trained deep-neural-network approximate value function (Figure PCTKR2020013781-appb-I000037), a test was run for 100 episodes using the Sontag-formula control input (Figure PCTKR2020013781-appb-I000038), with the initial state randomly sampled from the domain D. The cost of test episode i, starting from the initial state x_i and evolving under the controller u(x), is denoted Figure PCTKR2020013781-appb-I000039.
Figure PCTKR2020013781-appb-I000040
Under the controllers Figure PCTKR2020013781-appb-I000041 and Figure PCTKR2020013781-appb-I000042, the state is stabilized to the origin within 5 dimensionless time units (50 steps), so J_i can be regarded as an infinite-horizon cost.
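As an illustration only (the exact cost definition used in the test is the one shown in Figure PCTKR2020013781-appb-I000039 and Figure PCTKR2020013781-appb-I000040, which is not reproduced here), an episode cost of this kind is typically accumulated as a Riemann sum over the rollout; the quadratic stage cost below is an assumption.
[Illustrative sketch in Python: Riemann-sum approximation of an episode cost]
import numpy as np

def episode_cost(xs, us, dt, Q, R):
    """Discrete approximation of J = integral of (x^T Q x + u^T R u) dt
    along one test episode, given the visited states xs and inputs us."""
    return dt * sum(x @ Q @ x + u @ R @ u for x, u in zip(xs, us))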
FIG. 5 shows the result of applying the algorithm to 100 test episodes, and the resulting values are listed in Table 2. Referring to FIG. 5 and Table 2, it can be confirmed that the controller Figure PCTKR2020013781-appb-I000043 performs nearly as well as the optimal controller.
[Table 2]
Figure PCTKR2020013781-appb-I000044
Specific embodiments of the present invention have been described above. Those of ordinary skill in the art will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope equivalent thereto should be construed as being included in the present invention.
The stabilized nonlinear optimal control method according to embodiments of the present invention guarantees stability by using a policy iteration algorithm based on a control Lyapunov function and Sontag's formula. The policy iteration algorithm makes it possible to apply artificial intelligence techniques, developed mainly in computer science, to real systems that require stability. By exploiting the relationship between stabilizing controllers and the optimal controller, the policy iteration algorithm guarantees stability while learning the optimal controller.

Claims (7)

  1. A stabilized nonlinear optimal control method comprising the step of performing a policy iteration algorithm using a control Lyapunov function (CLF) and Sontag's formula.
  2. The method of claim 1, wherein the policy iteration algorithm is the following exact policy iteration algorithm.
    [Exact Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000045
  3. The method of claim 2, wherein, in the exact policy iteration algorithm,
    the policy evaluation part solves the Lyapunov equation to compute the control Lyapunov function (Figure PCTKR2020013781-appb-I000047) that evaluates the cost incurred under the current stabilizing control input (Figure PCTKR2020013781-appb-I000046), and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
  4. The method of claim 1, wherein the policy iteration algorithm is the following first approximate policy iteration algorithm.
    [First Approximate Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000048
  5. The method of claim 4, wherein, in the first approximate policy iteration algorithm,
    the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000049) and updates the weights of the value function approximated by a linear artificial neural network (Figure PCTKR2020013781-appb-I000050) in the direction that reduces the Bellman error, and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
  6. The method of claim 1, wherein the policy iteration algorithm is the following second approximate policy iteration algorithm.
    [Second Approximate Policy Iteration Algorithm]
    Figure PCTKR2020013781-appb-I000051
  7. The method of claim 6, wherein, in the second approximate policy iteration algorithm,
    the policy evaluation part collects the states generated under the stabilizing control input (Figure PCTKR2020013781-appb-I000052) and updates the weights of the value function approximated by a deep neural network (Figure PCTKR2020013781-appb-I000053) in the direction that reduces the Bellman error, and
    the policy update part uses Sontag's formula to guarantee stability during and after learning.
PCT/KR2020/013781 2019-10-11 2020-10-08 Stabilized nonlinear optimal control method WO2021071304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190125773A KR102231799B1 (en) 2019-10-11 2019-10-11 Stabilized method for nonlinear optimal control
KR10-2019-0125773 2019-10-11

Publications (1)

Publication Number Publication Date
WO2021071304A1 (en) 2021-04-15

Family

ID=75223651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/013781 WO2021071304A1 (en) 2019-10-11 2020-10-08 Stabilized nonlinear optimal control method

Country Status (2)

Country Link
KR (1) KR102231799B1 (en)
WO (1) WO2021071304A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154963A1 (en) 2022-02-14 2023-08-17 Banner Engineering Corp. Selectable-signal three-dimensional fill monitoring sensor

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102560482B1 (en) * 2021-11-17 2023-07-26 광운대학교 산학협력단 Method for nonlinear optimal control

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095621A1 (en) * 2009-03-26 2012-04-19 Ohio University Trajectory tracking flight controller
US20160147203A1 (en) * 2014-11-25 2016-05-26 Mitsubishi Electric Research Laboratories, Inc. Model Predictive Control with Uncertainties
US20190026644A1 (en) * 2015-08-14 2019-01-24 King Abdullah University Of Science And Technology Robust lyapunov controller for uncertain systems
JP2018084899A (en) * 2016-11-22 2018-05-31 学校法人立命館 Autonomous travel vehicle, controller, computer program, control method of autonomous travel vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PRIMBS JAMES A., NEVISTIĆ VESNA, DOYLE JOHN C.: "NONLINEAR OPTIMAL CONTROL: A CONTROL LYAPUNOV FUNCTION AND RECEDING HORIZON PERSPECTIVE", ASIAN JOURNAL OF CONTROL, CHINESE AUTOMATIC CONTROL SOCIETY, vol. 1, no. 1, 1 March 1999 (1999-03-01), pages 14 - 24, XP055802392, ISSN: 1561-8625, DOI: 10.1111/j.1934-6093.1999.tb00002.x *


Also Published As

Publication number Publication date
KR102231799B1 (en) 2021-03-23

Similar Documents

Publication Publication Date Title
WO2021071304A1 (en) Stabilized nonlinear optimal control method
Zhao et al. Control of nonlinear systems under dynamic constraints: A unified barrier function-based approach
WO2017209548A1 (en) Device and method for generating artificial neural network-based prediction model
WO2019194465A1 (en) Neural network processor
WO2020230977A1 (en) Metacognition-based high-speed environmental search method and device
WO2020159016A1 (en) Method for optimizing neural network parameter appropriate for hardware implementation, neural network operation method, and apparatus therefor
WO2021157863A1 (en) Autoencoder-based graph construction for semi-supervised learning
WO2023105392A1 (en) Method for generating artificial intelligence model for process control, process control system based on artificial intelligence model, and reactor comprising same
WO2022108287A1 (en) System comprising robust optimal disturbance observer for high-precision position control performed by electronic device, and control method therefor
WO2022164299A1 (en) Framework for causal learning of neural networks
WO2024111866A1 (en) Reinforcement learning system for self-development
WO2022114368A1 (en) Method and device for completing knowledge through neuro-symbolic-based relation embedding
CN109634118A (en) A kind of robust adaptive nerve method for handover control of mechanical arm
WO2019151606A1 (en) Optimization calculation device and method
WO2022163996A1 (en) Device for predicting drug-target interaction by using self-attention-based deep neural network model, and method therefor
WO2019198900A1 (en) Electronic apparatus and control method thereof
WO2022191513A1 (en) Data augmentation-based knowledge tracking model training device and system, and operation method thereof
WO2018131749A1 (en) Deep learning-based self-adaptive learning engine module
WO2022107955A1 (en) Semantic role labeling-based method and apparatus for neural network calculation
Sepulchre et al. Interlaced systems and recursive designs for global stabilization
WO2023090749A1 (en) Nonlinear optimal control method
WO2015141981A1 (en) Harmful material oxidizing apparatus for removing harmful materials from ship by using catalyst
WO2021080151A1 (en) Method and system for optimizing reinforcement-learning-based autonomous driving according to user preferences
WO2011087308A2 (en) Apparatus and method for controlling a crane
WO2021002523A1 (en) Neuromorphic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20875202

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20875202

Country of ref document: EP

Kind code of ref document: A1