KR20230070779A

KR20230070779A - Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning

Info

Publication number: KR20230070779A
Application number: KR1020210156719A
Authority: KR
Inventors: 홍승호; 장시옹펑
Original assignee: 네스트필드(주)
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2023-05-23
Also published as: WO2023085560A1

Abstract

Demand response (DR) has been recognized as an effective method of improving the stability and financial efficiency of a power grid. DR realization for the industrial sector as a primary consumer is urgently needed. The present invention provides a novel industrial price-based DR management approach for an industrial discrete manufacturing system by simultaneously considering energy costs and daily production goals. To achieve this, the discrete manufacturing system is modeled as a constrained Markov decision process (CMDP), and a constrained reinforcement learning (CRL) algorithm is adopted to determine an efficient operation strategy for the discrete manufacturing system. A simulation was performed by using a real lithium-ion battery assembly system to verify the performance of the method in accordance with the present invention. The evaluation results showed that the method in accordance with the present invention can optimize energy costs without violating production goals.

Description

Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning

The present invention relates to a demand response management method for a discrete industrial manufacturing system. More specifically, it relates to a demand response management method that applies constrained reinforcement learning: the discrete industrial manufacturing system is modeled as a constrained Markov decision process (CMDP), and a constrained reinforcement learning (CRL) algorithm is adopted to determine a cost-effective operating strategy for the system.

Demand response (DR) refers to changes in customers' energy demand in response to time-varying prices. A well-designed DR program can promote economic efficiency, operational flexibility, and system reliability in the smart grid. According to recent studies, energy consumption in the industrial sector has increased sharply over the past few years and accounts for more than 50% of global energy use. Energy consumed by the industrial sector is also expected to increase further in the coming years. It is therefore important to develop efficient DR programs for energy management in the industrial sector.

However, implementing an effective DR program for industrial manufacturing systems is a complex and challenging task. Unlike in the residential and commercial sectors, industrial energy consumption patterns arise from a variety of production operations and manufacturing equipment that are typically sequential, interdependent, and correlated. A successful industrial DR implementation therefore requires a high-resolution system model that captures the physical characteristics of each piece of equipment in the target system. In addition, many industrial DR measures can cause production losses or increased costs associated with shifting production, which limits industrial customers' interest in participating in DR programs.

Over the past few years, reinforcement learning (RL) methods have attracted growing attention for solving complex sequential decision-making problems. A series of successes in RL has been achieved through the application of deep Q-learning. RL, a branch of artificial intelligence (AI) inspired by behavioral psychology, studies how software agents can act in uncertain environments to maximize cumulative reward. Applying RL involves identifying an agent and an environment. The agent interacts with the environment and receives feedback from it. This feedback (the reward) is used to evaluate each state-action pair. In general, the benefits of applying RL to decision-making problems can be summarized in two main aspects.

First, RL is model-free. It requires no predefined rules to determine how to select an action. Second, RL is adaptive. It can learn valuable knowledge from historical data to deal with highly uncertain system dynamics, and it can generalize the extracted knowledge and apply it to newly arising situations. Because of these advantages, RL can potentially solve smart grid decision-making problems such as DR energy management, electric vehicle charging, and dynamic economic dispatch.

Considerable attention has been paid to applying RL to DR management problems in the residential and commercial sectors. For example, one prior study proposed a deep RL (DRL) based energy management algorithm to obtain an optimal charging strategy for electric vehicles by treating user behavior and electricity prices stochastically. Another prior study developed a DRL-based DR scheduling algorithm that optimally controls a set of home appliances while accounting for the uncertainty of resident behavior, real-time electricity prices, and outdoor temperature.

Another prior study applied a batch-RL-based day-ahead DR scheme to optimize the energy consumption of flexible loads such as heat pump thermostats or electric heaters under limited information about the system dynamics. Another prior study used RL to explore the dynamic pricing problem on the service provider side: the pricing problem was modeled as a Markov decision process (MDP), and a Q-learning algorithm was then used to determine the optimal value. In another prior study, RL was applied to find ideal incentive rates for different customers, with the aim of optimizing both the service provider's profit and the customers' costs. Yet another prior study developed a multi-agent RL-based DR algorithm that minimizes the sum of the electricity cost and the user dissatisfaction cost for a smart home with various types of appliances.

The studies highlighted above have demonstrated a series of successes in solving residential and commercial DR problems, but those successes rest on simple application scenarios and/or low device diversity. Moreover, the devices in these applications are generally assumed to operate independently rather than in coordination with one another. These characteristics allow a variety of RL approaches, such as Q-learning and actor-critic methods, to be applied to residential and commercial scenarios to solve cost minimization or dynamic pricing problems.

Little work has been done to explore the DR potential of the industrial sector using RL-related techniques, for the following reasons:

i) Industrial assets are usually closely coupled and operate together.

ii) Consecutive machines in a manufacturing process must operate in a specific order, without violating that order.

iii) Industrial DR management must generally consider not only how to reduce energy consumption but also how to maintain production requirements.

A recent study proposed a real-time DR scheme to regulate the energy consumption of a lithium-ion battery assembly manufacturing system. In that study, the manufacturing system was modeled as a Markov game, and a multi-agent DRL algorithm was then used to solve the Markov game. However, the study only sought to optimize energy consumption and did not address constraints on production targets from an actual production standpoint. Although minimizing energy cost is attractive, a manufacturing system must not achieve it at the expense of violating production requirements.

Korean Patent Laid-open Publication No. 10-2019-0132193

To overcome these problems, an object of the present invention is to provide a constrained reinforcement learning (CRL) based DR management algorithm for industrial manufacturing systems that optimizes energy cost while guaranteeing production targets.

To this end, the demand response management method for a discrete industrial manufacturing system to which constrained reinforcement learning is applied according to the present invention is a method performed in an energy management device of the discrete industrial manufacturing system, and comprises: an experience accumulation step of storing a training set over all time steps (t = 1 to T) by, in state S_t of time step t, executing an action a_t according to a policy π, obtaining a reward R_t and a cost R^c_t, moving to the next state S_{t+1}, and storing a sample consisting of the current state S_t, the action a_t, the next state S_{t+1}, the reward R_t, and the cost R^c_t; a parameter update step comprising a process of randomly sampling a mini-batch from the training set, calculating target labels of a state-action value function and a state value function for the mini-batch, and updating the parameters of the state-action value function and of the state value function by gradient descent so that the error between the function values and the target labels is minimized, a process of updating the parameters of a policy function, a process of updating the parameters of a target state value function, and a process of updating a Lagrange multiplier; and a cumulative reward calculation step of calculating a cumulative reward by repeatedly executing, based on the parameters determined in the parameter update step, the operation of executing an action according to the policy in the current state, obtaining a reward, and moving to the next state. The parameter update step is repeated until the cumulative reward is maximized.
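The experience accumulation and mini-batch sampling described above can be illustrated with a small replay buffer. This is a minimal sketch, not the implementation of the invention: the class name `ReplayBuffer`, its capacity, and the placeholder values stored in the loop are assumptions made for illustration.

```python
import random
from collections import deque, namedtuple

# One stored sample: (S_t, a_t, S_{t+1}, R_t, R^c_t), as listed in the text.
Sample = namedtuple("Sample", ["state", "action", "next_state", "reward", "cost"])


class ReplayBuffer:
    """Fixed-capacity store of samples (hypothetical helper, not from the source)."""

    def __init__(self, capacity=10_000, seed=None):
        self._data = deque(maxlen=capacity)  # oldest samples are dropped first
        self._rng = random.Random(seed)

    def store(self, state, action, next_state, reward, cost):
        self._data.append(Sample(state, action, next_state, reward, cost))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random, without replacement."""
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


# Placeholder experience for T = 24 hourly steps (values are illustrative only).
buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(1, 25):
    buffer.store(state=t, action=1, next_state=t + 1, reward=-1.0, cost=0.0)
batch = buffer.sample(8)
```

Keeping the reward and the cost as separate fields lets a learner estimate the return and the constraint term from the same stored samples.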

As described above, the present invention makes it possible to determine a cost-effective operating strategy for a discrete industrial manufacturing system.

With the advent of artificial intelligence (AI), there is growing interest in adopting reinforcement learning (RL) to handle complex decision-making processes. The present invention proposes a CRL-based DR algorithm for discrete manufacturing systems that minimizes energy cost while achieving production targets. Specifically, the discrete manufacturing system is formulated as a CMDP, and a CRL algorithm is adopted to identify the optimal operating schedule for every machine.

Finally, simulation results obtained by applying the CDRL algorithm according to the present invention to a lithium-ion battery assembly process demonstrated that the algorithm can minimize the energy cost of the process without violating the production target.

In the future, the DR scheme according to the present invention can be evaluated in many other kinds of industrial facilities that integrate energy storage systems and renewable energy resources (for example, solar and wind power), and it can also be implemented in real industrial manufacturing systems so that its performance can be assessed.

FIG. 1 is a diagram showing the general configuration of an industrial discrete manufacturing system according to the present invention.
FIG. 2 is a diagram showing the configuration of the actor-critic scheme.
FIG. 3 is a diagram showing a lithium-ion battery assembly process to which the method according to the present invention is applied.
FIG. 4 is a diagram showing the hierarchical structure of a lithium-ion battery module.
FIG. 5 is a diagram showing the cumulative reward during the learning process according to the present invention.
FIG. 6 is a diagram showing the total energy demand in response to the electricity price.
FIG. 7 is a diagram showing the stored amount of final battery output at each time step (interval).
FIG. 8 is a diagram showing the total cost obtained by CSAC and by the Gurobi solver.
FIG. 9 is a diagram showing the cumulative reward during the learning process for June 2 to June 4, 2019.
FIG. 10 is a diagram showing the total energy demand in response to the electricity price for June 2 to June 4, 2019.
FIG. 11 is a diagram showing the stored amount of final battery output at each time interval for June 2 to June 4, 2019.
FIG. 12 is a flowchart showing the processing steps of the method according to the present invention.

To apply the CRL algorithm according to the present invention, the operating cost optimization problem of an industrial manufacturing process is formulated as a constrained Markov decision process (CMDP) that considers both energy consumption and resource management, without requiring forecasts of unknown variables or a model of the system dynamics.

A reward function and a cost function are carefully designed and incorporated into the constructed CMDP so that production targets can be achieved while energy cost is minimized.

Based on this CMDP, a new CRL algorithm that solves the CMDP is designed to determine the optimal operating policy for all equipment in the industrial manufacturing system.

Finally, the effectiveness of the CRL algorithm according to the present invention is evaluated using a real industrial case, a typical lithium-ion battery assembly manufacturing process. The evaluation results show that the CRL algorithm can balance energy demand and supply and reduce energy cost while guaranteeing production requirements. This indicates that the algorithm according to the present invention has the potential to handle complex industrial DR management problems.

The present invention is the first application of CRL to DR management of an industrial manufacturing system that simultaneously considers the operation of multiple sequential machines and the use of multiple resources (electricity and production materials).

The specification of the present invention is organized as follows.

First, the problem formulation of a typical discrete manufacturing system is detailed, including the energy demand model, the production resource balance model, and the objective function. The CRL algorithm according to the present invention for solving that DR problem is then described, followed by the evaluation results for the CRL algorithm.

Problem formulation

Consider the energy management problem of a typical discrete manufacturing system in a price-based DR environment. Assume that the facility has an energy management center (EMC) that periodically (for example, hourly) receives real-time price (RTP) data from the utility company. The EMC then determines the optimal control policy for managing the energy consumption of the various machines in the discrete manufacturing system. The CRL-based DR scheme according to the present invention is pre-installed in the EMC.

FIG. 1 shows a typical industrial discrete manufacturing system.

Referring to FIG. 1, M_{i,j} and B_{i,j} denote, respectively, a machine used to process intermediate products and the corresponding buffer used to store them. Here, i denotes the i-th serial production line branch and j denotes the j-th machine or buffer of the i-th branch. The mathematical formulations of the energy demand model, the production resource balance model, and the objective function are described below.

A. Energy demand model

In a discrete manufacturing system, each machine generally has two operating options: operating and idle. Operating means that the machine is in full operating mode, and idle means that the machine has been switched to power-saving mode.

Let z_{i,j}^t be the binary variable of machine M_{i,j}: z_{i,j}^t = 1 if M_{i,j} is in operating mode, and z_{i,j}^t = 0 otherwise.

During step t, each machine can select only one operating option. The energy consumption of machine M_{i,j} during step t can therefore be expressed as Equation 1.

Here, e_{i,j}^{op} and e_{i,j}^{idle} denote the energy consumption of machine M_{i,j} in operating mode and in power-saving mode, respectively.

The energy consumption of the entire manufacturing system during step t is then obtained by summation, as in Equation 2.

In addition, the total energy consumption E^t of the entire system is limited by the maximum capacity E_max of the local power grid, as in Equation 3.
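The images for Equations 1 to 3 are not reproduced in this text. A form consistent with the definitions above is given below as a reconstruction, not as the original notation. The symbol e_{i,j}^t for the per-machine energy consumption is taken from the state definition later in the description.

```latex
\begin{align}
e_{i,j}^{t} &= z_{i,j}^{t}\, e_{i,j}^{op} + \left(1 - z_{i,j}^{t}\right) e_{i,j}^{idle} \tag{1} \\
E^{t} &= \sum_{i} \sum_{j} e_{i,j}^{t} \tag{2} \\
E^{t} &\le E_{max} \tag{3}
\end{align}
```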

B. Production resource balance model

Between two sequential machines, buffer B_{i,j} serves as storage for the different parts of the product along the production process. The material stored in buffer B_{i,j} at step t is expressed as Equation 4.

Here, P_{i,j}^t (C_{i,j}^t) denotes the amount produced (consumed) in buffer B_{i,j} during step t. P_{i,j}^t and C_{i,j}^t are given by Equations 5 and 6, respectively.

p_{i,j}^t (c_{i,j}^t) denotes the production (consumption) rate of machine M_{i,j} in operating mode z_{i,j}^t.

To ensure normal operation of the process machines, the amount of material in buffer B_{i,j} during step t must satisfy the constraint in Equation 7.

Here, B_{i,j}^{min} and B_{i,j}^{max} denote the minimum and maximum amounts of material in buffer B_{i,j}.
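The energy demand model and the buffer balance described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: machines are held in a flat list rather than indexed by (i, j), the produced and consumed amounts are passed in directly because Equations 5 and 6 are not reproduced here, and all numbers are illustrative.

```python
def machine_energy(z, e_op, e_idle):
    """Energy used by one machine in one step: e_op if running (z = 1), else e_idle."""
    return z * e_op + (1 - z) * e_idle


def total_energy(z_all, e_op_all, e_idle_all):
    """Sum the per-machine energy over every machine in the system."""
    return sum(
        machine_energy(z, e_op, e_idle)
        for z, e_op, e_idle in zip(z_all, e_op_all, e_idle_all)
    )


def update_buffer(stored, produced, consumed):
    """Material balance: previous stock plus inflow minus outflow."""
    return stored + produced - consumed


def within_limits(stored, b_min, b_max):
    """Buffer constraint: the stock must stay between its minimum and maximum."""
    return b_min <= stored <= b_max


# Illustrative numbers only (not taken from the source).
z = [1, 0, 1]               # operating, idle, operating
e_op = [10.0, 8.0, 6.0]     # energy per step when operating
e_idle = [1.0, 0.5, 0.5]    # energy per step in power-saving mode
E_t = total_energy(z, e_op, e_idle)
grid_ok = E_t <= 30.0       # grid capacity check with an assumed E_max of 30

stock = update_buffer(stored=20, produced=15, consumed=10)
stock_ok = within_limits(stock, b_min=5, b_max=40)
```

A scheduler could call these helpers once per step to reject any joint machine decision that breaks the grid capacity or a buffer limit.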

C. Objective function

The goal of a real industrial discrete manufacturing system is to carry out its production tasks at minimum energy cost. The objective function of the system can be expressed by Equations 8 and 9.

Equation 8 defines the minimization of the energy cost over the day, where v^t denotes the electricity price at step t. Equation 9 defines the production target constraint: the remaining amount B_final^{24} in the final buffer B_final at the end of the last step (t = 24) cannot be smaller than the predefined target output.
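The images for Equations 8 and 9 are not reproduced in this text. The description implies the following form, offered as a reconstruction. The symbol B_{target} for the predefined target output is a placeholder chosen here, since the original symbol did not survive, and the summation range t = 1 to 24 is assumed from the 24 hourly steps.

```latex
\begin{align}
\min_{\{z_{i,j}^{t}\}} \quad & \sum_{t=1}^{24} v^{t} E^{t} \tag{8} \\
\text{s.t.} \quad & B_{final}^{24} \ge B_{target} \tag{9}
\end{align}
```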

CRL methodology

This section first briefly describes the CMDP, then formulates the industrial DR management problem as a CMDP, and finally applies the CRL methodology to solve the CMDP.

A. CMDP

As an extension of the MDP, a CMDP is characterized by a 6-tuple (S, A, π, P_r, R, R^c). Here, S denotes the state space containing the available states, A denotes the action space containing the available actions, and π denotes the distribution over actions given a state.

P_r: S×A×S → [0,1] denotes the transition probability function, and R (or R^c): S×A×S → ℝ denotes the reward function (or cost function). A CMDP involves an agent and an environment that interact with each other at discrete time steps. During each step t∈[0,T], the agent observes the state s_t∈S of the environment and selects an action a_t∈A according to the policy π. At the next step t+1, the agent receives the reward R(s_t,a_t,s_{t+1}) and the cost R^c(s_t,a_t,s_{t+1}). The environment moves to the next state s_{t+1} according to the transition probability function P_r(s_{t+1}|s_t,a_t). The agent's goal is to identify the optimal policy π that maximizes the expected discounted return J subject to an upper-bound constraint on the expected discounted total cost J^c.

Here, τ denotes a trajectory, γ∈[0,1] is the discount factor, and R_t and R_t^c are shorthand for R(s_t,a_t,s_{t+1}) and R^c(s_t,a_t,s_{t+1}), respectively.
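The equation images for the CMDP objective are not reproduced in this text. The standard constrained form that matches the description is sketched below. The bound d and the explicit expansion of the trajectory τ are notation assumed here, not recovered from the original.

```latex
\begin{align}
\max_{\pi} \quad & J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \gamma^{t} R_{t} \right] \\
\text{s.t.} \quad & J^{c}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \gamma^{t} R_{t}^{c} \right] \le d \\
& \tau = (s_{0}, a_{0}, s_{1}, a_{1}, \ldots)
\end{align}
```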

Finally, the state value function V^π(s) and the action value function Q^π(s,a) are defined as follows.

Here, π is either a deterministic policy that maps the state space S to the action space A, or a stochastic policy that maps a state to the probabilities of selecting the different actions.

According to the Bellman equation, the state value function and the action value function can each be decomposed into the immediate reward plus the discounted value of the successor state.
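The Bellman decomposition can be checked numerically on a single sampled trajectory. This is a small sketch with illustrative rewards: the value functions are expectations of such returns over trajectories, which the example does not attempt to estimate.

```python
def discounted_return(rewards, gamma):
    """Direct definition: sum of gamma**k * rewards[k] from the first step onward."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def returns_by_bellman(rewards, gamma):
    """Backward recursion G_t = R_t + gamma * G_{t+1}, with zero return after the last step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [-3.0, -1.0, -2.0]   # illustrative per-step rewards
gamma = 0.5
G = returns_by_bellman(rewards, gamma)
```

Each entry of `G` equals the immediate reward plus the discounted return of the following step, which is the same decomposition the Bellman equation applies in expectation.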

B. CMDP formulation of the industrial DR problem

The industrial DR management problem is formulated as a CMDP in which the EMC is regarded as the learning agent that interacts with the environment, namely the entire discrete manufacturing system. The formulated CMDP comprises a state space, an action space, a reward function, and a cost function. The optimization horizon consists of 24 steps, corresponding to the 24 decisions that must be made on the basis of hourly prices.

The goal of the present invention is to find the optimal strategy that satisfies the production target constraint while minimizing the total energy cost of the discrete manufacturing system.

1) State space

In industrial DR management, the state s observed at the beginning of each step comprises four parts: the time indicator t, the electricity price v^t, the machine energy consumption e_{i,j}^t, and the buffer storage B_{i,j}^t.

Equation 18 represents a sample of the state s_t at step t, which comprises the time indicator t, the electricity price v^t, the machine energy consumption e_{i,j}^t, and the buffer storage B_{i,j}^t.

2) Action space

At the beginning of each step, the EMC schedules the operation of each machine through the binary decision variable z_{i,j}^t∈{0,1}. The action space A therefore comprises the actions of all machines.

Equation 20 represents a sample of the action a_t at step t, where z_{i,j}^t denotes the choice of operating point of machine M_{i,j} at step t.
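The state and action definitions above can be sketched as plain data structures. This is only an illustration: the ordering of components inside the state vector, the flat machine indexing, and the helper names `build_state` and `all_joint_actions` are assumptions, since Equations 18 to 20 are not reproduced in this text.

```python
from itertools import product


def build_state(t, price, machine_energy, buffer_levels):
    """Flatten (t, v^t, e^t_{i,j}, B^t_{i,j}) into one observation vector."""
    return [float(t), float(price), *map(float, machine_energy), *map(float, buffer_levels)]


def all_joint_actions(n_machines):
    """Every combination of z in {0, 1} across the machines."""
    return list(product((0, 1), repeat=n_machines))


# Two machines and two buffers, with illustrative values.
state = build_state(t=3, price=0.12, machine_energy=[10.0, 0.5], buffer_levels=[25, 40])
actions = all_joint_actions(2)
```

The joint action count grows as 2 to the power of the number of machines, which is one reason a learned policy is attractive compared with enumerating every schedule.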

3) Reward function and cost function

As described above, the objective of industrial DR consists of two parts: satisfying the production task and minimizing the energy cost. In this CMDP framework, the reward function R_t is therefore defined as follows.

Here, R_t is the reciprocal of the energy cost of the manufacturing assembly system at step t.

The cost function R_t^c is defined as follows.

Here, R_t^c < 0, and B_final^t is the amount stored in the final buffer B_final at step t.

The first line of the cost function calculates how far the final production storage B_final^t deviates from the target output at the end of the last hour (t = 24). The second line measures the amount by which the final production storage B_final^t exceeds the maximum allowable storage B_final^{max}. The third line calculates the amount by which the final production storage B_final^t falls below the minimum allowable storage B_final^{min}. The fourth line indicates that the final production storage B_final^t lies between the minimum storage B_final^{min} and the maximum storage B_final^{max}.
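The four cases described above can be sketched as a piecewise function. The equation itself is not reproduced in this text, so several points are assumptions: the function name `cost_signal`, the non-positive sign convention (suggested by the statement that R_t^c < 0), the use of the absolute deviation in the first case, and a value of zero in the fourth case.

```python
def cost_signal(t, b_final, b_target, b_min, b_max, last_step=24):
    """Hypothetical piecewise cost on the final buffer; all signs are assumed."""
    if t == last_step:
        # Case 1: deviation of the final stock from the target at the last step.
        return -abs(b_final - b_target)
    if b_final > b_max:
        # Case 2: amount by which the stock exceeds the allowed maximum.
        return -(b_final - b_max)
    if b_final < b_min:
        # Case 3: amount by which the stock falls below the allowed minimum.
        return -(b_min - b_final)
    # Case 4: stock within limits, so no penalty (assumed to be zero).
    return 0.0


# Illustrative values only.
end_of_day = cost_signal(t=24, b_final=90, b_target=100, b_min=10, b_max=150)
overflow = cost_signal(t=5, b_final=160, b_target=100, b_min=10, b_max=150)
shortage = cost_signal(t=5, b_final=4, b_target=100, b_min=10, b_max=150)
in_range = cost_signal(t=5, b_final=60, b_target=100, b_min=10, b_max=150)
```

Under these assumptions the signal is zero whenever the buffer is within limits before the last step and grows more negative with the size of the violation.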

C. CRL algorithm

The CRL algorithm according to the present invention is a constrained soft actor-critic (CSAC) algorithm for solving the CMDP. To explain CSAC, the background of actor-critic algorithms is described first, followed by a detailed description of the CSAC algorithm according to the present invention.

1) Actor-critic algorithm

Depending on the learning strategy, RL algorithms can be classified as value-based, policy-based, or actor-critic. Value-based approaches such as Q-learning and SARSA use only a value function and have no explicit representation of the policy. Policy-based approaches such as policy gradient methods try to identify the optimal policy directly, without any form of value function. The third type, the actor-critic algorithm shown in FIG. 2, combines the two approaches. The actor is responsible for generating actions and the critic is responsible for processing rewards. During training, when the agent observes the most recent state of the environment, the actor outputs a set of actions based on the current policy. Meanwhile, the critic judges how good the current policy is by means of the value function. The deviation between the expected value and the received reward is then expressed as the temporal-difference (TD) error, which is fed back to both the actor and the critic to adjust the policy and the value function.
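The TD error that is fed back to the actor and the critic can be illustrated with the standard one-step form. This is a minimal sketch, not the formulation of the present invention: the function name `td_error` and its arguments are hypothetical, and the default discount rate of 0.95 is borrowed from the experimental settings given later in this description.

```python
def td_error(reward, value_s, value_next, gamma=0.95, done=False):
    """One-step TD error: r + gamma * V(s') - V(s).

    When the episode has ended (done=True), no value is bootstrapped
    from the next state.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s
```

A positive TD error means the outcome was better than the critic expected, so the value estimate and the probability of the chosen action are both pushed upward.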

Conventional actor-critic algorithms generally have poor sample efficiency, especially on-policy learning algorithms such as proximal policy optimization (PPO) and asynchronous advantage actor-critic (A3C). On-policy learning algorithms require new samples to be collected at each gradient step, which greatly reduces the efficiency of the training process. Off-policy learning algorithms such as deep deterministic policy gradient (DDPG) can reuse past samples, so their sample efficiency is much higher. However, off-policy learning algorithms are sensitive to their learning hyperparameters, which makes stability and convergence more challenging. To overcome these problems, the soft actor-critic (SAC) algorithm, an off-policy maximum-entropy DRL algorithm, was designed to achieve robust and sample-efficient performance.

The SAC algorithm improves a stochastic policy in an off-policy manner. A salient property of SAC is entropy regularization: the agent learns a policy that balances expected reward against entropy. This is closely related to the exploration-exploitation trade-off, in that higher entropy leads to more exploration. This allows SAC to accelerate learning and prevents the policy from converging prematurely to a poor local optimum.
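To make the link between entropy and exploration concrete, the entropy of a discrete stochastic policy can be computed as below. This is an illustrative sketch; the function name `policy_entropy` and the example distributions are hypothetical and do not come from the source.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A uniform policy over two modes explores more than a nearly deterministic one.
uniform = policy_entropy([0.5, 0.5])
peaked = policy_entropy([0.99, 0.01])
```

The uniform distribution has the larger entropy, so an entropy bonus rewards the agent for keeping both operating modes in play early in training.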

In the maximum-entropy RL framework, the value function is modified to include an entropy bonus at each step.

where the added term is the entropy of the stochastic policy at step t.

The Bellman equation in entropy-regularized form is therefore as follows.

The two regularized value functions V_h^π(s) and Q_h^π(s, a) are related by Equation 27.
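The relation between the two regularized value functions can be illustrated for a discrete action set. Equation 27 is not reproduced in this text, so the sketch below assumes the standard soft actor-critic relation V(s) = Σ_a π(a|s) [Q(s, a) − α log π(a|s)]; the function name `soft_state_value` is hypothetical, and the default α of 0.02 is taken from the experimental settings given later.

```python
import math

def soft_state_value(q_values, probs, alpha=0.02):
    """Entropy-regularized state value for a discrete policy.

    Assumes V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s)).
    """
    return sum(
        p * (q - alpha * math.log(p))
        for q, p in zip(q_values, probs)
        if p > 0.0
    )
```

With a deterministic policy the entropy term vanishes and the soft value reduces to the Q-value of the chosen action.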

According to Equation 27, an approximate solution for the policy is derived as in Equation 28.

When Q_h^π converges to Q_h^*, the agent obtains the optimal policy π^*. Given the optimal policy π^*, the optimal value V_h^*(s) can also be obtained. According to Equation 28, the Q-value function can be updated in an off-policy manner.

The SAC framework is presented in Algorithm 1. Implementation details such as clipped double Q-learning, the baseline value function, and delayed updates of the value function are omitted there.

Algorithm 1: Soft Actor-Critic

1. Initialize the policy and regularized value function parameters

2. Repeat steps 3 to 6

3. Generate samples based on the current policy

4. Sample from the data buffer

5. Update the value function parameters according to Equation 26

6. Update the policy parameters according to Equation 28

7. Repeat steps 3 to 6 until convergence

2) CSAC algorithm

Although SAC has had a series of successes on complex decision-making tasks, it is designed to handle MDPs and cannot solve a CMDP with practical constraints. To handle the constraints of the CMDP, the CSAC algorithm according to the present invention integrates an adjustable Lagrange multiplier into SAC.

The industrial DR problem is formulated as a CMDP in which the production target constraint is split across the time steps, that is, R_t^c < 0 as defined in Equation 22.

Under the SAC framework, the optimal solution of the CMDP can be obtained by solving Equation 29.

Here, D is the data sampling buffer, that is, a set of experience data.

By introducing a Lagrange multiplier, the constrained optimization problem can be reformulated as follows.

where the Lagrangian is defined as follows.

A Lagrange multiplier can be used to handle the constrained optimization problem. Given a multiplier λ^k ≥ 0 at the k-th iteration, the policy π^k is obtained by maximizing the Lagrangian over the policy domain.

The multiplier is then set as in Equation 32, and the process is repeated.

δ_i is the step size for updating λ, and [·]⁺ denotes projection onto the non-negative real numbers.
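The projected multiplier update can be sketched as follows. Equation 32 is not reproduced in this text, so the update form is an assumption: the function name `update_multiplier` and the argument `constraint_value` are hypothetical, and the sign convention (a positive `constraint_value` means the constraint is violated and the multiplier should grow) is chosen for illustration.

```python
def update_multiplier(lam, constraint_value, step_size):
    """Projected step: lambda <- [lambda + step_size * constraint_value]^+.

    The projection [x]^+ = max(x, 0) keeps the multiplier non-negative.
    """
    return max(0.0, lam + step_size * constraint_value)
```

When the constraint is comfortably satisfied for several iterations, the multiplier shrinks toward zero and the agent focuses on the reward; when it is violated, the multiplier grows and the penalty dominates.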

It has been shown that this iterative method of updating the policy parameters and the Lagrange multiplier converges to an ideal, feasible solution when the following three assumptions are satisfied.

First, V_h^π(s) is bounded for all policies π ∈ Π.

Second, every minimum of J_c(π) is a feasible solution.

Third, the step sizes for updating the parameter θ of the policy neural network must satisfy the corresponding conditions.

For a finite number of iterations, δ_λ can in practice be set smaller than δ_θ. The CSAC algorithm according to the present invention is summarized in Algorithm 2.

Algorithm 2: Constrained Soft Actor-Critic

Input: policy neural network parameters θ, state-value function V parameters, state-action value function Q parameters, the corresponding update step sizes, Lagrange multiplier λ, temperature parameter α, discount rate γ

Output: optimal control strategy A^* for all machines; only the action for the current step is executed

1. Initialize the policy neural network parameters θ, the state-value function V parameters, the state-action value function Q parameters, the corresponding update step sizes, the Lagrange multiplier λ, the temperature parameter α, and the discount rate γ

2. Initialize an empty replay pool D

3. For each episode i

4. For each sample step t

5. Select an action a_t in the current state s_t

6. Obtain the reward R_t and the cost R_t^c, and move to the next state s_{t+1}

7. Store the data in D

8. End of the sample steps

9. For each gradient step n

10. Randomly sample a mini-batch B of transitions

11. For the mini-batch B, compute the target labels (target values) as in Equations 33 to 37

12. Update the state-action value function parameters by gradient descent

13. Update the state-value function parameters by gradient descent

14. Update θ by gradient descent

15. Update the target V function parameters using the delay rate

16. Update λ with step size δ_λ

17. End of the gradient steps

18. Repeat the gradient steps until the maximum cumulative reward is obtained (up to episode I_max)

19. End of the algorithm

The CSAC algorithm consists of nine neural networks, which can be grouped into three sets.

The first four neural networks approximate two action-value functions and two state-value functions associated with the value function in the Lagrangian (Equation 32).

The next four neural networks approximate two action-value functions and two state-value functions associated with the constraint.

The last neural network, with parameters θ, approximates the policy function.

The parameters of these nine neural networks are inputs to the CSAC algorithm: the state-value function V parameters, the state-action value function Q parameters, and the policy network parameters θ. The corresponding gradient-descent step sizes, the Lagrange multiplier λ, the temperature parameter α, and the discount rate γ are also inputs to the CSAC algorithm.

As shown in Algorithm 2, the whole learning process is controlled by three counters: i (line 3 of Algorithm 2), t (line 4), and n (line 9). Here i counts the learning episodes, t indexes the hourly steps within a day, and n indexes the gradient steps. In lines 3 to 8, the algorithm accumulates experience. As the step counter t increases, the agent executes a_t according to the current policy π, obtains the reward R_t and the cost R_t^c, and moves to the next state s_{t+1}. During this process, each tuple (s_t, a_t, s_{t+1}, R_t, R_t^c) is stored in the replay pool D for later agent learning.
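The experience accumulation in lines 3 to 8 can be sketched as a small pool that stores five-element transitions and returns random mini-batches. This is a minimal sketch under stated assumptions: the class name `ReplayPool`, the `Transition` tuple, and the first-in-first-out eviction are hypothetical; only the stored fields and the default capacity of 500 come from this description.

```python
import random
from collections import deque, namedtuple

# One stored experience: (s_t, a_t, s_{t+1}, R_t, R_t^c).
Transition = namedtuple(
    "Transition", ["state", "action", "next_state", "reward", "cost"]
)

class ReplayPool:
    """Fixed-capacity pool D; the oldest transitions are dropped first (assumed)."""

    def __init__(self, capacity=500):
        self._items = deque(maxlen=capacity)

    def store(self, state, action, next_state, reward, cost):
        self._items.append(Transition(state, action, next_state, reward, cost))

    def sample(self, batch_size, rng=random):
        # Never request more transitions than the pool currently holds.
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)
```

Storing the cost alongside the reward lets the same mini-batch train both the reward-related and the constraint-related value networks.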

In lines 9 to 17, the algorithm enters the learning phase, that is, the gradient process. In line 10, the algorithm samples a random mini-batch B of transitions {(s_t, a_t, s_{t+1}, R_t, R_t^c)} from the replay pool D, where |B| is the mini-batch size. In line 11, the training targets for the Q and V neural networks are computed using Equations 33 to 37. In Equations 34 and 35, the training target is computed from the per-step signal r_t and from an additional copy of the V neural network, whose parameters are updated according to Equation 41. Here r_t is the signal used by the neural networks associated with the Lagrangian, while R_t^c is used by the neural networks associated with the constraint.

In Equations 35 and 36, the clipped double Q-learning technique is used to reduce bias in the policy update process; these equations give the training label for the V neural network.

Here, the two sets of Q neural networks are maintained and trained separately. Equation 37 indicates that the action used there is sampled from the policy. In line 12, to update the parameters of the action-value neural networks, a gradient-descent update is performed with the Adam optimizer to minimize the mean squared error (MSE). Similarly, in line 13, to update the parameters of the state-value neural networks, a gradient-descent update is performed with the Adam optimizer to minimize the MSE.
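The clipped double Q-learning label for the V neural network can be sketched as below. The corresponding equation is not reproduced in this text, so the sketch assumes the standard soft actor-critic form, min(Q1, Q2) − α log π(a|s); the function names `v_target` and `mse` are hypothetical.

```python
def v_target(q1, q2, log_prob, alpha=0.02):
    """Assumed V-network label: min of the two Q estimates minus alpha * log pi."""
    return min(q1, q2) - alpha * log_prob

def mse(predictions, targets):
    """Mean squared error between network outputs and their training labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)
```

Taking the smaller of the two Q estimates counteracts the tendency of a single learned Q function to overestimate action values.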

In line 14, gradient descent with the Adam optimizer is performed to update the parameters θ of the policy neural network so as to minimize the loss.

In line 15, the delayed update of the target V neural network parameters is performed according to Equation 41. In line 16, the Lagrange multiplier λ is updated according to Equation 42. In line 17, the algorithm completes the current gradient process for the sampled mini-batch B. Finally, the algorithm enters the next episode and repeats the learning process until the maximum cumulative reward is obtained, at which point it can produce the optimal operating policy.
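The delayed update of the target V neural network parameters can be illustrated with an exponential moving average. Equation 41 is not reproduced in this text, so the form below is an assumption: the function name `soft_update`, the symbol `tau` for the delay rate, and the choice that `tau` weights the newly learned parameters are all hypothetical.

```python
def soft_update(target_params, params, tau):
    """Assumed delayed update: target <- tau * params + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

A small `tau` makes the target network trail the learned network slowly, which keeps the training targets stable between gradient steps.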

FIG. 12 shows the processing flow of the demand response management method for a discrete industrial manufacturing system to which constrained reinforcement learning is applied according to the present invention.

The process shown in FIG. 12 is performed by the energy management device (EMC) of the discrete industrial manufacturing system or by a separate computer device, and the optimal operating policy (energy strategy) produced by the process is applied to the discrete industrial manufacturing system.

Specifically, the process shown in FIG. 12 generates the optimal operating policy for demand response management of the discrete industrial manufacturing system, and is carried out by a processor of the energy management device (EMC) or of the separate computer device. In the following, it is described as being performed by the processor.

Referring to FIG. 12, the method according to the present invention is broadly divided into an experience accumulation step (S100), a parameter update step (S102), and a cumulative reward calculation step (S104).

The experience accumulation step (S100) generates the training set. In the state (S_t) of time interval t, the processor executes an action (task) (a_t) according to the policy (π), obtains the reward (R_t) and the cost (R_t^c), and moves to the next state (S_{t+1}). It stores a sample consisting of the current state (S_t), the action (a_t), the next state (S_{t+1}), the reward (R_t), and the cost (R_t^c), and in this way stores the training set for all time steps (t = 1 to T) in the replay pool D held in internal memory.

Here, the state (S_t) includes the electricity price, the energy consumption of each machine, and the storage level of each buffer in the corresponding time interval (Equation 18). The action (a_t) indicates whether each machine is operating or idle in the corresponding time interval (Equation 20).
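The state and action described here can be represented with simple data structures. This is a sketch under stated assumptions: the class name `State`, its field names, and the helper `step_energy_cost` are hypothetical, and the helper assumes that an idle machine draws no power and that the price is given per kWh, neither of which is stated in this description.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class State:
    """State at one hourly step (field names are hypothetical)."""
    price: float                       # electricity price in the interval
    machine_energy: Tuple[float, ...]  # energy consumption of each machine
    buffer_levels: Tuple[int, ...]     # storage level of each buffer

def step_energy_cost(price, action, rated_power_kw, hours=1.0):
    """Energy cost of one step for a binary action vector (1 = operating, 0 = idle)."""
    energy_kwh = sum(a * p for a, p in zip(action, rated_power_kw)) * hours
    return price * energy_kwh
```

For instance, with three machines rated at 10, 20, and 30 kW, the action (1, 0, 1) at a price of 0.1 per kWh costs about 4.0 for a one-hour step under these assumptions.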

The parameter update step (S102) determines the input values of Algorithm 2 described above. It is the learning process (gradient step) that determines the parameters of the neural networks.

The input values of Algorithm 2 include the policy neural network (policy function) parameters θ, the state-value function V parameters, the state-action value function Q parameters, the corresponding update step sizes, the Lagrange multiplier λ, the temperature parameter α, and the discount rate γ.

The processor randomly samples a mini-batch from the training set generated in the experience accumulation step (S100), computes the target values (target labels) of the state-action value function and the state value function for the mini-batch, and updates the parameters of the state-action value function and of the state value function by gradient descent so that the error between the function values and the target values is minimized.

The processor then sequentially updates the parameters of the policy function (Equation 40), the parameters of the target state value function (Equation 41), and the Lagrange multiplier (Equation 42).

The cumulative reward calculation step (S104) computes the cumulative reward for one episode. Based on the determined parameters, the processor repeatedly executes an action according to the policy in the current state, obtains the reward, and moves to the next state, and thereby calculates the cumulative reward.

The processor then determines whether the cumulative reward has reached its maximum (S106). If it has not, the processor repeats the parameter update step (S102); once it determines that the cumulative reward has reached its maximum, it selects the policy at that point as the optimal operating policy (S108).
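The control flow of steps S100 to S108 can be sketched as a loop over episodes. This is an illustrative skeleton only: the function name `train` and the callables `collect`, `update`, and `evaluate` are hypothetical placeholders, and the sketch follows Algorithm 2 in collecting experience in every episode and in stopping after a fixed number of episodes, keeping the best cumulative reward seen.

```python
def train(collect, update, evaluate, max_episodes):
    """Run the S100-S108 loop and return (best reward, episode index)."""
    best_reward = float("-inf")
    best_episode = None
    for episode in range(max_episodes):
        collect()            # S100: accumulate experience
        update()             # S102: update network parameters and multiplier
        total = evaluate()   # S104: cumulative reward of one episode
        if total > best_reward:  # S106: is this the largest reward so far?
            best_reward, best_episode = total, episode
    # S108: the policy from the best episode is taken as the operating policy.
    return best_reward, best_episode
```

In a real implementation the policy parameters from the best episode would be saved alongside the reward so that they can be applied to the manufacturing system.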

Once the optimal operating policy has been determined through the above process, applying it to the discrete industrial manufacturing system achieves demand response management that minimizes energy cost while guaranteeing the production target.

Experimental evaluation

To verify the validity of the CSAC algorithm according to the present invention, a real lithium-ion battery assembly process was selected as an example of a discrete manufacturing process. The details of the lithium-ion battery assembly process are introduced first, and the algorithm according to the present invention is then evaluated.

A. Case study

As shown in FIG. 3, the lithium-ion battery assembly process comprises four processes: assembly, saturation, formation, and grading.

Assembly: The components are assembled into a battery module with the hierarchical structure shown in FIG. 4. The components include side frames (SF), battery cells (BC), cooling plates (CP), intermediate frames (IF), and compression foam (CF).

Saturation: An appropriate amount of electrolyte is injected into the module.

Formation: The module is activated into a usable state through appropriate charging and discharging processes.

Grading: The battery modules are graded according to their performance through a series of resistance and capacitance measurements.

FIG. 3 gives an overview of the battery module assembly process, which can be divided into 10 tasks, each assigned to an associated machine. The operating information of each machine, including the available operating modes, production rate, power consumption, and buffer capacity, is given in Table 1.

Based on these parameters, the target of the assembly system is set to 500 lithium-ion battery units per day (that is, the maximum buffer capacity of the machines). The maximum power drawn by the system is set to 500 kW.

B. Numerical results

As shown in Table 2, to start training, the Adam optimizer is used to update the state-value neural network V parameters, the state-action value neural network Q parameters, the policy neural network parameters θ, and the Lagrange multiplier λ. The corresponding step sizes are set to 5e-4, 1e-3, 1e-3, and 1e-5, respectively.

Each neural network has two hidden layers with 64 neurons per layer, and the rectified linear unit (ReLU) is used as the activation function connecting the two hidden layers. In the regularization of the Q-value neural networks, the temperature coefficient α and the discount rate γ are set to 0.02 and 0.95, respectively. A softmax function is applied to the last layer of the policy network to generate discrete actions. The replay buffer size and the batch size are set to 500 and 256, respectively. The weights of the nine neural networks are initialized randomly and updated iteratively.
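The settings listed above can be collected in one configuration, together with a softmax that turns the policy network's last-layer outputs into probabilities over discrete actions. This is a minimal sketch: the dictionary keys and the function name `softmax` are hypothetical, and applying the softmax to the two modes of a single machine is an assumption made for illustration.

```python
import math

# Values taken from the description; key names are hypothetical.
CONFIG = {
    "hidden_layers": 2,
    "hidden_units": 64,
    "activation": "relu",
    "alpha": 0.02,        # temperature coefficient
    "gamma": 0.95,        # discount rate
    "replay_size": 500,
    "batch_size": 256,
}

def softmax(logits):
    """Numerically stable softmax over a list of last-layer outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities of (idle, operating) for one machine, from two example logits.
mode_probs = softmax([0.0, 0.0])
```

Sampling from these probabilities during training, rather than always taking the most likely mode, is what gives the stochastic policy its exploration.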

After all parameters are set, the agent starts maximizing the cumulative reward. FIG. 5 shows the training process for September 5, 2019. As the reward shows, the agent performs poorly at first. As the number of iterations increases, however, the agent begins, through trial and error, to take actions that yield higher rewards, and it reaches the maximum reward after about 30,000 iterations. Once the maximum reward is obtained, the corresponding optimal operating strategy is determined. Table 3 lists the operating options of all machines at each step, where "1" and "0" denote the operating and idle modes, respectively.

FIG. 6 shows the total energy consumption of all machines under the optimal operating policy generated by the method according to the present invention. The machines consume more energy when the price is low and less when it is high, and avoid consuming energy during peak hours. Specifically, the machines consume more energy during steps 1-15 and 19-24 and less during steps 16-18. In steps 16 and 17 in particular, most machines reduce their energy consumption to the minimum because the electricity price is at its peak. This not only relieves stress on the power grid but also reduces the energy cost of the industrial consumer. To illustrate battery production in real time, FIG. 7 shows the final battery production storage at each step t under the proposed DR scheme. By executing the optimal schedule shown in Table 3, the system ultimately produces 500 battery modules (that is, the production target).

FIG. 8 compares the total costs obtained with the MILP Gurobi solver and with the CSAC algorithm according to the present invention. The MILP Gurobi solver uses a specific model that takes all information about the system into account and, as a myopic measure, meets the production target and minimizes the energy cost as defined in Equations 8 and 9. The CSAC algorithm, by contrast, relies on its self-learning ability to select among actions so as to maximize the reward, as described above. As FIG. 8 shows, the CRL method according to the present invention performs poorly in the early training stage because it learns by trial and error. After experiencing more episodes, however, the agent adapts to the learning environment and adjusts its policy through the exploration and exploitation mechanism, and finally obtains the optimal policy. Because the method according to the present invention is model-free and requires no expert knowledge of complex energy management scenarios, it offers a promising solution to complex industrial energy management problems.

To gain insight into the efficiency of the DR algorithm according to the present invention, a simulation is performed using ComEd electricity prices for the three days from June 2 to June 4, 2019. FIGS. 9 and 10 show, respectively, the convergence of the cumulative reward during learning and the corresponding total energy consumption of all machines under the DR scheme according to the present invention.

As FIG. 10 shows, similar energy consumption trends are observed across the daily patterns, which further supports the validity of the DR scheme according to the present invention. To illustrate the final battery production storage in real time from June 2 to June 4, 2019, FIG. 11 shows the final battery production storage at each step t over the three days under the proposed DR scheme. As FIG. 11 shows, the system produces 500 battery modules by the last step of each day.

Claims

A demand response management method for a discrete industrial manufacturing system to which constrained reinforcement learning is applied, performed in an energy management device of the discrete industrial manufacturing system, the method comprising:
an experience accumulation step of executing an action (task) (a_t) according to a policy (π) in the state (S_t) of time interval t, obtaining a reward (R_t) and a cost (R_t^c), moving to the next state (S_{t+1}), and storing a training set for the total time steps (t = 1 to T);
a parameter update step comprising: randomly sampling a mini-batch from the training set, calculating target values (target labels) of a state-action value function and a state value function for the mini-batch, and updating the parameters of the state-action value function and the parameters of the state value function by gradient descent so that the error between the function values and the target values is minimized; updating parameters of a policy function; updating parameters of a target state value function; and updating a Lagrange multiplier; and
a cumulative reward calculation step of calculating a cumulative reward by repeatedly executing an action according to the policy in the current state based on the parameters determined in the parameter update step, obtaining a reward, and moving to the next state,
wherein the parameter update step is repeated until the cumulative reward is maximized.

The method according to claim 1,
wherein the state includes the electricity price, the energy consumption of each machine, and the storage level of each buffer in the corresponding time interval.

The method according to claim 1,
wherein the action indicates whether each machine is operating or idle in the corresponding time interval.

The method according to claim 1,
wherein, when the cumulative reward is maximized, the policy at that time is determined as the optimal operating policy.