KR102114989B1 - System for precision training of air conditioning device using deep reinforcement learning algorithm and method thereof

Info

Publication number: KR102114989B1
Application number: KR1020190163962A
Authority: KR
Inventors: 윤명섭; 윤원식
Original assignee: 한국산업기술시험원
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-05-26

Abstract

The present invention relates to an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied, and a method thereof. The system comprises: a remote control that transmits at least one of a target temperature and a target humidity according to a user input; an air conditioner that implements a test environment, receives at least one of the target temperature and the target humidity transmitted from the remote control, and operates in the test environment based on the received target temperature and target humidity; a precision test chamber in which simulated outdoor and indoor environmental conditions are set in advance for deep reinforcement learning of the air conditioner; a data collection unit that collects data related to the air conditioner; and a terminal that generates input values for a deep reinforcement learning neural model from the collected data, performs learning in the test environment of the precision test chamber using the model and the generated input values, stores, after the learning stage performed during a preset simulation period ends normally, the weights and bias values of the individual nodes of the neural network that hold the information obtained by learning, and provides the air conditioner with a learning result that includes the weight and bias value of each node for each test environment. According to the present invention, user convenience and overall system operating efficiency are improved.

Description

Air conditioner precision learning system and method using deep reinforcement learning algorithm {SYSTEM FOR PRECISION TRAINING OF AIR CONDITIONING DEVICE USING DEEP REINFORCEMENT LEARNING ALGORITHM AND METHOD THEREOF}

The present invention relates to an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied, and a method thereof. More specifically, iterative learning is performed under various conditions, such as changes in outdoor air temperature and humidity, changes in building sensible and latent heat loads, and changes in fine dust concentration, in a test environment that has no external disturbances other than the simulation conditions. The learning does not depend on a mathematical model of the air conditioner components, a mathematical model of the environment, or a coupled air conditioner-environment model; the only feedback is a reward or penalty value indicating whether the range targeted by the user is satisfied. Once the learning pattern stabilizes within the target range, the final learning result for that stabilized pattern is stored in a neural network. An air conditioner deployed in the field then performs adaptive learning for its indoor and/or outdoor environment, starting from the stored final learning result, to control its own operation.

In general, an air conditioner is a device installed in a home, office, or similar space to manage indoor cooling, heating, humidity, and air cleanliness.

Such an air conditioner adjusts the indoor temperature, relative humidity, air cleanliness, and so on according to the temperature and humidity set by the user through a wired or wireless remote control, and it is usually controlled with some error above and below the setpoint entered on the remote control.

In addition, sensitive users frequently adjust the temperature, airflow, and other settings with the remote control. Even products of the same model can deviate from the target setpoint depending on the environment in which each unit is used, such as the external environmental conditions, changes in the internal building load, and the occupancy pattern of the users, which can make the product inconvenient to use.

Korean Registered Patent No. 10-1875488 [Title: Apparatus and method for automatically controlling a cooling system using artificial intelligence]

An object of the present invention is to provide an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied, and a method thereof, in which iterative learning, including reinforcement learning with a deep neural network, is performed under various conditions such as changes in outdoor air temperature and humidity, changes in building sensible and latent heat loads, and changes in fine dust concentration, in a test environment that has no disturbances other than the simulation conditions; in which, once the learning pattern stabilizes within the target range, the final learning result for that stabilized pattern is stored in a neural network; and in which an air conditioner deployed in the field performs adaptive learning for its indoor and/or outdoor environment, starting from the stored final learning result, to control its own operation.

An air conditioner precision learning system to which a deep reinforcement learning algorithm is applied according to an embodiment of the present invention includes: a remote control that transmits at least one of a target temperature and a target humidity according to a user input; an air conditioner that implements a test environment, receives at least one of the target temperature and the target humidity transmitted from the remote control, and operates in the test environment based on the received target temperature and target humidity; a precision test chamber in which simulated outdoor and indoor environmental conditions are set in advance for deep reinforcement learning of the air conditioner; a data collection unit that collects data related to the air conditioner; and a terminal that generates input values for a deep reinforcement learning neural model from the collected data related to the air conditioner, performs learning in the test environment of the precision test chamber using the deep reinforcement learning neural model and the generated input values, stores, after the learning stage performed during a preset simulation period ends normally, the weights and bias values of the plurality of individual nodes of the neural network that hold the information obtained by learning, and provides the air conditioner with a learning result that includes the weight and bias value of each node for each test environment.
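The "learning result" described above (weights and bias values of each node, kept per test environment) can be pictured as a small serializable record. The following is a minimal sketch under stated assumptions: the field names (`environment`, `layers`, `weights`, `biases`), the JSON format, and the example environment keys are all hypothetical and are not specified in the patent text.

```python
import json

def pack_learning_result(environment, layers):
    """Bundle one test environment with the trained parameters of each layer.

    `environment` is a dict describing the chamber conditions (hypothetical keys).
    `layers` is a list of (weights, biases) pairs, one pair per network layer.
    """
    return {
        "environment": environment,
        "layers": [{"weights": w, "biases": b} for w, b in layers],
    }

def dumps_results(results):
    """Serialize a list of per-environment learning results to a JSON string."""
    return json.dumps(results, sort_keys=True)

def loads_results(text):
    """Restore the list of learning results, e.g. on the air conditioner side."""
    return json.loads(text)

# Example: one result for an assumed standard cooling condition.
example = pack_learning_result(
    {"outdoor_temp_c": 35.0, "outdoor_rh_pct": 40.0},
    [([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1])],
)
restored = loads_results(dumps_results([example]))
```

Keeping one record per test environment matches the text's "for each test environment" wording; how the air conditioner chooses among stored records is not described in this section and is left out here.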

In addition, the air conditioner learns either constant cooling and heating temperature and humidity conditions or variable temperature and humidity conditions during the cooling period. The constant cooling and heating temperature and humidity conditions are set to implement a test environment that has no disturbances other than the simulation conditions. The variable temperature and humidity conditions include variable temperature and variable humidity for the outdoor side and variable sensible heat load and variable latent heat load for the indoor side, and may be set so that the variable environment of the installation site is learned by simulation before the air conditioner is installed.

In addition, the data related to the air conditioner may include at least one of: power consumption; discharge temperature near the outlet of the air conditioner; discharge humidity near the outlet of the air conditioner; air cleanliness at the outlet of the air conditioner; temperature and humidity of the room in which the air conditioner is located; outdoor temperature and outdoor humidity; indoor air cleanliness; indoor sensible heat load and indoor latent heat load; reaction data of occupants and of objects in the room; weather forecasts; and a weather database.

In addition, the input values may include at least one of the following, taken from the collected data: the difference between the indoor temperature and the received target temperature; the instantaneous rate of change of the indoor temperature over a preset time; the discharge temperature near the outlet of the air conditioner; the difference between the indoor humidity and the received target humidity; the instantaneous rate of change of the indoor humidity over a preset time; the discharge humidity near the outlet of the air conditioner; the difference between the indoor air cleanliness and the target air cleanliness; the instantaneous rate of change of the indoor air cleanliness over a preset time; the air cleanliness near the outlet of the air conditioner; state information, including position and temperature, of occupants and objects in the room over a certain past period; and state information, including current position and temperature, of occupants and objects in the room.
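The temperature and humidity inputs listed above can be sketched as a small feature-building function. This is an illustrative sketch only: the `Sample` record, the function names, the sign convention (indoor value minus target), and the use of a simple first-to-last difference quotient for the "rate of change over a preset time" are assumptions, not details given in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Sample:
    """One reading from the data collection unit (hypothetical record layout)."""
    t: float               # timestamp in seconds
    indoor_temp: float     # degrees C
    indoor_hum: float      # percent relative humidity
    discharge_temp: float  # degrees C, near the outlet
    discharge_hum: float   # percent relative humidity, near the outlet

def slope(values: Sequence[float], times: Sequence[float]) -> float:
    """Rate of change between the first and last sample of the window."""
    dt = times[-1] - times[0]
    if dt <= 0:
        return 0.0  # a single sample gives no slope information
    return (values[-1] - values[0]) / dt

def build_state(window: Sequence[Sample], target_temp: float,
                target_hum: float) -> List[float]:
    """Build the input vector from the samples inside the preset time window."""
    if not window:
        raise ValueError("window must contain at least one sample")
    last = window[-1]
    times = [s.t for s in window]
    return [
        last.indoor_temp - target_temp,                 # temperature error
        slope([s.indoor_temp for s in window], times),  # temperature slope
        last.discharge_temp,                            # discharge temperature
        last.indoor_hum - target_hum,                   # humidity error
        slope([s.indoor_hum for s in window], times),   # humidity slope
        last.discharge_hum,                             # discharge humidity
    ]

window = [Sample(0.0, 27.0, 60.0, 15.0, 80.0), Sample(60.0, 26.4, 58.8, 14.0, 78.0)]
state = build_state(window, target_temp=25.0, target_hum=50.0)
```

The air cleanliness and occupant inputs would extend the same vector in the same way; they are omitted here to keep the sketch short.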

In addition, the terminal may configure the neural network to be used in the deep reinforcement learning neural model and perform learning for the test environment through the deep reinforcement learning algorithm.

An air conditioner precision learning method to which a deep reinforcement learning algorithm is applied according to another embodiment of the present invention includes: implementing, by a precision test chamber, a test environment; receiving, by the precision test chamber, at least one of a target temperature, a target humidity, and a target air cleanliness according to a user input through a remote control; operating, by an air conditioner, in the test environment of the precision test chamber based on the received target temperature, target humidity, and target air cleanliness; collecting, by a data collection unit, data related to the air conditioner and test environment data of the precision test chamber; generating, by a terminal, input values for a deep reinforcement learning neural model from the collected test environment data and the data related to the air conditioner; performing, by the terminal, learning in the test environment using the deep reinforcement learning neural model and the generated input values; storing, by the terminal, after the learning stage performed during a preset time ends normally, the weights and bias values of the plurality of individual nodes of the neural network that hold the information obtained by learning; and storing, by the air conditioner, the learning result provided by the terminal, which includes the weight and bias value of each node for each test environment.

In addition, the method may further include: installing the air conditioner at a specific place according to the user's purpose; operating, by the air conditioner, in the specific environment of that place according to another target temperature, another target humidity, and another target air cleanliness entered by the user; and additionally performing, by the air conditioner, learning according to the deep reinforcement learning algorithm so as to fit the conditions of the actual installation site, based on the data collected while operating in the specific environment and on the stored learning result that includes the weight and bias value of each node for each test environment.

In addition, the step of performing learning in the test environment may include: setting, by the terminal, a temperature initialization range, a humidity initialization range, an air cleanliness initialization range, a temperature reward range, a humidity reward range, an air cleanliness reward range, and a simulation period to be used by the deep reinforcement learning algorithm; initializing, by the terminal, the episode start score and the previous-episode score of the deep reinforcement learning algorithm; setting, for the simulation period and separately from the temperature, humidity, and air cleanliness initialization ranges, the temperature reward range, humidity reward range, and air cleanliness reward range used to decide whether the reward (R_t) is a reward point or a penalty point; selecting, by the terminal, one output value (A_t) from the plurality of output values of the output layer of the neural network configured in the deep reinforcement learning algorithm, based on the current state (S_t) and the value function value (Q), and controlling the operation of the air conditioner through the remote control based on the selected output value; receiving, by the terminal, the next current state from the data collection unit after a preset time has elapsed since the remote control was operated; calculating, by the terminal, the reward value (R_t) based on the current state (S_t), the output value (A_t), and the next current state (S_{t+1}); storing, by the terminal, a tuple (D) that includes the current state (S_t), the output value (A_t), the reward value (R_t), and the next current state (S_{t+1}); generating, by the terminal, a training mini-batch by randomly sampling some of the stored tuples (D); calculating, by the terminal, a target output value (Y_j) from the mini-batch data for the one episode in progress within the preset simulation time; calculating a loss function value (L_j) that is the square of the difference between the target output value (Y_j) and the value function value (Q) based on the mini-batch; updating the bias and weight of each node of the neural network so that the loss function value (L_j) is minimized; determining, by the terminal within one episode, whether the reward is a positive (+) value because the preset temperature, humidity, and air cleanliness reward ranges are satisfied, or a negative (-) value because they are not; updating, by the terminal, the score by accumulating the reward value (R_t) onto the existing score; when the reward value (R_t) is a reward point (+), accumulating it onto the score of the current episode, selecting one output value (A_t) from the plurality of output values of the output layer of the neural network based on the current state (S_t) and the value function value (Q), and controlling the operation of the air conditioner through the remote control based on the selected output value (A_t); when the reward value (R_t) is a penalty point (-), accumulating it onto the score of the current episode and outputting to the remote control the selected output value (A_t) for a specific direction until the state enters the temperature, humidity, and air cleanliness initialization ranges; when the simulation time end condition is not satisfied, recording, by the terminal, the episode start score and the previous-episode score and initializing a new episode score; and when the simulation time end condition is satisfied, terminating, by the terminal, the deep reinforcement learning algorithm.
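The core of the loop above (select A_t from Q, compute R_t from a reward range, store the tuple, sample a mini-batch, form Y_j, minimize the squared loss L_j) can be sketched in a few lines. This is a sketch under stated assumptions, not the patented implementation: a single linear layer stands in for the deep network, the target follows the standard DQN form Y_j = R_j + gamma * max Q(S_{j+1}, a), and the discount factor `GAMMA`, the epsilon-greedy exploration, the learning rate, and the +1/-1 reward magnitudes are illustrative values that do not appear in the text.

```python
import random
from collections import deque

GAMMA = 0.9  # discount factor (assumed; not given in the text)

class LinearQ:
    """Stand-in for the neural network: Q(s, a) = w[a] . s + b[a]."""
    def __init__(self, n_inputs, n_actions):
        self.w = [[0.0] * n_inputs for _ in range(n_actions)]
        self.b = [0.0] * n_actions

    def q_values(self, state):
        return [sum(wi * si for wi, si in zip(w, state)) + b
                for w, b in zip(self.w, self.b)]

    def update(self, state, action, target, lr=0.01):
        """One gradient step on L = (target - Q(s, a))^2; returns L before the step."""
        error = target - self.q_values(state)[action]
        for i, si in enumerate(state):
            self.w[action][i] += lr * 2 * error * si
        self.b[action] += lr * 2 * error
        return error ** 2

def select_action(model, state, epsilon, rng):
    """Pick A_t: random with probability epsilon (assumed), else argmax of Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(model.b))
    q = model.q_values(state)
    return q.index(max(q))

def reward(temp_error, reward_band):
    """R_t: +1 inside the temperature reward range, -1 outside (assumed magnitudes)."""
    return 1.0 if abs(temp_error) <= reward_band else -1.0

def td_target(model, r, next_state):
    """Y_j in the standard DQN form (assumed): r + gamma * max_a Q(s', a)."""
    return r + GAMMA * max(model.q_values(next_state))

def train_on_minibatch(model, replay, batch_size, rng):
    """Sample stored tuples (S_t, A_t, R_t, S_{t+1}) and return the mean loss."""
    batch = rng.sample(list(replay), min(batch_size, len(replay)))
    losses = [model.update(s, a, td_target(model, r, s_next))
              for s, a, r, s_next in batch]
    return sum(losses) / len(losses)

# Usage: one stored transition, one training step.
rng = random.Random(0)
model = LinearQ(n_inputs=2, n_actions=3)
replay = deque(maxlen=1000)  # tuple store D
replay.append(([1.0, 0.0], 1, reward(0.3, 0.5), [0.0, 0.0]))
mean_loss = train_on_minibatch(model, replay, batch_size=32, rng=rng)
```

Sampling the mini-batch at random from the stored tuples, rather than training on consecutive transitions, is the step the text describes; the episode scoring and the return-to-initialization-range behavior on a penalty are left out of this sketch.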

According to the present invention, iterative learning is performed under various conditions, such as changes in outdoor air temperature and humidity, changes in building sensible and latent heat loads, and changes in fine dust concentration, in a test environment that has no external disturbances other than the simulation conditions. Once the learning pattern stabilizes within the target range, the final learning result for that stabilized pattern is stored, and an air conditioner deployed in the field performs adaptive learning for its indoor and/or outdoor environment, starting from the stored final learning result, to control its own operation. This improves convenience of use, improves overall system operating efficiency, and allows the air conditioner to respond flexibly to changes in the external environment.

FIG. 1 is a block diagram showing the configuration of an air conditioner precision learning system to which a deep reinforcement learning algorithm according to an embodiment of the present invention is applied.
FIG. 2 is a diagram showing a neural network (artificial neural network model) according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the actual use environment of the air conditioner that corresponds to the precision test chamber environment according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a simulation environment implemented in the precision test chamber according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of normal operation of an air conditioner to which no machine learning algorithm is applied.
FIG. 6 is a diagram showing the test results of the first day on which deep reinforcement learning according to an embodiment of the present invention was applied.
FIG. 7 is an enlarged view of the indoor temperature (Tin) according to an embodiment of the present invention.
FIG. 8 shows the scores obtained in each episode according to an embodiment of the present invention.
FIG. 9 is a diagram showing the test results of the second day on which deep reinforcement learning according to an embodiment of the present invention was applied.
FIG. 10 is a diagram showing the test results of the third day on which deep reinforcement learning according to an embodiment of the present invention was applied.
FIG. 11 is a diagram showing the average indoor temperature according to an embodiment of the present invention.
FIG. 12 is a diagram showing the standard deviation (S.D.) of the indoor temperature according to an embodiment of the present invention.
FIGS. 13 and 14 are flowcharts showing an air conditioner precision learning method to which a deep reinforcement learning algorithm according to an embodiment of the present invention is applied.
FIG. 15 is a flowchart showing a learning method to which a deep reinforcement learning algorithm according to an embodiment of the present invention is applied.

It should be noted that the technical terms used in the present invention are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in the present invention, the technical terms used herein should be interpreted in the sense generally understood by a person having ordinary knowledge in the technical field to which the present invention belongs, and should not be interpreted in an excessively broad or excessively narrow sense. If a technical term used in the present invention is an incorrect term that fails to express the spirit of the present invention accurately, it should be replaced by, and understood as, a technical term that a person skilled in the art can understand correctly. General terms used in the present invention should be interpreted according to their dictionary definitions or according to the surrounding context, and should not be interpreted in an excessively narrow sense.

In addition, singular expressions used in the present invention include plural expressions unless the context clearly indicates otherwise. Terms such as "consists of" or "includes" should not be construed as necessarily including all of the components or steps described in the invention; some of those components or steps may be absent, and additional components or steps may be included.

In addition, terms that include ordinal numbers, such as first and second, may be used to describe components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, a first component may be named a second component, and similarly a second component may be named a first component, without departing from the scope of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and duplicate descriptions of them are omitted.

In addition, in describing the present invention, detailed descriptions of related known technologies are omitted where they could obscure the subject matter of the present invention. It should also be noted that the accompanying drawings are provided only to make the spirit of the present invention easy to understand and should not be interpreted as limiting that spirit.

FIG. 1 is a block diagram showing the configuration of an air conditioner precision learning system (10) to which a deep reinforcement learning algorithm according to an embodiment of the present invention is applied.

Referring to FIG. 1, the air conditioner precision learning system (10) to which a deep reinforcement learning algorithm is applied may include an air conditioner (100), a data collection unit (200), a terminal (300), a remote control (400), and a precision test chamber (500).

Not all of the components of the air conditioner precision learning system (10) shown in FIG. 1 are essential. The system (10) may be implemented with more components than those shown in FIG. 1, or with fewer.

The air conditioner (or air conditioner combined with an air purifier) (100) is a device that manages indoor cooling, heating, humidity control, and/or air cleanliness, and refers to the product used by the consumer.

The precision test chamber (500) is itself a kind of air conditioner, but in this embodiment it is distinguished from the air conditioner (100); it can implement or set test environments with various conditions that simulate real situations.

In addition, the air conditioner (100) can communicate with the data collection unit (200), the terminal (300), the remote control (400), and the like. If additional learning is needed at the installation site after learning in the precision test chamber (500), the precision test chamber (500) may be replaced by the environment of the installation space (for example, a home, an office, or a factory), and the data collection unit (200) may be reduced to a minimum and included in the air conditioner (100) product. The terminal (300) may be any computing system capable of implementing artificial intelligence, or it may be a chipset with a small computing capacity built into the air conditioner (100) product. The remote control (400) covers both wired and wireless remote controls, and may be replaced by a signal transmission device built into the product that controls each component directly.

In addition, as described above, the precision test chamber 500 for the air conditioner 100 implements (or sets) a test environment (or external environment) covering temperature conditions, humidity conditions, occupant conditions, and the like that simulate real situations, including standard cooling and heating conditions. Here, in order to implement a test environment free of unintended disturbances other than the simulation conditions, the precision test chamber 500 includes conditions such as variable temperature and variable humidity on the outdoor side, and conditions such as variable sensible heat load and variable latent heat load on the indoor side. The air conditioner 100 may be not only a device that performs cooling and heating functions but also an air conditioner combined with an air purifier.

In addition, the air conditioner 100 receives a target temperature, a target humidity, a target air cleanliness, and the like according to a user input through the remote control 400.

In addition, the air conditioner 100 is driven (or operated) in the precision test chamber 500 or in the installation environment based on the received target temperature, target humidity, target air cleanliness, and the like.

That is, in the test environment, the air conditioner 100 is driven to adjust at least one of the temperature, humidity, and air cleanliness around the air conditioner 100 according to the received target temperature, target humidity, and the like.

In addition, the air conditioner 100 may store the learning result provided from the terminal 300 (that is, a learning result including the weights, bias values, and the like at each node for each of a plurality of test environments) in a memory, such as a chipset, provided in the air conditioner 100.

In addition, the air conditioner 100 is installed (or placed) in a real environment (or real location) by an operator.

In addition, in the environment where it is actually deployed, the air conditioner 100 operates (or is driven) according to at least one of a different target temperature, a different target humidity, and a different target air cleanliness given by user input.

In addition, the air conditioner 100 performs additional on-site learning according to the deep reinforcement learning algorithm, using the various data collected during operation in the deployed environment together with the weights and bias values at the individual nodes of the neural network (that is, the learning result). It thereby calculates (or manages) weights, bias values, and the like at the individual nodes that minimize the loss function and are optimized for the real environment, which optimizes the operation of the air conditioner 100.

In this way, before the air conditioner 100 is placed at its actual place of use, it stores the learning results obtained under various test environments. Once placed there, it controls its operation using the stored learning results. This improves the overall operating efficiency of the air conditioner 100 and increases user convenience.

The data collection unit (or sensor unit/measurement unit) 200 may be included in the precision test chamber 500 for the air conditioner 100 to perform precise measurement, or may be configured (or installed/placed) adjacent to, or included in, the air conditioner 100 to measure only the essential items.

In addition, the data collection unit 200 communicates with the precision test chamber 500 for the air conditioner 100, the terminal 300, the remote control 400, and the like.

In addition, the data collection unit 200 collects (or acquires/measures) data related to the air conditioner 100, environmental data of the precision test chamber 500 and/or of the actual site, and occupant data.

That is, the data collection unit 200 collects data related to the air conditioner 100, including: the power consumption of the air conditioner 100; the temperature around (or near) the discharge port (not shown) of the air conditioner 100 while it is running (the discharge temperature); the humidity near the discharge port (the discharge humidity); the indoor temperature and indoor humidity of the room where the air conditioner 100 is located; and the outdoor temperature and outdoor humidity outside that room. The data collection unit 200 may also collect environmental data such as temperature, humidity, and air cleanliness at a point a certain distance from the air conditioner 100 rather than around its discharge port.

In addition, the data collection unit 200 provides the collected data related to the air conditioner 100 to the terminal 300.

The terminal 300 may be any of various terminals, such as a smartphone, a portable terminal, a mobile terminal, a foldable terminal, a personal digital assistant (PDA), a portable multimedia player (PMP) terminal, a telematics terminal, a navigation terminal, a personal computer, a notebook computer, a slate PC, a tablet PC, an ultrabook, a wearable device (including, for example, a smartwatch, smart glasses, or a head mounted display (HMD)), a WiBro terminal, an Internet Protocol television (IPTV) terminal, a smart TV, a digital broadcasting terminal, an audio video navigation (AVN) terminal, an audio/video (A/V) system, a flexible terminal, or a digital signage device.

In addition to the devices listed above, the terminal 300 may be a GPU server system, a cloud server, or a computing system or microchip installed directly in the air conditioner 100.

In addition, the terminal 300 may include a communication unit (not shown) for communicating with other terminals, a storage unit (not shown) for storing various information and programs (or applications), a display unit (not shown) for displaying various information and program execution results, a voice output unit (not shown) for outputting voice information corresponding to that information and those results, and a control unit (not shown) for controlling the various components and functions of the terminal.

In addition, the terminal 300 communicates with the precision test chamber 500 for the air conditioner 100, the data collection unit 200, the remote control 400, and the like.

In addition, the terminal 300 generates the input values to be applied to the deep reinforcement learning neural model based on the data collected by the data collection unit 200 (that is, the data related to the air conditioner 100).

That is, from the data collected by the data collection unit 200, the terminal 300 generates the following as input values for the deep reinforcement learning neural model: the difference between the indoor temperature (for example, Tin) and the received target temperature; the instantaneous rate of change of the indoor temperature over a preset time (for example, dTin/dt); the temperature of occupants and objects; the discharge temperature near the discharge port of the air conditioner 100 (for example, Tdis); the difference between the indoor humidity (for example, Hin) and the received target humidity; the instantaneous rate of change of the indoor humidity over a preset time (for example, dHin/dt); the discharge humidity near the discharge port (for example, Hdis); the moisture amount, such as sweat, of occupants and objects; the difference between the indoor air cleanliness and the received target air cleanliness; the instantaneous rate of change of the indoor air cleanliness over a preset time; the air cleanliness near the discharge port; and the air cleanliness around occupants and objects.
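The input-value generation described above can be sketched as follows. This is a minimal illustration covering only the temperature and humidity terms; the `Sample` record, the `build_state` helper, and the use of a finite difference between two consecutive samples to approximate the rate of change are assumptions made for this sketch and are not specified in the description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """One reading from the data collection unit (hypothetical record layout)."""
    t: float      # time stamp in seconds
    t_in: float   # indoor temperature Tin (deg C)
    t_dis: float  # discharge temperature Tdis (deg C)
    h_in: float   # indoor humidity Hin (%)
    h_dis: float  # discharge humidity Hdis (%)


def build_state(prev: Sample, cur: Sample,
                target_temp: float, target_hum: float) -> list[float]:
    """Build the input vector from two consecutive samples.

    The rates of change are approximated by a finite difference over the
    sampling interval, which is an assumption of this sketch.
    """
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return [
        cur.t_in - target_temp,        # Tin - target temperature
        (cur.t_in - prev.t_in) / dt,   # dTin/dt
        cur.t_dis,                     # Tdis
        cur.h_in - target_hum,         # Hin - target humidity
        (cur.h_in - prev.h_in) / dt,   # dHin/dt
        cur.h_dis,                     # Hdis
    ]


# Illustrative readings taken 120 seconds apart (made-up numbers).
prev = Sample(t=0.0, t_in=27.0, t_dis=15.0, h_in=55.0, h_dis=80.0)
cur = Sample(t=120.0, t_in=26.4, t_dis=14.0, h_in=52.0, h_dis=78.0)
state = build_state(prev, cur, target_temp=26.0, target_hum=50.0)
```

Air cleanliness and occupant-related terms could be appended to the same vector in the same way.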

In addition, the terminal 300 performs the learning function for the test environment using the deep reinforcement learning neural model, based on the generated input values.

That is, the terminal 300 configures (or defines) the neural network to be used for the deep reinforcement learning neural model. Here, the neural network uses the input values as its input layer. Its output layer consists of two or more set temperatures for adjusting the temperature, and/or two or more set humidities for adjusting the humidity, and/or two or more set air cleanliness levels for adjusting the air cleanliness. Between them are a plurality of hidden layers, each containing a plurality of nodes with weights and bias values obtained through learning.

In addition, the terminal 300 performs the learning function for the test environment through the deep reinforcement learning algorithm.

That is, the terminal 300 performs the learning function in the test environment based on the received input values, through a Q-learning algorithm (or SARSA algorithm) for which initial values and the like have been set.

The specific steps of the deep reinforcement learning are as follows.

Based on the target values, the terminal 300 may initialize and/or declare the variables used by the deep reinforcement learning algorithm, such as the initialization range and the reward range for the temperature and/or humidity (or temperature range and/or humidity range), and the simulation time. Here, the target values include the target temperature, target humidity, target air cleanliness, and the like according to user input. The terminal 300 may set the initialization range so that it lies within the reward temperature, the reward humidity (or the reward temperature range and/or reward humidity range), and/or the reward air cleanliness.

In addition, the terminal 300 initializes (or sets) the episode start score, the previous-episode score, and the new-episode score in the deep reinforcement learning algorithm (step S1515).

In addition, the terminal 300 outputs the current action value (A_t) from the neural network in the deep reinforcement learning algorithm, based on the current state (S_t) (or input values) and the value function value (Q) (hereinafter, step S1520). Here, the current state (S_t) (or input values) may include: the difference between the indoor temperature (for example, Tin) and the target temperature; the instantaneous rate of change of the indoor temperature over a preset time (for example, dTin/dt); the discharge temperature near the discharge port of the air conditioner 100 (for example, Tdis); the difference between the indoor humidity (for example, Hin) and the received target humidity; the instantaneous rate of change of the indoor humidity over a preset time (for example, dHin/dt); the discharge humidity near the discharge port of the air conditioner 100 (for example, Hdis); the moisture amount, such as sweat, of occupants and objects; the difference between the indoor air cleanliness and the target air cleanliness; the instantaneous rate of change of the indoor air cleanliness over a preset time; the air cleanliness near the discharge port of the air conditioner 100; and the air cleanliness around occupants and objects.

That is, from the plurality of output values in the output layer of the neural network configured for the deep reinforcement learning algorithm, the terminal 300 selects one output value (A_t) based on the current state (S_t) (or input values) and the value function value (Q), and controls the operation of the air conditioner 100 through the remote control 400 based on the selected output value.
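A minimal sketch of this selection step is given below. It assumes the two set temperatures used in the example settings of this description (22℃ and 30℃) and picks the action with the largest Q value. The optional `epsilon` exploration parameter is an assumption borrowed from common Q-learning practice; the description does not state how exploration is handled.

```python
import random

# Candidate set temperatures sent to the air conditioner (deg C).
ACTIONS = [22.0, 30.0]


def select_action(q_values, epsilon=0.0, rng=random):
    """Return the index of the chosen action.

    With probability epsilon a random action is taken (assumed exploration
    scheme); otherwise the action with the highest Q value is chosen.
    """
    if len(q_values) != len(ACTIONS):
        raise ValueError("one Q value is required per action")
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Example: the network rates the 30 deg C action higher for this state.
action_index = select_action([0.2, 1.3])
set_temperature = ACTIONS[action_index]
```

The chosen `set_temperature` is what would be transmitted through the remote control in this sketch.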

In addition, after a preset time (for example, 120 seconds) has elapsed, the terminal 300 receives (or acquires) the next state (or next input values) (S_{t+1}) from the data collection unit 200. Here, the next state (S_{t+1}) includes the preceding current state (S_t) and the selected output value (A_t).

In addition, the terminal 300 calculates the reward value (R_t) based on the current state (S_t), the output value (A_t), and the next state (S_{t+1}).

Here, if the state observed at the next time point (t+1) satisfies the temperature and/or humidity and/or air cleanliness set as the reward range, the terminal 300 assigns a positive reward value (R_t) for that state (for example, adds 1 point). If the state is outside the reward range, the terminal 300 may assign a penalty as the reward value (R_t) (for example, -100 points).
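The reward rule above can be written as a short function. The sketch below handles temperature only and takes its default numbers (target 26℃, band ±0.5℃, +1 and -100 points) from the example settings given later in this description. Treating the band edges as inside the reward range is an assumption.

```python
def compute_reward(next_t_in, target=26.0, band=0.5,
                   bonus=1.0, penalty=-100.0):
    """Return the reward for the indoor temperature observed at time t+1."""
    # Inside the reward range (edges included, by assumption): positive reward.
    if abs(next_t_in - target) <= band:
        return bonus
    # Outside the reward range: penalty.
    return penalty


r_inside = compute_reward(26.3)   # within 26 +/- 0.5
r_outside = compute_reward(26.6)  # above the band
```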

In addition, after assigning the reward, the terminal 300 stores a tuple (D) containing the current state (S_t), the output value (A_t), the reward value (R_t), and the next state (S_{t+1}).

In addition, the terminal 300 performs random sampling to select some of the stored tuples (D).

That is, the terminal 300 randomly samples a mini-batch from the stored tuples (D).
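Storing the tuples and drawing a random mini-batch can be sketched with a small replay memory. The class name `ReplayMemory`, the capacity of 1,000 tuples, and the batch size of 8 are illustrative assumptions; the description does not give these values.

```python
import random
from collections import deque, namedtuple

# One stored tuple D = (S_t, A_t, R_t, S_{t+1}).
Transition = namedtuple("Transition",
                        ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # The oldest tuples are discarded once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Draw without replacement; never ask for more than is stored.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=1000)
for step in range(50):
    # Made-up transitions, only to exercise the buffer.
    memory.store([0.1 * step, 0.0, 14.0], step % 2, 1.0,
                 [0.1 * (step + 1), 0.0, 14.0])
batch = memory.sample(8)
```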

In addition, based on the mini-batch data, a loss function (L_j) is calculated as the square of the difference between a target output value (Y_j) and the value function value (Q). For example, the target output value may be calculated according to an expression such as

Y_j = R_j + γ · max_{a'} Q(S_{j+1}, a'; θ)

and the detailed expression may vary depending on the learning model. Here, γ denotes the discount factor, which sets how much of the previously estimated value is reflected, and a' denotes the next action output. In addition, θ denotes the target network parameter used for stable calculation.
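The target and loss calculation can be illustrated numerically. The sketch below assumes the usual deep Q-learning form, in which the target is the reward plus the discounted maximum Q value of the next state as estimated by the target network. The function names, the default discount factor of 0.99, and the `terminal` flag are assumptions for illustration; the description notes that the detailed expression depends on the learning model.

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Target output value Y_j for one sampled tuple.

    next_q_values are the target network's Q values for the next state.
    When the episode ended on this step, only the reward is used
    (a common convention, assumed here).
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_loss(target, q_value):
    """Loss L_j: squared difference between the target and the current Q."""
    return (target - q_value) ** 2


# Worked example: R_j = 1, next-state Q values 2.0 and 3.0, gamma = 0.9.
y = td_target(reward=1.0, next_q_values=[2.0, 3.0], gamma=0.9)  # 1 + 0.9 * 3
loss = squared_loss(y, q_value=3.0)                             # (3.7 - 3)^2
```

In training, this loss would be averaged over the mini-batch and minimized with respect to the network weights and bias values.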

In addition, the terminal 300 may update the biases and weights of each node of the neural network so that the loss function is minimized.

In addition, when the reward value (R_t) is a positive reward (+), it is added to the existing score (Score) to update the score, and the process returns to step S1520 and repeats. Conversely, when the reward value (R_t) is a penalty (-), it is accumulated into the existing score (Score) as a negative value to update the score, and the remote control may be forcibly operated in the appropriate direction to return the out-of-range state (S_{t+1}) to the initial learning condition. If the simulation time has not ended, the process then returns to step S1515 and the next episode begins.
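The score bookkeeping in this step can be sketched as a loop over the rewards received within one episode. The helper names are hypothetical, and the sketch assumes that any negative reward is a penalty that ends the episode, consistent with the example settings in this description.

```python
def update_score(score, reward):
    """Accumulate the reward into the score and report whether the episode ends."""
    new_score = score + reward
    episode_done = reward < 0  # a penalty closes the episode (assumed rule)
    return new_score, episode_done


def run_episode(rewards):
    """Replay a sequence of rewards and return (final score, steps taken)."""
    score = 0
    steps = 0
    for reward in rewards:
        steps += 1
        score, done = update_score(score, reward)
        if done:
            break  # control would return to step S1515 for the next episode
    return score, steps


# Three rewarded steps followed by a penalty: 1 + 1 + 1 - 100 = -97.
final_score, n_steps = run_episode([1, 1, 1, -100, 1])
```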

In addition, after the learning stage performed for the preset time (simulation time) ends normally, the terminal 300 stores the weights, bias values, and the like at the individual nodes of the neural network, which hold the information obtained through learning.

That is, after the learning stage performed for the preset time ends normally, the terminal 300 stores a learning result that includes the weights, bias values, and the like of each individual node.

In addition, the terminal 300 may perform further deep reinforcement learning in test environments with conditions other than the one implemented above, and may store (or manage) a learning result for each test environment, each including the weights, bias values, and the like at each node.

In addition, the terminal 300 may store the learning results held in the terminal 300 (that is, learning results including the weights, bias values, and the like at each node for each of a plurality of test environments) in a memory (not shown) included in the air conditioner 100 and/or in a separate additional terminal.

That is, the terminal 300 may provide (or transfer) the learning results held in the terminal 300 (that is, learning results including the weights, bias values, and the like at each node for each of a plurality of test environments) to the air conditioner 100 and/or to a separate additional terminal.
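One possible way to keep a learning result per test environment and hand it over to another device is a dictionary keyed by environment, written to a file. The sketch below uses JSON for portability. The environment keys, the file name, and the tiny weight and bias lists are placeholders invented for illustration, not trained values or a format prescribed by the description.

```python
import json
import os
import tempfile


def save_results(results, path):
    """Write the per-environment learning results to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)


def load_results(path):
    """Read the per-environment learning results back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Placeholder results: one entry per test environment (hypothetical keys).
results = {
    "constant_28C_3000W": {"weights": [[0.1, -0.2], [0.3, 0.4]],
                           "biases": [0.0, 0.1]},
    "korea_summer": {"weights": [[0.2, -0.1], [0.5, 0.3]],
                     "biases": [0.05, -0.05]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "learning_results.json")
    save_results(results, path)
    restored = load_results(path)

# The receiving device would pick the entry matching its environment.
selected = restored["korea_summer"]
```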

FIG. 2 shows a neural network (artificial neural network model) according to an embodiment of the present invention, comprising one input layer of three nodes, two hidden layers of 24 nodes each, and one output layer of two nodes.

Here, the input layer consists of [Tin(t)-26, dTin/dt, Tdis(t)], where Tin(t) is the indoor temperature, dTin/dt is the instantaneous rate of change of the indoor temperature over a preset time, and Tdis(t) is the discharge temperature near the discharge port of the air conditioner 100. The value '26' is the target temperature of the room. It may be replaced by 27℃, 25℃, or the like according to the user's environment and preferences, and it may also be a target temperature setting that varies with time rather than a fixed value. The numbers of input-layer nodes, hidden layers, nodes in each hidden layer, and output-layer nodes given in this example may be increased or decreased according to the input data, the output data, and the complexity of the learning problem.

In addition, the output layer provides the action output to the remote-controlled air conditioner as the selected set temperature, which is either a low or a high product set temperature (for example, 22℃ or 30℃). The ANN consisting of the input, hidden, and output layers may be built with TensorFlow.
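To make the layer sizes concrete, the following sketch runs a forward pass through a 3-24-24-2 network in plain Python. It is not the TensorFlow model mentioned above. The ReLU activation on the hidden layers, the random initialization, and the example input are assumptions, since the description does not state them.

```python
import random

# Input layer (3 nodes), two hidden layers (24 nodes each), output layer (2 nodes).
LAYER_SIZES = [3, 24, 24, 2]


def init_params(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, state):
    """Return one Q value per action for the given state vector."""
    x = list(state)
    for i, (weights, biases) in enumerate(params):
        # Weighted sum plus bias for every node in the layer.
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        # ReLU on hidden layers only (assumed); the output layer stays linear.
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x


params = init_params(LAYER_SIZES)
# State: [Tin(t) - 26, dTin/dt, Tdis(t)] with made-up values.
q_values = forward(params, [0.4, -0.005, 14.0])
# Count of weights and bias values that learning would adjust.
n_trainable = sum(len(w) * len(w[0]) + len(b) for w, b in params)
```

With these layer sizes the network holds 746 adjustable weights and bias values, which is small enough to store on a modest chipset.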

FIG. 3 shows an example of the real-use environment of an air conditioner corresponding to the precision test chamber environment according to an embodiment of the present invention, and FIG. 4 shows an example of the simulation environment implemented in the precision test chamber of the present invention.

The power consumption of the air conditioner 100 varies with the environmental conditions on both the indoor side and the outdoor side at the same time.

For example, if the outdoor temperature rises and the indoor cooling load becomes high, the instantaneous power input of the air conditioner increases. Accordingly, international test standards such as ASHRAE and ISO require the indoor and outdoor environments (for example, temperature and humidity) to be held accurately fixed for rating tests.

For example, the standard test condition for summer (T1) is a dry-bulb/wet-bulb temperature (DBT/WBT) of 35℃/24℃ on the outdoor side and 27℃/19℃ on the indoor side.

In this embodiment, typical summer conditions in Korea and the Middle East are assumed for the TRNSYS simulation, and the TRNSYS simulation program is used to obtain the indoor building load (q_build(t)) and the outdoor temperature variation (T_out(t)).

Here, the variation of the average temperature and building load over a summer day can be obtained using TRNSYS climate data and analysis. An example of the TRNSYS simulation conditions is shown in [Table 1] below.

In addition, the specifications of the test air conditioner according to an embodiment of the present invention are as shown in [Table 2] below.

In addition, the settings for the deep reinforcement learning algorithm according to an embodiment of the present invention are as follows.

- State: [Tin(t)-26, dTin/dt, Tdis(t)]

- Action: [22℃, 30℃]

- Reward: +1 pt when Tin is within (26±0.5)℃

- Penalty: -100 pt when Tin is outside (26±0.5)℃

- End of each episode: when a penalty is received

- Environment during the simulation time:

Day 1: constant Tout = 28℃, q_build = 3,000 W

Day 2: constant Tout = 28℃, q_build = 3,000 W (learning continued from day 1)

Day 3: Korean summer climate shown in FIG. 4

Day 4: Middle East summer climate shown in FIG. 4

- Reset: Tset is reset to within (26±0.2)℃ when each episode ends.
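The settings listed above can be gathered into a single configuration object, as sketched below. The class and field names are hypothetical. The reset check is applied to the indoor temperature, following the later discussion of FIG. 7, which is an interpretation of the reset item rather than something the list states directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSettings:
    """Example settings for the deep reinforcement learning run."""
    target_temp: float = 26.0      # deg C
    actions: tuple = (22.0, 30.0)  # selectable set temperatures, deg C
    reward_band: float = 0.5       # reward inside target +/- this band
    reset_band: float = 0.2        # restart once inside target +/- this band
    reward_points: int = 1
    penalty_points: int = -100
    environments: tuple = (
        "constant Tout=28C, q_build=3000W",
        "constant Tout=28C, q_build=3000W (continued from day 1)",
        "Korean summer climate (FIG. 4)",
        "Middle East summer climate (FIG. 4)",
    )

    def environment_for_day(self, day):
        """Return the environment description for simulation day 1 to 4."""
        if not 1 <= day <= len(self.environments):
            raise ValueError("day out of range")
        return self.environments[day - 1]

    def is_reset_complete(self, t_in):
        """True once the indoor temperature is back inside the reset band."""
        return abs(t_in - self.target_temp) <= self.reset_band


settings = TrainingSettings()
day3_env = settings.environment_for_day(3)
```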

Here, the recommended air conditioner temperature setting differs from country to country. In Korea, an indoor temperature of 26℃ is recommended. The target temperature need not be fixed at 26℃ and may change over time according to the requirements of the occupied environment or the preferences of the consumer.

Therefore, to compare the effect of reinforcement learning (RL), a remote control without a machine learning algorithm is first set to 26℃, and the pattern of indoor temperature (Tin(t)) variation is observed under the Korean climate, with its daily outdoor temperature (Tout(t)) and building load (q_build(t)).

As shown in FIG. 5, for the remote control without a machine learning algorithm, an inverter-compressor air conditioner was used as the air conditioner 100 to allow fine adjustment of the indoor temperature. Even so, over one day (24 hours) the maximum Tin was 27.08℃ and the minimum was 25.22℃, confirming that the temperature was not precisely controlled. The remote control set temperature was fixed at 26℃, yet the indoor temperature Tin oscillated with a deviation of about ±1℃.
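The reported extremes can be checked against the set temperature with a few lines of arithmetic. The variable names are illustrative. The ±0.5℃ band used for comparison is the reward range from the example settings in this description.

```python
t_set = 26.0   # remote control set temperature (deg C)
t_max = 27.08  # maximum indoor temperature over the day
t_min = 25.22  # minimum indoor temperature over the day

overshoot = round(t_max - t_set, 2)   # how far Tin rose above the set point
undershoot = round(t_set - t_min, 2)  # how far Tin fell below the set point
swing = round(t_max - t_min, 2)       # total peak-to-peak variation

# Both excursions exceed the 0.5 deg C reward band used for learning.
exceeds_band = overshoot > 0.5 and undershoot > 0.5
```

The overshoot is 1.08℃ and the undershoot is 0.78℃, which is consistent with the stated deviation of about ±1℃.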

As shown in FIG. 6, on the first day of deep reinforcement learning according to an embodiment of the present invention, the test environment was held at an outdoor temperature (Tout) of 28℃ and a building load (q_build) of 3,000 W to reduce disturbances. The tested air conditioner is an inverter type with fine power-input adjustment (small sawtooth-like increases or decreases), as seen in FIG. 5. Nevertheless, FIG. 6 shows on/off switching behavior, because the indoor temperature was controlled only by the two actions Tset high (30℃) and Tset low (22℃).

In addition, as shown in FIG. 6, Tin stabilizes after 16 hours. Before 16 hours, the curves of the discharge temperature near the discharge port of the air conditioner 100 (the air conditioner discharge temperature, Tdis) and the power input (Pin) are irregular, and at times show long power-on durations (for example, around 9 to 10 hours and 13 hours). After 16 hours, the power input (Pin) and the discharge temperature (Tdis) show a regular pattern.

FIG. 7 is an enlarged view of the indoor temperature (Tin) according to an embodiment of the present invention.

The red region of the indoor temperature (Tin) is the region in which a reward of +1 point was given. If Tin leaves the range (26 ± 0.5) °C, the current training episode ends. To restart training, the program forcibly lowers the temperature or raises it to the maximum value so that Tin is reset to within (26 ± 0.2) °C.

FIG. 8 shows the score obtained in each episode according to an embodiment of the present invention.

FIG. 8 contains score information only up to 17 hours. The score of the 18th episode is not shown in FIG. 8, because that episode ended near the start of day 4, when the environment changed abruptly to a harsh Middle Eastern climate, and recorded a high score of 1,518 points.

On the second day of deep reinforcement learning, the same environment as on the first day was repeated, with constant Tout (28 °C) and q_build (3,000 W).

FIG. 9 shows the test results for the second day of deep reinforcement learning according to an embodiment of the present invention, and illustrates a very well trained and stabilized state. As shown in FIG. 9, the 18th episode continued throughout the second day of deep reinforcement learning.

FIG. 10 shows the test results for the third day of deep reinforcement learning according to an embodiment of the present invention.

On the third day of deep reinforcement learning, variations in the outdoor temperature (Tout(t)) and in the building load (q_build(t)) were applied as disturbances. In this embodiment, these variations were not used as periodic state input variables of the kind that can typically be set in an LSTM (long short-term memory) learning model. Tout(t) and q_build(t) therefore act as unknown disturbances. As shown in FIG. 10, Tout(t) and q_build(t) reach their minimum at about 6 to 7 a.m. and their maximum at about 3 to 4 p.m.

Despite these disturbances, the indoor temperature (Tin) converges well without change, as shown in FIG. 10. This indicates that the neural network trained by deep reinforcement learning in the constant environment (Tout, q_build) of FIGS. 8 and 9 adapted well to the new environment even though the external environment was disturbed.

The quality of the Tin data over the three days of deep reinforcement learning is compared in FIGS. 11 and 12. Because training stabilized after 16 hours on day 1, the mean and standard deviation of Tin are also stable after 16 hours. The mean does show one high peak at 64 hours, however. This peak is caused by the change in environment (Tout and q_build) on the third day of deep reinforcement learning. [Table 3] below shows the temperature control characteristics (for example, temperature control accuracy) for each day.

[Table 4] below shows the minimum and maximum temperatures.

As [Table 3] and [Table 4] show, the mean and standard deviation of Tin give good results on days 2 and 3 of deep reinforcement learning.

Likewise, the minimum and maximum deviations from the target temperature (26 °C) are smallest on days 2 and 3 of deep reinforcement learning. This indicates that applying deep reinforcement learning according to an embodiment of the present invention can raise the temperature control capability of the product beyond that of normal use or simple thermostat control.

The remote control (or signal control unit) 400 communicates with the air conditioner 100, the data collection unit 200, the terminal 300, and the like.

The remote control 400 may also be the individual control units that control the respective components of the air conditioner 100 (including, for example, a compressor inverter control unit, an indoor-unit fan RPM control unit, an outdoor-unit fan RPM control unit, an electronic expansion valve control unit, and a blade control unit).

The remote control 400 also transmits a target temperature, target humidity, target air cleanliness, and the like according to user input to the air conditioner 100.

The remote control 400 also transmits an output value received from the terminal 300 (or a control signal containing that output value) to the air conditioner 100.

Although this embodiment describes transmitting the output value (or a control signal containing it) to the air conditioner 100 through the remote control 400, the invention is not limited to this. The terminal 300 may also transmit the output value (or a control signal containing it) directly to the air conditioner 100.

In this way, the air conditioner precision learning system 10 performs learning in a static test environment with no disturbances other than the simulation conditions. Once the learning pattern has stabilized to some degree at the target temperature and humidity, the system sets up a test environment with outdoor air and indoor loads close to those of a real site and continues learning in that environment.

In addition, the air conditioner precision control system 10 stores the weights and bias values of the trained neural network. The product (or the air conditioner 100) is then installed in a real environment, where it can either be operated immediately or be given an additional learning period to adapt to the on-site environment.

Although this embodiment describes the data collection unit 200 and the terminal 300 as separate components that each perform specific functions, the invention is not limited to this. The functions of the data collection unit 200 and of the terminal 300 may also be performed in the air conditioner 100 or in the remote control 400.

In this way, learning is repeated under various conditions in a test environment with no disturbances other than the simulation conditions. When the learning pattern stabilizes within the target range, the final learning result for that stabilized pattern is stored. An air conditioner deployed at a real site can then use the stored final learning result as the starting point for adaptive learning in its indoor environment and control its operation accordingly.

Hereinafter, an air conditioner precision learning method using a deep reinforcement learning algorithm according to the present invention is described in detail with reference to FIGS. 13 to 15.

FIGS. 13 and 14 are flowcharts of an air conditioner precision learning method using a deep reinforcement learning algorithm according to an embodiment of the present invention.

First, the air conditioner 100 implements (or sets) a test environment (or external environment) made up of temperature conditions, humidity conditions, and the like, including cooling and heating test conditions. To realize a test environment with no disturbances other than the simulation, the cooling and heating test conditions include variable temperature, variable humidity, and similar conditions for the outdoor side, and variable sensible heat load, variable latent heat load, and similar conditions for the indoor side. The air conditioner 100 may be a device that performs cooling and heating, or a combined air purifier and air conditioner that also has an air purification function.

For example, a first air conditioner 100 implements a first test environment, including temperature conditions, humidity conditions, and the like, in order to perform deep reinforcement learning (S1310).

Next, the air conditioner 100 receives a target temperature, target humidity, and the like according to user input through the remote control 400.

For example, the first air conditioner receives a first target temperature (for example, 26 °C) and a first target humidity (for example, 45%) according to user input through a first remote control 400 linked to the first air conditioner (S1320).

Next, the air conditioner 100 runs (or operates) in the test environment on the basis of the received target temperature, target humidity, and the like.

That is, in the test environment, the air conditioner 100 runs so as to adjust the temperature and/or humidity around the air conditioner 100 according to the received target temperature, target humidity, and the like.

For example, the first air conditioner runs its cooling function in the first test environment according to the received first target temperature (for example, 26 °C) and first target humidity (for example, 45%) (S1330).

Next, the data collection unit 200 collects (or acquires or measures) data related to the air conditioner 100.

For example, as the first air conditioner operates, a first data collection unit 200 calculates (or collects) the real-time power consumption of the first air conditioner, and collects the discharge temperature and discharge humidity near the outlet of the first air conditioner and the indoor temperature and indoor humidity of the room in which the first air conditioner is located (S1340).

Next, the terminal 300 generates an input value to be applied to the deep reinforcement learning neural model on the basis of the data collected by the data collection unit 200 (that is, the data related to the air conditioner 100).

For example, a first terminal 300 generates, as a first input value to be applied to the deep reinforcement learning neural model, the following quantities taken from the collected data: a first difference value between the indoor temperature and the received first target temperature (for example, 26 °C); a first instantaneous slope of the indoor temperature over a preset time; a first discharge temperature near the outlet of the first air conditioner; a second difference value between the indoor humidity and the received first target humidity (for example, 45%); a second instantaneous slope of the indoor humidity over a preset time; and a first discharge humidity near the outlet of the first air conditioner (S1350).
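The input value of step S1350 can be sketched as a six-element vector. This is a minimal illustration, not the patented implementation: the `Sample` type, its field names, the `build_state` helper, and the sign convention (measured minus target) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading from the data collection unit (field names are hypothetical)."""
    t_in: float   # indoor temperature, deg C
    t_dis: float  # discharge temperature near the outlet, deg C
    h_in: float   # indoor relative humidity, %
    h_dis: float  # discharge humidity near the outlet, %


def build_state(prev, cur, dt_s, t_target, h_target):
    """Build the six inputs listed for S1350 from two consecutive samples."""
    return [
        cur.t_in - t_target,            # first difference value (assumed sign)
        (cur.t_in - prev.t_in) / dt_s,  # slope of indoor temperature, deg C per second
        cur.t_dis,                      # discharge temperature
        cur.h_in - h_target,            # second difference value (assumed sign)
        (cur.h_in - prev.h_in) / dt_s,  # slope of indoor humidity, % per second
        cur.h_dis,                      # discharge humidity
    ]
```

The slope is approximated here by a finite difference over the sampling interval `dt_s`; the patent text only says the slope is taken over a preset time.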

Next, the terminal 300 performs learning for the test environment using the deep reinforcement learning neural model on the basis of the generated input value.

That is, the terminal 300 configures (or defines) the neural network to be used for the deep reinforcement learning neural model. The neural network uses the input value as its input layer; one or more set temperatures for adjusting the temperature and one or more set humidities for adjusting the humidity as its output layer; and a plurality of hidden layers, each containing a plurality of nodes whose weights and bias values are determined by learning.

The terminal 300 then performs learning in the test environment on the basis of the received input value, using a Q-learning algorithm (or a SARSA algorithm) for which initial values and the like have been set. The specific steps of this deep reinforcement learning are shown in FIG. 15, described later.

After the learning stage performed for a preset time ends normally, the terminal 300 stores the weights, bias values, and the like of the individual nodes of the neural network, which hold the information obtained by learning.

For example, the first terminal configures the input layer from the first input value; configures the output layer from a first output value of 22 °C and a second output value of 30 °C for adjusting the temperature, and a third output value of 40% and a fourth output value of 50% for adjusting the humidity; and configures two hidden layers of 24 nodes each. Together these form the neural network used for deep reinforcement learning.
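The example network (six inputs, two hidden layers of 24 nodes, four outputs) can be sketched in plain Python as follows. The ReLU activation, the weight initialization range, and the names `ACTIONS`, `init_network`, and `forward` are assumptions made for illustration; the patent does not specify them.

```python
import random

# The four discrete outputs named in the example (labels are hypothetical).
ACTIONS = [
    ("set_temperature_c", 22.0),
    ("set_temperature_c", 30.0),
    ("set_humidity_pct", 40.0),
    ("set_humidity_pct", 50.0),
]


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: a weight row and a bias per node."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def init_network(rng, n_inputs=6, hidden=(24, 24), n_outputs=len(ACTIONS)):
    """Build the 6-24-24-4 layout as a list of (weights, biases) pairs."""
    sizes = [n_inputs, *hidden, n_outputs]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, state):
    """Return one Q value per action for the given input vector."""
    x = list(state)
    for i, (weights, biases) in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(network) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers (assumed)
    return x
```

Each `(weights, biases)` pair is exactly what the text says is stored per node after a learning stage completes.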

The first terminal then performs deep reinforcement learning through the configured neural network.

After the learning stage performed for a preset time ends normally, the first terminal stores the weight, bias value, and the like of each individual node of the neural network, which hold the information obtained by learning.

The first terminal also repeats steps S1310 to S1360 in a plurality of other test environments in addition to the first test environment, and stores the weights, bias values, and the like of each node for each of the test environments (S1360).

Next, the terminal 300 stores the learning result held in the terminal 300 (that is, the learning result containing the weights, bias values, and the like of each node for each of the test environments) in a memory (not shown) included in the air conditioner 100.

That is, the terminal 300 provides (or transfers) the learning result held in the terminal 300 to the air conditioner 100.

The air conditioner 100 then stores the learning result provided by the terminal 300 in the memory provided in the air conditioner 100.

For example, the first terminal transfers to the first air conditioner the learning result containing the weights, bias values, and the like of each node for each of the test environments.

The first air conditioner stores the learning result received from the first terminal in a first memory provided in the first air conditioner (S1370).
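One way to picture the learning result handed over in S1370 is a mapping from test environment to per-layer weights and bias values. The JSON encoding and the function names below are assumptions chosen for illustration; the patent does not state a storage format.

```python
import json


def pack_learning_results(results_by_env):
    """Serialize {environment name: [(weights, biases), ...]} to a JSON string."""
    payload = {
        env: [{"weights": w, "biases": b} for w, b in layers]
        for env, layers in results_by_env.items()
    }
    return json.dumps(payload)


def unpack_learning_results(blob):
    """Restore the per-environment layers on the air conditioner side."""
    payload = json.loads(blob)
    return {
        env: [(layer["weights"], layer["biases"]) for layer in layers]
        for env, layers in payload.items()
    }
```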

Next, the air conditioner 100 is installed (or placed) in a real environment (or real location) by a worker.

For example, the first air conditioner is placed in the living room of Hong Gil-dong's home (S1380).

Next, in the environment where it is actually placed, the air conditioner 100 operates (or runs) at a different target temperature and/or a different target humidity according to user input.

The air conditioner 100 also performs learning according to the deep reinforcement learning algorithm, using the various data collected as it operates in that environment together with the weights and bias values of the individual nodes of the neural network (that is, the learning result). It thereby calculates (or manages) node weights, bias values, and the like that minimize the loss function and are optimized for the real environment, and so optimizes its own operation.

For example, after being placed in the living room of Hong Gil-dong's home, the first air conditioner performs learning according to the deep reinforcement learning algorithm, using a different target temperature entered by Hong Gil-dong (for example, 24 °C) and the learning result (including, for example, the weights and bias values of each node for each of the test environments), and controls its operation so as to minimize the loss function (S1390).

FIG. 15 is a flowchart of a learning method using a deep reinforcement learning algorithm according to an embodiment of the present invention.

First, on the basis of the target values, the terminal 300 sets the reward range of temperature and/or humidity (that is, a temperature range and/or a humidity range) to be used in the deep reinforcement learning algorithm. It also sets the initialization range of temperature and/or humidity used to reset an episode when the state leaves the reward range. The target values include the target temperature, target humidity, and the like according to user input. The terminal 300 may set the temperature and/or humidity initialization range so that it lies inside the reward range (that is, inside the reward temperature range and/or reward humidity range).

The terminal 300 also initializes (or sets) the score to be used in the deep reinforcement learning algorithm.

For example, on the basis of the first target temperature (for example, 26 °C) and first target humidity (for example, 45%) entered by the user through the remote control 400, the first terminal 300 initializes a first temperature range (for example, 26 °C ± 0.2 °C) and a first humidity range (for example, 45% ± 2%) to be used in the deep reinforcement learning algorithm. The first terminal sets the first temperature range so that it lies within a preset reward temperature range (for example, a second temperature range of 26 °C ± 0.5 °C), and sets the first humidity range (for example, 45% ± 2%) so that it lies within a preset reward humidity range (for example, a second humidity range of 45% ± 5%) (S1510).
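The relationship between the two ranges in S1510 can be checked with a small helper. The `Band` class and its method names are hypothetical; the numeric values are the example values given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A symmetric range around a target value (center +/- half_width)."""
    center: float
    half_width: float

    def contains(self, value):
        """True when value lies inside this range."""
        return abs(value - self.center) <= self.half_width

    def within(self, outer):
        """True when this range lies entirely inside the outer range."""
        return (outer.center - outer.half_width <= self.center - self.half_width
                and self.center + self.half_width <= outer.center + outer.half_width)


# Example values from the text: reset ranges nested inside reward ranges.
reset_temp = Band(26.0, 0.2)
reward_temp = Band(26.0, 0.5)
reset_hum = Band(45.0, 2.0)
reward_hum = Band(45.0, 5.0)
```

Keeping the reset range strictly inside the reward range means a freshly reset episode always starts from a state that earns a reward.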

The first terminal also initializes the score to be used in the deep reinforcement learning algorithm to '0' (S1515).

Next, on the basis of the current state (or input value) (S_t) and the Q values, the terminal 300 outputs the current action value (A_t) from the neural network in the deep reinforcement learning algorithm. The current state (or input value) includes the difference between the indoor temperature (for example, Tin) and the target temperature; the instantaneous slope of the indoor temperature over a preset time (for example, dTin/dt); the discharge temperature near the outlet of the air conditioner 100 (for example, Tdis); the difference between the indoor humidity (for example, Hin) and the received target humidity; the instantaneous slope of the indoor humidity over a preset time (for example, dHin/dt); and the discharge humidity near the outlet of the air conditioner 100 (for example, Hdis).

That is, from the plurality of output values in the output layer of the neural network configured in the deep reinforcement learning algorithm, the terminal 300 selects one output value (A_t) on the basis of the current state (or input value) and the Q values, and controls the operation of the air conditioner 100 through the remote control 400 on the basis of the selected output value.

For example, from the plurality of output values in the output layer (for example, the first output value of 22 °C, the second output value of 30 °C, the third output value of 40%, and the fourth output value of 50%), the first terminal selects the first output value (for example, 22 °C) on the basis of the Q values and the first input value. The first input value includes the first difference value between the indoor temperature and the received first target temperature (for example, 26 °C), the first instantaneous slope of the indoor temperature over a preset time, the first discharge temperature near the outlet of the first air conditioner 100, the second difference value between the indoor humidity and the received first target humidity (for example, 45%), the second instantaneous slope of the indoor humidity over a preset time, and the first discharge humidity near the outlet of the first air conditioner.

The first terminal then controls the operation of the first air conditioner through the remote control 400 on the basis of the selected first output value (for example, 22 °C) (S1520).
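Selecting one output value from the Q values in S1520 can be sketched as below. The text only says the choice is based on the current state and the Q values; picking the largest Q value, and the optional epsilon exploration parameter, are assumptions borrowed from common Q-learning practice. The `ACTIONS` labels are hypothetical.

```python
import random

# The four output values from the example (labels are hypothetical).
ACTIONS = [
    ("set_temperature_c", 22.0),
    ("set_temperature_c", 30.0),
    ("set_humidity_pct", 40.0),
    ("set_humidity_pct", 50.0),
]


def select_action(q_values, epsilon=0.0, rng=random):
    """Return the index of the chosen action.

    With probability epsilon a random action is explored (assumed);
    otherwise the action with the largest Q value is taken.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

For instance, `ACTIONS[select_action([0.2, -0.1, 0.05, 0.0])]` yields the 22 °C setting, which the terminal would then send through the remote control.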

Next, after a preset time (for example, 120 seconds) has elapsed, the terminal 300 receives (or acquires) the next current state (or next input value) (S_{t+1}) from the data collection unit 200. The next current state (S_{t+1}) includes the preceding current state (S_t) and the selected output value (A_t).

For example, after the preset time of 120 seconds has elapsed, the first terminal receives the next current state (S_{t+1}) provided by the data collection unit 200 (S1530).

Next, the terminal 300 calculates the reward value (R_t) on the basis of the current state (S_t), the output value (A_t), and the next current state (S_{t+1}).

If, at the next state time point (time t+1), the current state (S_t) satisfies the initialized temperature and/or humidity (that is, the temperature range and/or humidity range), the terminal 300 awards points as the reward value (R_t) for that state (for example, adds +1 point).

If, at the next state time point (time t+1), the current state (S_t) does not satisfy the initialized temperature and/or humidity (that is, the temperature range and/or humidity range), the terminal 300 assigns a penalty as the reward value (R_t) for that state (for example, adds -100 points), ends the episode in progress, and starts a new episode. As the new episode starts, the terminal 300 also initializes the preceding states.

For example, when the current state (S_t) (or the indoor temperature), evaluated on the basis of the current state (S_t), the output value (A_t), and the next current state (S_{t+1}), lies within the second temperature range (for example, 26 °C ± 0.5 °C), the first terminal awards points as the reward value (R_t) for that state (for example, adds 1 point) (S1540).
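The reward rule of S1540 reduces to a range check. The sketch below uses the example values from the text (+1 inside the range, -100 and episode end outside it) with the 26 °C ± 0.5 °C and 45% ± 5% reward ranges; the function name and the optional humidity argument are assumptions.

```python
def reward_and_done(t_in, h_in=None, t_target=26.0, t_tol=0.5,
                    h_target=45.0, h_tol=5.0):
    """Return (reward, episode_ended) for the newly observed indoor state.

    +1 while the state stays inside the reward range; otherwise a
    -100 penalty, and the episode ends so that a new one can start.
    Humidity is checked only when a humidity reading is supplied.
    """
    in_range = abs(t_in - t_target) <= t_tol
    if h_in is not None:
        in_range = in_range and abs(h_in - h_target) <= h_tol
    return (1, False) if in_range else (-100, True)
```

Because each in-range step adds one point and steps are 120 seconds apart, an episode score is roughly the number of control intervals the temperature stayed within range.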

이후, 상기 단말(300)은 상기 보상값(Rt)을 보상한 후, 상기 현재 상태(S_t), 상기 출력값(A_t), 상기 보상값(Rt) 및 상기 다음 현재 상태(S_t+1)를 포함하는 튜플(D)를 저장한다.Thereafter, after the terminal 300 compensates for the compensation value Rt, the current state S _t , the output value A _t , the compensation value Rt and the next current state S _{t + 1} ) Is stored.

As an example, the first terminal stores the tuple (D) including the current state (S_t), the output value (A_t), the reward value (R_t), and the next current state (S_{t+1}) (S1550).
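One possible way to hold the tuples (D) is a bounded buffer, sketched below. The class name `ReplayMemory`, the field names, and the capacity are assumptions for illustration; the text only states that the four elements are stored together.

```python
from collections import deque, namedtuple

# One tuple D = (S_t, A_t, R_t, S_{t+1}); the field names are illustrative.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-size store of transition tuples; the oldest entries are dropped first."""

    def __init__(self, capacity: int = 10_000):  # capacity is an assumed value
        self._buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state) -> None:
        self._buffer.append(Transition(state, action, reward, next_state))

    def __len__(self) -> int:
        return len(self._buffer)

    def __getitem__(self, index: int) -> Transition:
        return self._buffer[index]


memory = ReplayMemory(capacity=3)
# Hypothetical state: (indoor temperature minus target, temperature slope).
memory.store((0.8, -0.1), 2, 1, (0.6, -0.1))
```

Bounding the buffer is a design choice of this sketch, not something the text requires; it keeps memory use constant over a long simulation period.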

Thereafter, the terminal 300 selects some of the plurality of stored tuples (D) and performs random sampling using the selected tuples.

As an example, the first terminal performs random sampling through a mini-batch based on some of the plurality of stored tuples (D) (S1560).
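The mini-batch step can be illustrated with uniform sampling without replacement. The function name `sample_minibatch` and the batch size are hypothetical; the text does not say whether sampling is with or without replacement, so the choice below is an assumption.

```python
import random
from typing import List, Sequence, Tuple

TransitionT = Tuple[tuple, int, float, tuple]  # (S_t, A_t, R_t, S_{t+1})


def sample_minibatch(stored: Sequence[TransitionT], batch_size: int,
                     rng: random.Random) -> List[TransitionT]:
    """Draw a mini-batch uniformly at random, without replacement."""
    # If fewer tuples are stored than requested, use all of them.
    k = min(batch_size, len(stored))
    return rng.sample(list(stored), k)
```

Passing a seeded `random.Random` instance makes a training run repeatable, which can help when comparing results across test environments.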

In addition, based on the mini-batch, the terminal 300 calculates a loss function that is the square of the difference between the target output value (Y_j) associated with the episode (for example, [formula not reproduced]) and the value function value (Q) (S1570). In that formula, the discount factor expresses how much of the past content is reflected, a denotes the learning rate, and q denotes the target network parameter used for stable calculation.
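Because the example formula for Y_j is not available in this text, the sketch below assumes the target commonly used in deep Q-learning: Y_j equals R_j when the episode has ended, and R_j plus the discount factor times the largest target-network value for S_{j+1} otherwise. The function names, the `done` flag added to each sample, and the default discount factor of 0.99 are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
# (S_j, A_j, R_j, S_{j+1}, done); the done flag is an addition of this sketch.
Sample = Tuple[State, int, float, State, bool]
QFunction = Callable[[State], Sequence[float]]  # one Q value per output value


def target_output(reward: float, next_state: State, done: bool,
                  target_q: QFunction, gamma: float) -> float:
    """Assumed target: R_j if done, else R_j + gamma * max_a Q_target(S_{j+1}, a)."""
    if done:
        return reward
    return reward + gamma * max(target_q(next_state))


def minibatch_loss(batch: List[Sample], q: QFunction, target_q: QFunction,
                   gamma: float = 0.99) -> float:
    """Mean over the mini-batch of (Y_j - Q(S_j, A_j))^2."""
    total = 0.0
    for state, action, reward, next_state, done in batch:
        y = target_output(reward, next_state, done, target_q, gamma)
        total += (y - q(state)[action]) ** 2
    return total / len(batch)
```

Evaluating the target with a separate function `target_q` mirrors the role the text gives to the target network parameter: the target stays fixed while the main value function is being updated.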

As an example, when the determination shows that the ongoing episode is being maintained, the first terminal sets the loss function using the final output value (Y_j) associated with that episode.

Thereafter, the terminal 300 updates the bias and weight of each node of the neural network so that the loss function is minimized (S1580).

As an example, the first terminal updates the bias and weight of each node of the neural network stored in the terminal 300 so that the set loss function is minimized.
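The update of weights and biases can be illustrated with a single gradient-descent step. To stay short, the sketch replaces the neural network with a linear value function, Q(S, a) = weights[a] · S + biases[a]; this stand-in, the function name `sgd_step`, and the numeric values in the usage lines are assumptions, not the model described in the text.

```python
from typing import List


def sgd_step(weights: List[List[float]], biases: List[float],
             state: List[float], action: int, target_y: float,
             learning_rate: float) -> float:
    """One gradient-descent step on L = (Y - Q(S, A))^2 for a linear Q model.

    Q(S, a) = weights[a] . S + biases[a]. Returns the loss before the update.
    """
    q_value = sum(w * x for w, x in zip(weights[action], state)) + biases[action]
    error = target_y - q_value
    # dL/dw = -2 * error * x and dL/db = -2 * error, so the step moves along +error.
    for i, x in enumerate(state):
        weights[action][i] += learning_rate * 2.0 * error * x
    biases[action] += learning_rate * 2.0 * error
    return error ** 2


weights = [[0.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0]
first_loss = sgd_step(weights, biases, [1.0, 2.0], 1, 1.0, 0.05)
second_loss = sgd_step(weights, biases, [1.0, 2.0], 1, 1.0, 0.05)
```

Repeating the step on the same sample lowers the loss, which is the behaviour the minimization in S1580 relies on; only the parameters of the selected output value change.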

Thereafter, based on the obtained reward value (R_t), if the reward is positive (+), the terminal 300 accumulates it to update the score (Score) and moves to step S1520 (S1590, S1600). Conversely, if the reward is a penalty (-), the terminal 300 deducts it from the accumulated score and forcibly operates the remote control in the corresponding direction until the state enters the initialization range (S1590, S1600).

As an example, based on the obtained reward value (R_t), if the reward is positive (+), the first terminal accumulates it to update the score (Score) and moves to step S1520. If the reward is a penalty (-), the first terminal deducts it from the accumulated score, and if the currently measured indoor temperature is above the second temperature range and/or the currently measured indoor humidity is below the second humidity range, it outputs the first output value (22℃) and/or the fourth output value through the remote control 400.
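The score update and the forced control after a penalty might be sketched as below. Several points are assumptions: the function names, the pairing of the temperature condition with the first output value and of the humidity condition with the fourth output value, the string labels for the outputs, and the humidity range used in the usage line. Only the case given in the example (temperature too high, humidity too low) is covered.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (lower bound, upper bound)


def update_score(score: float, reward: float) -> float:
    """Accumulate the reward: a positive reward adds, a penalty subtracts."""
    return score + reward


def forced_outputs(reward: float, indoor_temp: float, indoor_humidity: float,
                   temp_range: Range, humidity_range: Range) -> List[str]:
    """Pick the forced remote-control outputs after a penalty.

    Covers only the case in the text: temperature above the second temperature
    range and/or humidity below the second humidity range.
    """
    if reward >= 0:
        return []  # no forced control; normal selection continues at S1520
    outputs = []
    if indoor_temp > temp_range[1]:
        outputs.append("first_output_22C")  # first output value (22 degC)
    if indoor_humidity < humidity_range[0]:
        outputs.append("fourth_output")     # fourth output value (not defined here)
    return outputs


# Assumed ranges: 26 +/- 0.5 degC from the text, 45 to 55 %RH as a placeholder.
commands = forced_outputs(-100, 27.2, 40.0, (25.5, 26.5), (45.0, 55.0))
```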

In addition, when the reward is a penalty (-), the terminal 300 determines (or checks) whether the preset simulation time has ended; if it has ended, the terminal 300 finishes the reinforcement learning process, and if it has not, the terminal 300 returns to step S1515.

As an example, the first terminal determines (or checks) whether the preset simulation time has ended (S1610); if the determination (or check) shows that it has ended, the first terminal finishes the reinforcement learning process, and if it has not, the first terminal returns to step S1515 (S1610).
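The episode and simulation-time control flow can be outlined as a loop. This is a simplified sketch under stated assumptions: simulation time is counted in control steps, the end of the simulation is checked at every step rather than only after a penalty, and `step_fn` is a hypothetical callable that stands for one full control and learning step.

```python
from typing import Callable, List, Tuple


def run_training(simulation_steps: int,
                 step_fn: Callable[[], Tuple[float, bool]]) -> List[float]:
    """Run episodes until the preset simulation time (counted in steps) ends.

    step_fn performs one control/learning step and returns (reward, episode_done).
    Returns the final score of every episode, including one cut off at the end.
    """
    episode_scores: List[float] = []
    score = 0.0                      # episode score initialised for a new episode
    steps_in_episode = 0
    for _ in range(simulation_steps):
        reward, done = step_fn()     # selection, control, reward, update in one call
        score += reward              # accumulate the score (S1590, S1600)
        steps_in_episode += 1
        if done:                     # a penalty ended the episode
            episode_scores.append(score)
            score, steps_in_episode = 0.0, 0
    if steps_in_episode > 0:         # simulation time ended mid-episode (S1610)
        episode_scores.append(score)
    return episode_scores
```

Returning the per-episode scores makes it possible to see whether the scores stabilise over the simulation period.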


As described above, in an embodiment of the present invention, iterative learning is performed under various conditions in a test environment free of disturbances other than the simulation conditions, and when a learning pattern stabilized within the target range is observed, the final learning result according to that stabilized learning pattern is stored. An air conditioner deployed at an actual site then performs adaptive learning according to the indoor and/or outdoor environment, based on the stored final learning result, and controls its operation accordingly. This can improve convenience of use, improve overall system operation efficiency, and allow a flexible response to changes in the external environment.
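Storing the final learning result for each test environment and loading it again as the starting point for field learning could look like the sketch below. The JSON layout, the function names, and the environment identifier `"cooling_constant"` are hypothetical; the text only states that weights and bias values are stored and provided to the air conditioner.

```python
import json
from typing import Dict, List


def add_learning_result(store: Dict[str, dict], environment_id: str,
                        weights: List[List[float]], biases: List[float]) -> None:
    """Record the weights and biases learned in one test environment."""
    store[environment_id] = {"weights": weights, "biases": biases}


def serialize_results(store: Dict[str, dict]) -> str:
    """Turn all stored learning results into a JSON string for transfer."""
    return json.dumps(store, sort_keys=True)


def restore_result(text: str, environment_id: str) -> dict:
    """Read back the parameters for one environment to start field learning."""
    return json.loads(text)[environment_id]


results: Dict[str, dict] = {}
add_learning_result(results, "cooling_constant", [[0.1, -0.2]], [0.05])
payload = serialize_results(results)
initial_parameters = restore_result(payload, "cooling_constant")
```

Keying the results by test environment lets the deployed unit pick the stored parameters that best match its site before it continues learning.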

Those of ordinary skill in the art to which the present invention pertains may modify and vary the foregoing without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within the scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

10: air conditioner precision learning system to which a deep reinforcement learning algorithm is applied
100: air conditioner
200: data collection unit
300: terminal
400: remote control
500: precision test chamber

Claims

A remote control that transmits at least one of a target temperature and a target humidity according to a user input;
An air conditioner that implements a test environment, receives at least one of the target temperature and the target humidity transmitted from the remote control, and is driven in the test environment based on the received target temperature and target humidity;
A precision test chamber in which simulated outdoor and indoor environmental conditions are set in advance in order to perform deep reinforcement learning of the air conditioner;
A data collection unit that collects data related to the air conditioner; and
A terminal that generates, based on the collected data related to the air conditioner, an input value to be applied to a deep reinforcement learning neural model, performs a learning function in the test environment of the precision test chamber using the deep reinforcement learning neural model based on the generated input value, stores, after the learning step performed during a preset simulation period has ended normally, the weights and bias values at a plurality of individual nodes of a neural network containing the information obtained by the learning, and provides the air conditioner with a learning result including the weight and bias value at each node for each test environment,
wherein the air conditioner
is installed in a specific place according to the user's purpose, operates according to a different target temperature, a different target humidity, and a different target air cleanliness according to a user input in the specific environment of the place where the air conditioner is installed, and additionally performs the learning function according to the actual installed field conditions, in accordance with the deep reinforcement learning algorithm, based on the data collected during operation in the specific environment and the stored learning result including the weights and bias values at each node for each test environment; an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied.

The system of claim 1,
wherein the air conditioner
performs learning under constant heating and cooling temperature and humidity conditions or under variable temperature and humidity conditions during the cooling period,
the constant heating and cooling temperature and humidity conditions
are set to implement a test environment free of disturbances other than the simulation conditions, and
the variable temperature and humidity conditions,
in order to simulate in advance the variable field environment before the air conditioner is installed, include variable temperature and variable humidity conditions for the outdoor side and variable sensible heat load and variable latent heat load conditions for the indoor side; an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied.

The system of claim 1,
wherein the data related to the air conditioner
includes at least one of: power consumption; discharge temperature near the outlet of the air conditioner; discharge humidity near the outlet of the air conditioner; air cleanliness at the outlet of the air conditioner; temperature and humidity of the room in which the air conditioner is located; outdoor temperature and outdoor humidity; indoor air cleanliness; indoor sensible heat load and indoor latent heat load; reaction data of occupants; weather forecast; and weather database; an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied.

The system of claim 1,
wherein the input value
includes at least one of: the difference between the indoor temperature in the collected data and the received target temperature; the slope of the instantaneous change of the indoor temperature in the collected data over a preset time; the discharge temperature near the outlet of the air conditioner in the collected data; the difference between the indoor humidity in the collected data and the received target humidity; the slope of the instantaneous change of the indoor humidity in the collected data over a preset time; the discharge humidity near the outlet of the air conditioner in the collected data; the difference between the indoor air cleanliness and the target air cleanliness; the slope of the instantaneous change of the indoor air cleanliness over a preset time; the air cleanliness near the outlet of the air conditioner; status information including location information and temperature of occupants over a past period; and status information including current location information and temperature of occupants; an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied.

The system of claim 1,
wherein the terminal
constructs a neural network to be used for the deep reinforcement learning neural model and performs the learning function for the test environment through the deep reinforcement learning algorithm; an air conditioner precision learning system to which a deep reinforcement learning algorithm is applied.

Implementing a test environment by a precision test chamber;
Receiving, by the precision test chamber, at least one of a target temperature, a target humidity, and a target air cleanliness according to a user input through a remote control;
Driving the air conditioner in the test environment of the precision test chamber based on the received target temperature, target humidity, and target air cleanliness;
Collecting, by a data collection unit, data related to the air conditioner and test environment data of the precision test chamber;
Generating, by a terminal, an input value to be applied to a deep reinforcement learning neural model based on the collected test environment data and the data related to the air conditioner;
Performing, by the terminal, a learning function in the test environment using the deep reinforcement learning neural model based on the generated input value;
Storing, by the terminal, after the learning step performed for a preset time has ended normally, weights and bias values at a plurality of individual nodes of a neural network containing the information obtained by the learning;
Storing, by the air conditioner, a learning result provided from the terminal and including the weights and bias values at each node for each test environment;
Installing the air conditioner in a specific place according to the user's purpose;
Operating the air conditioner according to a different target temperature, a different target humidity, and a different target air cleanliness according to a user input in the specific environment of the place where the air conditioner is installed; and
Additionally performing, by the air conditioner, the learning function according to the actual installed field conditions, in accordance with the deep reinforcement learning algorithm, based on the data collected during operation in the specific environment and the stored learning result including the weights and bias values at each node for each test environment; a precision learning method of an air conditioner to which a deep reinforcement learning algorithm is applied.

delete

The method of claim 6,
wherein the step of performing the learning function in the test environment includes:
Setting, by the terminal, a temperature initialization range, a humidity initialization range, an air cleanliness initialization range, a temperature reward range, a humidity reward range, an air cleanliness reward range, and a simulation period to be used for the deep reinforcement learning algorithm;
Initializing, by the terminal, the episode start score and the episode previous score of the deep reinforcement learning algorithm;
Setting the temperature reward range, the humidity reward range, and the air cleanliness reward range used to determine whether the reward (R_t) is a positive reward or a penalty, independently of the temperature initialization range, the humidity initialization range, and the air cleanliness initialization range, during the simulation period;
Selecting, by the terminal, one output value (A_t) from among the plurality of output values included in the output layer of the neural network configured in the deep reinforcement learning algorithm, based on the current state (S_t) and the value function value (Q), and controlling the operation of the air conditioner through the remote control based on the selected output value;
Receiving, by the terminal, a next current state from the data collection unit after a preset time has elapsed from the time of controlling the remote control;
Calculating, by the terminal, a reward value (R_t) based on the current state (S_t), the output value (A_t), and the next current state (S_{t+1});
Storing, by the terminal, a tuple (D) including the current state (S_t), the output value (A_t), the reward value (R_t), and the next current state (S_{t+1});
Generating, by the terminal, a mini-batch for learning by randomly sampling some of the plurality of stored tuples (D);
Calculating, by the terminal, a target output value (Y_j) based on the mini-batch data for one episode in progress within the preset simulation time;
Calculating, based on the mini-batch, a loss function value (L_j) that is the square of the difference between the target output value (Y_j) and the value function value (Q);
Updating the bias and weight of each node of the neural network so that the loss function value (L_j) is minimized;
Determining, by the terminal within one episode, whether the reward is a positive value or a negative value depending on whether the preset temperature reward range, humidity reward range, and air cleanliness reward range are satisfied;
Updating, by the terminal, the score by accumulating the reward value (R_t) in the existing score;
When the reward value (R_t) is a positive reward (+), accumulating the reward value in the score of the current episode, selecting one output value (A_t) from among the plurality of output values included in the output layer of the neural network based on the current state (S_t) and the value function value (Q), and controlling the operation of the air conditioner through the remote control based on the selected output value (A_t);
When the reward value (R_t) is a penalty (-), accumulating the reward value in the score of the current episode and outputting, through the remote control, the selected output value (A_t) for a specific direction until the state enters the temperature initialization range, the humidity initialization range, and the air cleanliness initialization range;
Recording, by the terminal, the episode start score and the episode previous score when the simulation time end condition is not satisfied, and initializing a new episode score; and
Terminating, by the terminal, the deep reinforcement learning algorithm when the simulation time end condition is satisfied; an air conditioner precision learning method to which a deep reinforcement learning algorithm is applied.