KR102257082B1

KR102257082B1 - Apparatus and method for generating decision agent

Info

Publication number: KR102257082B1
Application number: KR1020200143448A
Authority: KR
Inventors: 팜 투옌 르; 노철균; 민예린; 이동수; 이성령; 정석규
Original assignee: 주식회사 애자일소다
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-05-28
Also published as: WO2022092415A1

Abstract

Disclosed are an apparatus and a method for generating a decision agent. When generation of an optimization and automation model related to corporate decision-making is requested, the invention generates and provides that model. The apparatus comprises a training agent unit and a deploy agent unit.

Description

Device and method for generating decision agents {APPARATUS AND METHOD FOR GENERATING DECISION AGENT}

The present invention relates to an apparatus and method for generating a decision agent and, more particularly, to an apparatus and method that generate a model when generation of an optimization and automation model related to a company's decision-making is requested.

An artificial intelligence (AI) system is a computer system that implements human-level intelligence. Unlike a conventional rule-based smart system, an AI system learns, makes judgments, and becomes smarter on its own.

As artificial intelligence improves recognition rates and understands user preferences more accurately, existing rule-based smart systems are gradually being replaced by deep-learning-based AI systems.

AI technology consists of machine learning (deep learning) and element technologies that use machine learning.

Machine learning (ML) is an algorithmic technology that classifies and learns the features of input data on its own.

Element technologies use machine learning algorithms such as deep learning to simulate functions of the human brain such as recognition and judgment, and span technical fields such as language understanding, visual understanding, inference/prediction, knowledge representation, and motion control.

Meanwhile, reinforcement learning, a type of machine learning, is a program that uses an agent to achieve an optimal goal; it attains optimization by learning from rewards.

However, the concept of reinforcement learning is itself difficult and there are few case studies, so it has not been used sufficiently in agent-based corporate decision-making.

In addition, conventional reinforcement learning agents are configured so that training proceeds through complex coding work performed by a user with specialized knowledge, using a simulator or the like, which makes them difficult to use for users without expertise in reinforcement learning.

Furthermore, with conventional reinforcement learning agents, once the task of generating an optimization or automation model has been performed, progress information on the subsequent process cannot be checked.

Korean Patent Application Publication No. 10-2019-0087635 (Title of invention: Method and Apparatus for Automated Decision Making)

To solve these problems, an object of the present invention is to provide an apparatus and method for generating a decision agent that generate a model when generation of an optimization and automation model related to a company's decision-making is requested.

To achieve the above object, an embodiment of the present invention is an apparatus for generating a decision agent, comprising: a training agent unit that generates and trains an arbitrary reinforcement learning target model based on input data for a business domain, reflecting user setting data in the training of the reinforcement learning target model; and a deploy agent unit that deploys the reinforcement learning model generated by the training agent unit.

The apparatus according to the embodiment further comprises a user interface unit that outputs, to the training agent unit, setting data consisting of at least one of: environment elements of reinforcement learning entered by the user through drag and drop or coding; optimization data for increasing the training speed and performance of reinforcement learning; and reinforcement learning algorithm selection data.

The user interface unit according to the embodiment requests the training agent unit to retrieve training information for a reinforcement learning model deployed through the deploy agent unit, and converts the evaluation data and monitoring data of the retrieved reinforcement learning model into visual form for display.

The training agent unit according to the embodiment comprises: a data storage unit that stores input business domain data and outputs, based on the stored data, environment information for generating a reinforcement learning model and performing reinforcement learning; a built-in model unit that generates a reinforcement learning model based on the stored data; a customize function unit that configures the environment information for reinforcement learning so that a reward function entered by the user is reflected; a training unit that generates an agent using the reinforcement learning model based on the environment information and the reward function, and trains the generated agent; and a trained agent storage unit that stores the reinforcement learning model trained through the agent, wherein the training unit generates a plurality of agents to train reinforcement learning models in parallel, and visualizes and displays the reward performance for each training run of each agent.

The training agent unit according to the embodiment further comprises: a catalyst learning unit that sets optimization information for increasing the training speed and performance of reinforcement learning, and automatic reward information that makes reinforcement learning easier to apply; and an RL algorithm unit that selects a reinforcement learning algorithm for training the agent according to the business domain.

The training agent unit according to the embodiment further comprises: a trained model unit that provides a model with which an agent can be trained quickly by using a previously trained model; an explainable AI model unit that provides models for domains that require explanations of decisions; and a generative AI model unit that provides a model that takes data with missing values and generates data in which the missing values are replaced using the existing data distribution.

An embodiment of the present invention also provides a method comprising: a) a step in which the training agent unit performs exploratory data analysis (EDA) on an arbitrary business domain, based on data for that business domain, in order to set up the environment for applying reinforcement learning and to proceed with training; b) a step in which the training agent unit designs a workspace as user settings are entered through the user interface unit; c) a step in which the training agent unit defines the environment information and reward function, entered through the user interface unit, for generating a reinforcement learning agent, and trains an agent generated using a reinforcement learning model based on the environment information and reward function; and d) a step in which the deploy agent unit deploys the reinforcement learning model generated by the training agent unit, wherein in step c) the training agent unit generates a plurality of agents and trains reinforcement learning models in parallel.

Step d) according to the embodiment further comprises a step in which the training agent unit analyzes the agent's configuration management information, the reinforcement learning model, and training parameter information, and outputs them through the user interface unit.

In step b) according to the embodiment, previous experiment results are searched in the designed workspace, and the experiments are sorted, viewed, and analyzed according to the search results.

In step c) according to the embodiment, at least part of the environment information and reward function set through the user interface unit is set automatically based on the metadata of the business domain data set.

In step c) according to the embodiment, the reward performance for each training run of each agent is visualized and displayed through the user interface unit.

The present invention has the advantage of being able to generate and provide optimization and automation models related to a company's decision-making.

The present invention also has the advantage that a user without knowledge of reinforcement learning can easily configure and apply the core elements of machine learning to carry out training.

The present invention also has the advantage that a reinforcement learning agent can be created easily with only the user's domain knowledge and general machine learning knowledge.

The present invention also has the advantage that a high-level decision agent can be generated with minimal effort by building a variety of reinforcement learning designs for a decision-making problem.

FIG. 1 is a block diagram schematically illustrating an apparatus for generating a decision agent according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the training agent unit of the apparatus for generating a decision agent according to the embodiment of FIG. 1.
FIG. 3 is a flowchart illustrating a method for generating a decision agent according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating the environment setting process of the method for generating a decision agent according to the embodiment of FIG. 3.
FIG. 5 is an exemplary diagram illustrating training results of the method for generating a decision agent according to the embodiment of FIG. 3.
FIG. 6 is another exemplary diagram illustrating the training result search screen for completed experiments in the method for generating a decision agent according to the embodiment of FIG. 3.
FIG. 7 is an exemplary diagram illustrating the component design process of the method for generating a decision agent according to the embodiment of FIG. 3.
FIG. 8 is another exemplary diagram illustrating the component design process of the method for generating a decision agent according to the embodiment of FIG. 3.

Hereinafter, the present invention is described in detail with reference to preferred embodiments and the accompanying drawings, on the premise that like reference numerals in the drawings denote like elements.

Before describing the specifics of carrying out the present invention, it should be noted that configurations not directly related to the technical gist of the invention have been omitted to the extent that doing so does not obscure that gist.

Terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

In this specification, stating that a part "includes" a component means that it may further include other components, not that other components are excluded.

Terms such as "... unit", "... device", and "... module" denote a unit that processes at least one function or operation, and such a unit may be implemented as hardware, software, or a combination of the two.

The term "at least one" is defined to include both the singular and the plural; even where the term is absent, it is self-evident that each component may exist in, and may denote, the singular or the plural.

Whether each component is provided singly or in plurality may vary depending on the embodiment.

Hereinafter, preferred embodiments of the apparatus and method for generating a decision agent according to an embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically illustrating an apparatus for generating a decision agent according to an embodiment of the present invention, and FIG. 2 is a block diagram showing the configuration of the training agent unit of the apparatus according to the embodiment of FIG. 1.

Referring to FIGS. 1 and 2, the apparatus for generating a decision agent according to an embodiment of the present invention may comprise a training agent unit 100, a deploy agent unit 200, and a user interface unit 300 so as to generate and provide optimization and automation models related to a company's decision-making.

The training agent unit 100 generates and trains an arbitrary reinforcement learning target model based on input data for a business domain.

The business domain may include states and rewards, and may be the input to which the agent responds and the knowledge provided to the agent. For example, in the case of automating an automobile manufacturing process, it is the business information that must be known for modeling, such as the steps and materials of the manufacturing process.

The training agent unit 100 also trains the reinforcement learning model by reflecting user setting data, entered through the user interface unit 300, in the training of the reinforcement learning target model.

The training agent unit 100 also outputs the trained reinforcement learning model to the deploy agent unit 200, and may comprise a data storage unit 110, a built-in model unit 111, a customize function unit 112, a catalyst learning unit 113, an RL algorithm unit 114, a trained model unit 115, an explainable AI model unit 116, a generative AI model unit 117, a training unit 118, and a trained agent storage unit 119.

The data storage unit 110 stores input business domain data (for example, states, rewards, actions, real-time observations, and batch observations) and, based on the stored data, outputs environment information with which the training agent unit 100 generates a reinforcement learning model and performs reinforcement learning.

The built-in model unit 111 generates a reinforcement learning model based on the data stored in the data storage unit 110.

When generating various reinforcement learning models, the built-in model unit 111 may also use not only the stored data but also a specific layer, such as the catalyst layer, in combination, in order to prepare the environment for reinforcement learning and to build in the reinforcement learning model.

The customize function unit 112 configures the environment information for reinforcement learning so that a reward function entered by the user is reflected.

That is, so that a reward function, objective, and the like desired by the user and different from conventional ones can be reflected, the customize function unit 112 may be configured to allow any one of the following to be set: a user-customized reward; a wizard-style reward; and an auto reward that makes use of ground-truth answers, which the user can use for simple training, as a reinforcement learning baseline, and for verification.
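By way of illustration only, the following minimal sketch shows one way a user-supplied reward function could be reflected in the environment settings, with an auto reward based on a labeled answer as the default baseline. The names `EnvConfig`, `auto_reward`, and `profit_reward`, and the fields `label` and `profit`, are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A reward function maps (state, action) to a scalar reward.
RewardFn = Callable[[Dict[str, float], int], float]


def auto_reward(state: Dict[str, float], action: int) -> float:
    """Auto reward: +1 when the action matches the labeled answer, else -1."""
    return 1.0 if action == int(state["label"]) else -1.0


@dataclass
class EnvConfig:
    """Environment settings that carry the reward function used in training."""

    reward_fn: RewardFn = auto_reward  # baseline used when the user sets nothing

    def customize(self, reward_fn: RewardFn) -> "EnvConfig":
        """Return a new configuration that reflects the user's reward function."""
        return EnvConfig(reward_fn=reward_fn)


def profit_reward(state: Dict[str, float], action: int) -> float:
    """Hypothetical user-defined reward: the profit earned when approving (action 1)."""
    return state["profit"] if action == 1 else 0.0


baseline = EnvConfig()
custom = baseline.customize(profit_reward)
sample = {"label": 1.0, "profit": 2.5}
print(baseline.reward_fn(sample, 0), custom.reward_fn(sample, 1))
```

Keeping the reward function as a replaceable field lets the same environment data be reused with either the baseline or the customized reward.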

The catalyst learning unit 113 sets optimization information for increasing the training speed and performance of reinforcement learning, and automatic reward information that makes reinforcement learning easier to apply.

That is, the catalyst learning unit 113 allows the agent's quick understanding of simulated models, good state construction, optimal architecture configuration, and an automatic reward function scheme to be set through the user interface unit 300.

The catalyst learning unit 113 also performs preprocessing according to the type of state, such as structured data, image data, or text data, and uses algorithms to automatically avoid problems such as dimensional overfitting for a given state.

That is, it can construct states automatically and perform various preprocessing through modules for missing value imputation, continuous variables, categorical variables, dimensionality reduction, variable selection, outlier removal, image denoising, data augmentation, resizing, tokenization, filtering, cleansing, and the like.
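By way of illustration only, three of the preprocessing modules listed above (missing value imputation, outlier removal, and categorical variable encoding) could be sketched as follows for structured data. The specific choices of median imputation, an interquartile-range rule, and one-hot encoding are assumptions made for this example; the specification does not state which methods are used.

```python
from statistics import median, quantiles
from typing import List, Optional


def impute_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def remove_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


def one_hot(categories: List[str]) -> List[List[int]]:
    """Encode a categorical variable as one indicator column per level."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


filled = impute_missing([1.0, None, 3.0])
cleaned = remove_outliers([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
encoded = one_hot(["a", "b", "a"])
print(filled, cleaned, encoded)
```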

The catalyst learning unit 113 may also search for an optimal neural network architecture through reinforcement learning, evolutionary methods, Bayesian optimization, gradient-based optimization, and the like.

The catalyst learning unit 113 may also search over the hyperparameters that strongly affect agent performance using grid search, Bayesian optimization, gradient-based optimization, and population-based optimization, and may provide the optimal hyperparameter combination based on the search results.
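By way of illustration only, the simplest of the search strategies named above, grid search, could be sketched as follows. The function `toy_score` is a hypothetical stand-in for the mean reward obtained after training an agent with the given hyperparameters, and the parameter names `lr` and `gamma` are assumptions.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(
    grid: Dict[str, List[float]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Evaluate every combination in the grid and return the best one with its score."""
    names = list(grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, float]) -> float:
    """Stand-in for agent performance; highest at lr=0.01 and gamma=0.99."""
    return -abs(params["lr"] - 0.01) - abs(params["gamma"] - 0.99)


best, score = grid_search({"lr": [0.1, 0.01, 0.001], "gamma": [0.9, 0.99]}, toy_score)
print(best)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter; the other listed strategies trade exhaustiveness for fewer evaluations.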

The catalyst learning unit 113 may also cause the rewards required for reinforcement learning to be set automatically according to preset reward patterns.

The RL algorithm unit (Reinforcement Learning Algorithm) 114 selects a reinforcement learning algorithm for training the agent according to the business domain.

That is, the RL algorithm unit 114 may select model-free reinforcement learning, model-based algorithms, hierarchical RL algorithms, multi-agent algorithms, and the like.

Here, model-free reinforcement learning may consist of value-based algorithms such as DQN (Deep Q-Networks), DDQN (Double Deep Q-Networks), and Dueling DDQN (Dueling Double Deep Q-Networks); actor-critic (AC) based algorithms such as A2C (Advantage Actor-Critic), TRPO (Trust Region Policy Optimization), PPO (Proximal Policy Optimization), DDPG (Deep Deterministic Policy Gradient), and SAC (Soft Actor-Critic); and policy-based algorithms such as DPS (Direct Policy Search).
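By way of illustration only, the difference between two of the value-based algorithms named above, DQN and DDQN, lies in how the bootstrap target is computed. The sketch below uses plain lists of Q-values in place of neural network outputs; the function and argument names are assumptions made for this example.

```python
from typing import List


def dqn_target(reward: float, gamma: float, q_target_next: List[float], done: bool) -> float:
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(
    reward: float,
    gamma: float,
    q_online_next: List[float],
    q_target_next: List[float],
    done: bool,
) -> float:
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Q-values for the next state from the online and target networks (made-up numbers).
online_next, target_next = [1.0, 2.0], [3.0, 0.5]
y_dqn = dqn_target(1.0, 0.9, target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, online_next, target_next, done=False)
print(y_dqn, y_ddqn)
```

Separating action selection from action evaluation is what lets Double DQN reduce the overestimation that the plain maximum tends to introduce.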

Unlike model-free reinforcement learning, model-based reinforcement learning provides algorithms in which learning takes place with information about the environment available. These algorithms train the agent using a transition model, and may include Dyna, PILCO (Probabilistic Inference for Learning Control), MCTS (Monte Carlo Tree Search), World Models, and the like.

Model-based reinforcement learning also uses both real data and data from a simulated environment when updating the policy, and may train the transition model on real data or use a mathematical model such as an LQR (Linear Quadratic Regulator).
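By way of illustration only, the use of both real and simulated data can be sketched with tabular Dyna-Q, a well-known member of the Dyna family. The sketch assumes a deterministic environment and updates a value table rather than an explicit policy; the toy environment `chain_step` and all parameter values are hypothetical.

```python
import random
from collections import defaultdict


def chain_step(state, action):
    """Toy 4-state chain: reaching state 3 pays reward 1 and ends the episode."""
    nxt = min(3, max(0, state + action))
    done = nxt == 3
    return (1.0 if done else 0.0), nxt, done


def dyna_q(step, start, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: learn from real steps, then replay steps from a learned model."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(state, action)]
    model = {}              # learned transition model: (s, a) -> (r, s', done)

    def update(s, a, r, s2, done):
        best_next = 0.0 if done else max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:  # greedy action, with ties broken at random
                a = max(actions, key=lambda b: (q[(s, b)], rng.random()))
            r, s2, done = step(s, a)          # real experience
            update(s, a, r, s2, done)
            model[(s, a)] = (r, s2, done)     # record it in the transition model
            for _ in range(planning_steps):   # simulated experience from the model
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return q


q_values = dyna_q(chain_step, start=0, actions=[-1, 1])
print(q_values[(0, 1)] > q_values[(0, -1)])
```

Each real step triggers several planning updates, which is how the learned transition model reduces the amount of real interaction needed.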

When a business domain is too complex for a single agent to solve the problem, the hierarchical RL algorithm provides a structure in which agents are arranged across several layers, the agents in each layer learn with their own reinforcement learning algorithms, and they can assist the training of a master agent.

When a plurality of agents exist in a single environment, the multi-agent algorithm provides algorithms in which the agents learn through competition or cooperation with one another.

The RL algorithm unit 114 may also provide: algorithms that train the agent in the manner of supervised learning, or that recover the reward function in reverse from a labeled data set and use it to learn from unlabeled data sets; meta-RL algorithms such as LSTM (Long Short-Term Memory), MAML (Model-Agnostic Meta-Learning), and MQL (Meta-Q-Learning); batch RL algorithms that learn from offline data in business domains where real-time interaction with the environment is difficult; and algorithms using A2GAN.

The trained model unit 115 provides a model with which an agent can be trained quickly by using a model previously trained from the trained agent 500, so that existing trained models can be reused.

The explainable AI model unit (Explainable AI Models) 116 explains whether reinforcement learning models are working well once they are in operation, or provides models for domains that require explanations of decisions.

That is, because neural network algorithms, including reinforcement learning, lack explanatory power regarding their learned results, the explainable AI model unit 116 provides models for domains that require explanations of decisions.

The generative AI model unit (Generative Models) 117 provides a model that takes data with missing values and generates data in which the missing values are replaced, using the existing data distribution.

That is, it provides a model that generates data with the missing values replaced using the existing data distribution, can augment data to address data scarcity, and can also label data that has no ground-truth answers so that it can be provided as a model with answers.
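By way of illustration only, replacing missing values and augmenting data from the existing data distribution could be sketched as follows. A Gaussian fitted to the observed values stands in for a learned generative model here; that substitution, and the names `fit_gaussian`, `impute_by_sampling`, and `augment`, are assumptions made for this example.

```python
import random
from statistics import mean, stdev
from typing import List, Optional, Tuple


def fit_gaussian(values: List[Optional[float]]) -> Tuple[float, float]:
    """Estimate the mean and standard deviation from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return mean(observed), stdev(observed)


def impute_by_sampling(values: List[Optional[float]], seed: int = 0) -> List[float]:
    """Replace each missing value with a draw from the fitted distribution."""
    rng = random.Random(seed)
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


def augment(values: List[Optional[float]], n_new: int, seed: int = 0) -> List[float]:
    """Generate additional synthetic samples from the same fitted distribution."""
    rng = random.Random(seed)
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n_new)]


column = [1.0, None, 3.0, 2.0]
filled_column = impute_by_sampling(column)
extra_samples = augment(column, n_new=5)
print(filled_column, len(extra_samples))
```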

The training unit 118 generates an agent using the reinforcement learning model based on the environment information and the reward function, and trains the generated agent.

The training unit 118 also generates a plurality of agents and trains reinforcement learning models in parallel.

That is, because generating a single reinforcement learning model takes a long time, several reinforcement learning models are created and then trained in parallel.
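By way of illustration only, training several agents in parallel and collecting the reward obtained by each could be sketched as follows. A thread pool is used for brevity, which is an assumption of this example; compute-bound training would more likely use separate processes or machines. The function `train_agent` is a hypothetical stand-in that produces synthetic rewards instead of running real training.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def train_agent(agent_id: int, episodes: int = 20) -> Dict[str, object]:
    """Stand-in for one training run; returns the reward recorded in each episode."""
    rng = random.Random(agent_id)  # a separate seed per agent
    rewards = [rng.random() + 0.05 * ep for ep in range(episodes)]
    return {"agent_id": agent_id, "rewards": rewards}


def train_in_parallel(n_agents: int) -> List[Dict[str, object]]:
    """Run one training job per agent concurrently; results keep the agent order."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        return list(pool.map(train_agent, range(n_agents)))


results = train_in_parallel(4)
for run in results:
    rewards = run["rewards"]
    # Mean reward per agent; these numbers are what a reward chart would plot.
    print(run["agent_id"], round(sum(rewards) / len(rewards), 3))
```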

The training unit 118 also stores the trained reinforcement learning model in the trained agent storage unit 119 and, when additional training is needed, causes the reinforcement learning model stored in the trained agent storage unit 119 to be retrained.

The training unit 118 also visualizes the reward performance for each training run of each agent and displays it through the user interface unit 300, so that the user can more easily follow the training progress of the reinforcement learning model.

The trained agent storage unit 119 stores the reinforcement learning model trained through the agent in the training unit 118.

The deployment agent unit 200 deploys the reinforcement learning model generated by the training agent unit 100 to the business domain unit 400.

That is, when the business domain unit 400, connected through an API (Application Programming Interface) service, requests a reinforcement learning model for an arbitrary business domain, for example finance, energy, logistics, materials, marketing, robotics, system control, or variable pricing, the deployment agent unit 200 allows an optimized and automated reinforcement learning model to be deployed in response.

The user interface unit 300 outputs, to the training agent unit 100, setting data consisting of at least one of: environment elements of reinforcement learning entered by the user through drag and drop or coding, optimization data for increasing the training speed and performance of reinforcement learning, and reinforcement learning algorithm selection data.

That is, the user interface unit 300 allows the user to selectively configure environment information, rewards, and the like whenever needed in order to generate a new reinforcement learning model.

In addition, the user interface unit 300 allows information to be sampled from the task the user wishes to perform and, for simulation, allows setting values and the like to be used; thus, instead of building a simulator as the environment, the data itself serves as the environment, and settings are made by sampling from that data.
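
The idea of using the data itself as the environment, rather than a simulator, can be sketched as below. This is a minimal illustration under stated assumptions: the class `DataEnvironment`, its `reset`/`step` interface, the loan-approval records, and `loan_reward` are all hypothetical and are not taken from the specification.

```python
import random


class DataEnvironment:
    """Treats a logged data set as the environment: each step samples one record."""

    def __init__(self, records, state_keys, reward_fn, seed=0):
        self.records = records
        self.state_keys = state_keys
        self.reward_fn = reward_fn
        self.rng = random.Random(seed)
        self.current = None

    def reset(self):
        """Sample a record and expose only its state columns."""
        self.current = self.rng.choice(self.records)
        return tuple(self.current[k] for k in self.state_keys)

    def step(self, action):
        """Score the action against the current record, then sample the next one."""
        reward = self.reward_fn(self.current, action)
        next_state = self.reset()
        return next_state, reward


# Hypothetical loan data: action 1 = approve, action 0 = reject.
records = [
    {"score": 700, "defaulted": False},
    {"score": 520, "defaulted": True},
]


def loan_reward(record, action):
    if action == 0:
        return 0.0
    return -1.0 if record["defaulted"] else 1.0


env = DataEnvironment(records, ["score"], loan_reward)
state = env.reset()
next_state, reward = env.step(1)
print(state, reward, next_state)
```

No transition dynamics are modeled here; each step simply draws another record, which is the simplest reading of "sampling from the data".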

By moving away from conventional simulator-based coding and performing tasks through clicks in the user interface unit 300, the user can intuitively understand both the task being performed and its result.

In addition, the user interface unit 300 may request the training agent unit 100 to set any one of: a customized reward that is defined and used by the user; a wizard reward that uses variables present in the data or each company's KPIs (Key Performance Indicators) with adjustable weights; and an auto reward that the user can use for simple training and for checking a reinforcement learning baseline.
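
One plausible reading of the weight-adjusted wizard reward is a weighted sum over named data columns or KPIs. The sketch below assumes that reading; the helper `make_wizard_reward` and the KPI names `revenue` and `cost` are hypothetical.

```python
def make_wizard_reward(weights):
    """Build a reward function as a weighted sum of named KPI columns.

    Assumption: "weight adjustment" means a linear combination of KPIs.
    """
    def reward(record):
        return sum(weight * record[name] for name, weight in weights.items())
    return reward


# Hypothetical weights a user might set with sliders in a wizard screen.
wizard_reward = make_wizard_reward({"revenue": 1.0, "cost": -0.5})
value = wizard_reward({"revenue": 10.0, "cost": 4.0})
print(value)
```

Changing a weight changes the trade-off between KPIs without the user writing any reward code.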

In addition, the user interface unit 300 requests the training agent unit 100 to retrieve training information for the reinforcement learning model deployed through the deployment agent unit 200, and visually converts and displays monitoring data and evaluation data, the latter including interpretation of the retrieved model's results.

Next, a method of generating a decision agent according to an embodiment of the present invention is described.

FIG. 3 is a flowchart illustrating a method of generating a decision agent according to an embodiment of the present invention.

Referring to FIGS. 1 to 3, in the method of generating a decision agent according to an embodiment of the present invention, the training agent unit 100 receives and stores uploaded data for an arbitrary business domain, and performs exploratory data analysis (EDA) on the business domain in order to configure the environment for applying reinforcement learning and to proceed with training (S100).

That is, in step S100, data is stored in the data storage unit 110 by drag and drop, summary statistics and graphs are provided for each variable, an incorrect metadata specification is corrected if present, and per-variable visualization and the like are performed.
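
The per-variable summary statistics mentioned for step S100 could look like the following sketch. The choice of statistics (count, missing count, mean, standard deviation), the function name `summarize`, and the sample rows are assumptions; the specification does not list which statistics are shown.

```python
import statistics


def summarize(rows):
    """Return count, missing count, mean, and standard deviation per numeric column.

    Assumes every column is numeric and has at least one non-missing value.
    """
    summary = {}
    for name in rows[0].keys():
        present = [row[name] for row in rows if row[name] is not None]
        summary[name] = {
            "count": len(present),
            "missing": len(rows) - len(present),
            "mean": statistics.mean(present),
            "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
        }
    return summary


# Hypothetical uploaded data with one missing price.
rows = [
    {"demand": 10, "price": 2.0},
    {"demand": 14, "price": None},
    {"demand": 12, "price": 3.0},
]
report = summarize(rows)
for column, stats in report.items():
    print(column, stats)
```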

After step S100, as user settings are entered through the user interface unit 300, the training agent unit 100 designs a workspace (S200).

As shown in FIG. 4, on the environment setting screen 600, the user can set his or her own workspace through the workspace input unit 610 and select the data set to be used in the current experiment through the data search unit 620.

In addition, the results of the current experiment may be displayed with information such as performance values sorted by Trial No (630) and Reward (640).

In addition, in step S200, the user may create and analyze experiments.

An experiment may also be analyzed through the learning result screen 700 shown in FIG. 5 and the learning result search screen 710 for completed experiments shown in FIG. 6.

Subsequently, the training agent unit 100 defines the environment information and reward function, entered through the user interface unit 300, for generating a reinforcement learning agent, and trains the agent generated using a reinforcement learning model based on the environment information and the reward function (S300).

In addition, at least part of the environment information and reward function set through the user interface unit 300 in step S300 may be set automatically based on the metadata of the business domain data set.

That is, the parts the user would otherwise set are filled in automatically based on the metadata of the data set, and the user can check the state and statistics of the data used in the current experiment.
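
A minimal sketch of metadata-driven defaults is given below. It assumes, purely for illustration, that the metadata tags each column with a `role`; the specification does not describe the metadata schema, so the `role` field, the function `default_settings`, and the column names are all hypothetical.

```python
def default_settings(metadata):
    """Propose state, action, and reward columns from per-column role tags."""
    settings = {"state": [], "action": [], "reward": []}
    for column, info in metadata.items():
        role = info.get("role")
        if role in settings:  # columns with other roles are left out
            settings[role].append(column)
    return settings


# Hypothetical metadata for an inventory data set.
metadata = {
    "inventory": {"dtype": "int", "role": "state"},
    "order_qty": {"dtype": "int", "role": "action"},
    "profit": {"dtype": "float", "role": "reward"},
    "row_id": {"dtype": "int", "role": "ignore"},
}
proposed = default_settings(metadata)
print(proposed)
```

The proposed settings are only a starting point that the user could then adjust in the interface.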

In addition, in step S300, the user can define the components for generating a reinforcement learning agent, for example environment information such as state, action, and reward, click by click through the user interface unit 300.

In addition, so that the reward function, objective, and the like desired by the user can be reflected, unlike in the prior art, the reward may be configured through a customized reward setting such as the user-defined reward screen 800 shown in FIG. 7, through a wizard-style reward setting such as the wizard-based reward screen 810 shown in FIG. 8, or by a method that uses the correct answers.

That is, the user does not code the entire reward function; when the user defines only a specific reward function, it is combined internally and operated, so that even a user without specialized knowledge can easily define a reward by clicking in the user interface unit 300.
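
How a user-defined term might be "combined internally" is not specified; the sketch below assumes the simplest scheme, in which the user's term is added to terms supplied by the platform. The names `combine_rewards`, `builtin_step_penalty`, and `user_profit_term` are hypothetical.

```python
def combine_rewards(*terms):
    """Return a reward function that sums the given reward terms."""
    def total(state, action):
        return sum(term(state, action) for term in terms)
    return total


def builtin_step_penalty(state, action):
    # Hypothetical platform-provided term: a small fixed cost per step.
    return -0.01


def user_profit_term(state, action):
    # The only part the user writes in this sketch.
    return state["margin"] * action


reward_fn = combine_rewards(builtin_step_penalty, user_profit_term)
value = reward_fn({"margin": 2.0}, 3)
print(value)
```

Other combination rules, such as weighting or clipping, would fit the same interface.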

In addition, in step S300, the training agent unit 100 generates a plurality of agents and trains the reinforcement learning model in parallel.

That is, the training agent unit 100 uses a parallel processing function to generate various reinforcement learning models and agents and train them simultaneously, and each time an agent is trained, the reward performance value is visualized and displayed through the user interface unit 300, so that the user can check and analyze it intuitively.

After step S300, the deployment agent unit 200 deploys the reinforcement learning model generated by the training agent unit 100 to the business domain unit 400 (S400).

In addition, the training agent unit 100 provides the agent's configuration management information through the user interface unit 300 so that the information of a specific agent can be reused.

In addition, in step S400, the reinforcement learning model and training parameter information for the experiment are analyzed and output through an interactive visualization function of the user interface unit 300, so that the user can subsequently apply them to his or her own domain.

Accordingly, optimization and automation models related to corporate decision-making can be generated and provided, and a user without knowledge of reinforcement learning can easily set up and apply the key elements of machine learning for training.

In addition, a reinforcement learning agent can be generated easily with only the user's domain knowledge and general machine learning knowledge, and a high-level decision agent can be generated by building various reinforcement learning designs for decision-making problems with minimal effort.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention can be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the following claims.

In addition, the reference numerals recited in the claims of the present invention are given only for clarity and convenience of description and are not limiting; in describing the embodiments, the thickness of lines and the size of components shown in the drawings may be exaggerated for clarity and convenience of description.

In addition, the terms used above are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users and operators; these terms should therefore be interpreted based on the content of this specification as a whole.

In addition, even where not explicitly shown or described, it is obvious that a person having ordinary skill in the art to which the present invention pertains can make various modifications, incorporating the technical idea of the present invention, from the description herein, and such modifications still fall within the scope of the present invention.

In addition, the embodiments described above with reference to the accompanying drawings are presented for the purpose of illustrating the present invention, and the scope of the present invention is not limited to these embodiments.

10: agent generating device 100: training agent unit
110: data storage unit 111: built-in model unit
112: customization function unit 113: catalyst learning unit
114: RL algorithm unit 115: learned model unit
116: explainable AI model unit 117: generative AI model unit
118: training unit 119: learned agent storage unit
200: deployment agent unit 300: user interface unit
400: business domain unit 500: learned agent
600: environment setting screen 610: workspace input unit
620: data search unit 700: learning result screen
710: learning result search screen 800: user-defined reward screen
810: wizard-based reward screen

Claims

An apparatus for generating a decision agent, comprising:
a training agent unit 100 that generates and trains a target reinforcement learning model based on input data for a business domain, and trains the reinforcement learning model by reflecting user setting data in the training of the target reinforcement learning model; and
a deployment agent unit 200 that deploys the reinforcement learning model generated by the training agent unit 100,
wherein the training agent unit 100 comprises: a data storage unit 110 that stores input business domain data and outputs, based on the stored data, environment information for generating a reinforcement learning model and performing reinforcement learning;
a built-in model unit 111 that generates a reinforcement learning model based on the stored data;
a customization function unit 112 that reflects a reward function input by a user in the environment information for performing the reinforcement learning;
a catalyst learning unit 113 that sets optimization information for increasing the training speed and performance of the reinforcement learning, and automatic reward information so that the reinforcement learning can be applied easily;
an RL algorithm unit 114 that selects a reinforcement learning algorithm for training an agent according to the business domain;
a training unit 118 that generates an agent using the reinforcement learning model based on the environment information and the reward function, and trains the generated agent; and
a learned agent storage unit 119 that stores the reinforcement learning model trained through the agent,
wherein the training unit 118 generates a plurality of agents, trains the reinforcement learning model in parallel, and visualizes and displays a reward performance value for each training run of each agent.

The apparatus of claim 1,
further comprising a user interface unit 300 that outputs, to the training agent unit 100, setting data consisting of at least one of: environment elements of reinforcement learning input by the user through drag and drop or coding, optimization data for increasing the training speed and performance of reinforcement learning, and reinforcement learning algorithm selection data.

The apparatus of claim 2,
wherein the user interface unit 300 requests the training agent unit 100 to retrieve training information for the reinforcement learning model deployed through the deployment agent unit 200, and visually converts and displays evaluation data and monitoring data of the retrieved reinforcement learning model.

delete

The apparatus of claim 1,
wherein the training agent unit 100 further comprises: a learned model unit 115 that provides a model capable of training an agent quickly by using a model trained in advance;
an explainable AI model unit 116 that provides a model for a domain requiring an explanation of decision-making; and
a generative AI model unit 117 that provides a model that takes data containing missing values and, using the existing data distribution, generates data in which the missing values are replaced.

A method of generating a decision agent, performed by a decision agent generating apparatus including a training agent unit 100, a deployment agent unit 200, and a user interface unit 300, the method comprising:
a) performing, by the training agent unit 100, exploratory data analysis (EDA) on an arbitrary business domain, based on data for that business domain, in order to configure an environment for applying reinforcement learning and to proceed with training;
b) designing, by the training agent unit 100, a workspace as user settings are input through the user interface unit 300;
c) defining, by the training agent unit 100, environment information and a reward function, input through the user interface unit 300, for generating a reinforcement learning agent, and training an agent generated using a reinforcement learning model based on the environment information and the reward function; and
d) deploying, by the deployment agent unit 200, the reinforcement learning model generated by the training agent unit 100,
wherein in step c), the training agent unit 100 generates a plurality of agents and trains the reinforcement learning model in parallel.

The method of claim 7,
wherein step d) further comprises analyzing and outputting, by the training agent unit 100, the agent's configuration management information, the reinforcement learning model, and the training parameter information through the user interface unit 300.

The method according to claim 7 or 8,
wherein in step b), previous experiment results are searched in the designed workspace,
and experiments are sorted, searched, and analyzed according to the search results.

The method according to claim 7 or 8,
wherein in step c), at least part of the environment information and the reward function set through the user interface unit 300 is set automatically based on the metadata of the business domain data set.

The method according to claim 7 or 8,
wherein in step c), the user interface unit 300 visualizes and displays the reward performance value for each training run of each agent.