KR20220041519A

KR20220041519A - Automatic generation method and system of artificial intelligence algorithm

Info

Publication number: KR20220041519A
Application number: KR1020200124842A
Authority: KR
Inventors: 장광선; 이정일; 임정선; 신지강; 최승환
Original assignee: 한국전력공사
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2022-04-01

Abstract

According to the present invention, a method for automatically generating an artificial intelligence algorithm may include the steps of: determining basic candidate pipelines according to the conditions required of the artificial intelligence; adding, as first additional candidate pipelines, pipelines that database logs show to have higher performance than the basic candidate pipelines; determining second additional candidate pipelines by selecting an optimal combination of the unit steps constituting the pipeline template of each candidate pipeline; and determining the candidate pipeline that shows the best performance as the final artificial intelligence pipeline. The present invention provides a method for automatically generating an artificial intelligence algorithm suitable for the electric power field.

Description

AUTOMATIC GENERATION METHOD AND SYSTEM OF ARTIFICIAL INTELLIGENCE ALGORITHM

The present invention relates to a method for automatically generating an artificial intelligence algorithm suited to its intended use, and more particularly to a method and system that select an optimal pipeline-type artificial intelligence algorithm on the basis of a database.

Machine learning is a branch of artificial intelligence (AI) in which a computer performs prediction tasks such as regression, classification, and clustering based on what it has learned from data by itself. As machine learning has demonstrated strong performance, many industries are developing and deploying AI-based predictive models.

The power industry is likewise applying AI in many areas on the basis of the big data it has accumulated, and AI-based algorithms are performing well. To maintain a stable balance of power supply and demand, peak power demand is forecast in actual operations using methods ranging from statistical techniques such as ARIMA to LSTM-based deep learning. AI is used not only to predict the SMP (system marginal price) and the maximum power supply, but also to build and apply a health index that predicts equipment lifespan from facility sensor data.

AI is widely applied not only to models built on tabular data but also to natural language processing (NLP), which works with text data, and to computer vision, which deals with images and video.

In addition, question-and-answer (QA) systems for the power generation field, legal chatbots, and the like are being studied on the basis of various in-house text data. In computer vision, a YOLO-based insulator detection model and a Mask R-CNN-based model for recognizing safety helmets at construction sites are being studied.

However, although a great deal of data has been accumulated and the groundwork for analysis has been laid, there is a shortage of AI experts who can actually analyze the data and develop models. In the power sector, data-driven model development requires domain knowledge of the field. An analyst who has only AI knowledge will find it difficult to create algorithms good enough to deploy in the field. Conversely, a power-domain expert will find it difficult to go beyond simply using machine learning models and develop models grounded in the deep understanding of AI that an AI expert has.

Thus, AI experts in the power sector are scarce while the areas to which AI could be applied are many, creating an imbalance between the supply of and demand for AI experts. Moreover, data analysis and model development are time-consuming, involving a feature engineering stage that searches for optimal features and stages of repeatedly applying multiple algorithms and selecting the best one, so it is difficult to apply AI across the entire power industry with limited personnel. Automated AI technology is therefore needed so that the benefits of AI can be brought to many parts of the field despite the shortage of specialists.

Automated AI technology provides an optimal model by automatically carrying out the sequence of feature engineering, model selection, and hyperparameter tuning once the data and the problem have been defined. It can address the power industry's shortage of specialists and allow optimal algorithms to be applied quickly in many areas. Recently, automated AI services such as Microsoft AutoML, Data Robot, and H2O have been released, and models generated automatically with them are performing well in various fields. However, because of the particular characteristics of data in the power sector and the power industry, no automated AI model with strong performance has yet existed for this field.

Korean Registered Patent Publication No. 10-2010468

An object of the present invention is to provide a method for automatically generating an artificial intelligence algorithm suitable for the power sector.

According to one aspect of the present invention, a method for automatically generating an artificial intelligence algorithm may include: determining basic candidate pipelines according to the conditions required of the artificial intelligence; adding, as first additional candidate pipelines, pipelines that the database log shows to have higher performance than each of the basic candidate pipelines; determining second additional candidate pipelines by selecting an optimal combination of the unit steps constituting the pipeline template of each candidate pipeline; and confirming, as the final artificial intelligence pipeline, the candidate pipeline that shows the best performance.

Here, before the step of determining the second additional candidate pipelines, the method may further include determining the order in which the optimal task is selected for each unit step (the optimal selection order).

Here, determining the second additional candidate pipelines may include: selecting an optimal task for the step that comes first in the optimal selection order, among the series of steps constituting each candidate pipeline template; and, with the previously selected optimal tasks kept as they are, selecting an optimal task for the next step in the optimal selection order, repeating this through the final step.

Here, in determining the second additional candidate pipelines, the optimal combination of unit steps may be selected only for the candidate pipeline templates that were added in the step of adding to the candidate pipeline templates.

Here, determining the basic candidate pipelines may include: identifying a first number of pipelines from the characteristics of the data to be processed by the artificial intelligence and from the problem definition; and selecting a second number of pipelines as the basic candidate pipelines according to the performance of the first number of pipelines measured on a randomly sampled evaluation data set.

Here, determining the basic candidate pipelines may include: determining candidate pipeline templates for the artificial intelligence algorithm; and generating each basic candidate pipeline by selecting an optimal combination of the unit steps constituting each determined candidate pipeline template.

Here, in determining the second additional candidate pipelines, at least one of the optimal selection order of the unit steps and the evaluation data set used for optimal selection may differ from that applied in determining the basic candidate pipelines.

According to another aspect of the present invention, a system for automatically generating an artificial intelligence algorithm may include: an artificial intelligence DB storing information on pipelines and on the steps constituting each pipeline, evaluation data for evaluating each pipeline, and logs of evaluations and optimizations already performed; a feature engineering module that selects evaluation data and generates a pipeline by selecting an optimal task for each step constituting a pipeline template; and a model optimization module that manages a plurality of candidate pipeline templates, manages a plurality of candidate pipelines, and confirms the final artificial intelligence pipeline.

Here, the model optimization module may determine the candidate pipeline templates; add, as first additional candidate pipelines, pipelines that the database log shows to have higher performance than the basic candidate pipelines; add the templates to which the added first additional candidate pipelines belong to the candidate pipeline templates; and confirm, as the final artificial intelligence pipeline, the pipeline showing the best performance among the basic candidate pipelines, the first additional candidate pipelines, and the second additional candidate pipelines.

Here, the feature engineering module may determine the basic candidate pipelines and the second additional candidate pipelines by selecting an optimal combination of the unit steps constituting each candidate pipeline template.

Here, the feature engineering module may determine the optimal selection order of the unit steps used in the optimality determination for pipeline generation.

Here, the feature engineering module may select an optimal task for the step that comes first in the optimal selection order, among the series of steps constituting each candidate pipeline template, and then, with the previously selected optimal tasks kept as they are, select an optimal task for the next step in the optimal selection order, repeating this through the final step.

Implementing the method for automatically generating an artificial intelligence algorithm according to the present invention, configured as described above, has the advantage that an artificial intelligence algorithm suited to a given purpose in the power sector can be generated.

Specifically, the method proved suitable for regression, classification, and text classification problems that are being developed and applied in the field. It was applied to one problem of each type: peak power demand forecasting as the regression problem, detection of contract violations by Bitcoin mining facilities as the classification problem, and question-intent classification on data from a legal expert service chatbot as the text classification problem. Compared with the existing models, performance improved by 3.42% for peak power demand forecasting and by 3.23% for detection of Bitcoin mining contract violations, and the chatbot question-intent classification reached an accuracy of 95.23%, confirming the possibility of replacing a rule-based chatbot engine with an AI-based one.

FIG. 1 is a flowchart illustrating a method for automatically generating an artificial intelligence algorithm according to an embodiment of the present invention.
FIG. 2 is a procedural flow diagram illustrating how candidates are added to the candidate pipeline list as the method of FIG. 1 proceeds.
FIG. 3 is a conceptual diagram illustrating the execution structure of the beam search algorithm.
FIG. 4 is a block diagram illustrating an artificial intelligence algorithm generation system capable of performing the method of the present invention.
FIG. 5 is a block diagram illustrating an example implementation of the system of FIG. 4 on a cloud virtual server.

In describing the present invention, terms such as first and second may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component could be termed a second component, and similarly a second component could be termed a first component.

When a component is referred to as being connected or coupled to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

The terminology used herein is for describing particular embodiments only and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Before describing what is proposed according to the present invention, the relevant basic concepts are explained.

First, the automation of feature engineering is described.

Feature engineering is said to account for more than 70% of the outcome of a data analysis; creating new features and selecting only the good ones has a profound effect on the performance of a predictive model. Features suited to the characteristics of the data are created through methods such as missing-value handling, outlier removal, one-hot encoding, and log transformation.
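As an illustration of the feature generation methods named above, the following is a minimal sketch in plain Python covering mean imputation of missing values, one-hot encoding, and a log transformation. The function names and the toy columns are hypothetical and are not taken from the specification; outlier removal is omitted for brevity.

```python
import math
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def log_transform(values):
    """Apply log(1 + x) to compress a right-skewed, non-negative column."""
    return [math.log1p(v) for v in values]


# Hypothetical toy columns, not taken from the source.
load_mw = impute_mean([10.0, None, 30.0])
region = one_hot(["north", "south", "north"])
scaled = log_transform([0.0, 9.0])
```

Each function corresponds to one candidate task that could fill a unit step of a pipeline template.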

Using as many features as possible is not always beneficial, because weakly related variables can degrade model performance and training the AI model takes longer. It is therefore very important to extract the optimal features through feature selection. Feature selection methods include selecting important features by feature importance computed with a machine learning model, selecting important features using a feature selection index such as the chi-square statistic, using correlation coefficients, and recursive feature elimination.
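Of the feature selection methods listed, the correlation-based one is the simplest to sketch. The example below keeps only the features whose absolute Pearson correlation with the target reaches a threshold. The threshold value, the function names, and the toy data are assumptions made for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (norm_x * norm_y)


def select_by_correlation(features, target, threshold=0.5):
    """Return names of features with |correlation to target| >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Hypothetical toy data: one informative feature and one uninformative one.
features = {
    "temperature": [1.0, 2.0, 3.0, 4.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
target = [2.0, 4.0, 6.0, 8.0]
selected = select_by_correlation(features, target)
```

Here `noise` has zero correlation with the target and is dropped, while `temperature` is kept.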

Various studies are under way to automate feature engineering based on exploratory data analysis (EDA), on which data analysts spend a great deal of time. Although they use different methodologies, they are fundamentally similar in that they automatically generate transformed features through predefined data transformation functions and then extract only the good features through feature selection techniques.

Next, automatic algorithm search and hyperparameter tuning are described.

Automatic algorithm search selects the optimal algorithm from a list of applicable algorithms using heuristic methods such as genetic algorithms and Bayesian optimization.

Hyperparameter tuning optimizes the hyperparameters of the selected model through methods such as grid search, Bayesian search, and random search. Applying optimal hyperparameters can improve model performance.
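The difference between grid search and random search can be sketched as follows. The scoring function stands in for training and validating a model and is purely hypothetical, as are the parameter names `depth` and `learning_rate`; Bayesian search is not shown.

```python
import itertools
import random


def grid_search(score, grid):
    """Evaluate every combination in the grid and return (best_score, params)."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        value = score(params)
        if best is None or value > best[0]:
            best = (value, params)
    return best


def random_search(score, grid, n_trials, seed=0):
    """Evaluate n_trials randomly drawn combinations and return the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in grid.items()}
        value = score(params)
        if best is None or value > best[0]:
            best = (value, params)
    return best


def toy_score(params):
    """Hypothetical validation score that peaks at depth=4, learning_rate=0.1."""
    return -((params["depth"] - 4) ** 2) - 100 * (params["learning_rate"] - 0.1) ** 2


grid = {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 1.0]}
grid_best = grid_search(toy_score, grid)
random_best = random_search(toy_score, grid, n_trials=4)
```

Grid search is exhaustive over the listed values, so on the same grid random search can match its result but never beat it; its appeal is the lower evaluation count.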

Recently, neural architecture search (NAS), which finds the deep learning architecture that performs best on the input data, has been attracting attention. Because it requires more computational resources than a conventional machine learning model search, there is much research on distributed processing and efficient search techniques to reduce the computation.

Hereinafter, embodiments of the present invention are described with reference to the drawings.

In the automatic algorithm generation method presented in the present invention, the performance of proposed artificial intelligence pipelines is evaluated on randomly sampled data; on that basis, a list of pipelines likely to perform well is drawn up; variant pipelines are derived using the template of each pipeline in that list; and the optimal artificial intelligence pipeline is sought among them.

FIG. 1 is a flowchart illustrating a method for automatically generating an artificial intelligence algorithm according to an embodiment of the present invention.

The illustrated method may include: determining basic candidate pipelines according to the conditions required of the artificial intelligence (S100); adding, as first additional candidate pipelines, pipelines that the database log shows to have higher performance than each of the basic candidate pipelines (S200); determining second additional candidate pipelines by selecting an optimal combination of the unit steps constituting the pipeline template of each candidate pipeline (S300); and confirming, as the final artificial intelligence pipeline, the candidate pipeline that shows the best performance (S400).
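The order of steps S100 to S400 can be summarized in a short sketch. The function below only wires the four steps together; each step is passed in as a callable, and the callables used in the example are hypothetical stand-ins with made-up scores rather than the implementation described in the specification.

```python
from typing import Callable, Dict, List, Tuple

# Assumption for this sketch: a pipeline is a tuple of task names.
Pipeline = Tuple[str, ...]


def generate_pipeline(
    propose_basic: Callable[[], List[Pipeline]],
    better_from_logs: Callable[[List[Pipeline]], List[Pipeline]],
    optimize_templates: Callable[[List[Pipeline]], List[Pipeline]],
    evaluate: Callable[[Pipeline], float],
) -> Pipeline:
    """Run S100 to S400 in order and return the best candidate pipeline."""
    candidates = list(propose_basic())            # S100: basic candidates
    candidates += better_from_logs(candidates)    # S200: first additional candidates
    candidates += optimize_templates(candidates)  # S300: second additional candidates
    return max(candidates, key=evaluate)          # S400: final pipeline


# Hypothetical scores standing in for a real evaluation module.
toy_scores: Dict[Pipeline, float] = {
    ("impute", "arima"): 0.70,
    ("impute", "lstm"): 0.78,
    ("impute", "log", "lstm"): 0.81,
}

final_pipeline = generate_pipeline(
    propose_basic=lambda: [("impute", "arima")],
    better_from_logs=lambda cands: [("impute", "lstm")],
    optimize_templates=lambda cands: [("impute", "log", "lstm")],
    evaluate=lambda p: toy_scores[p],
)
```

The candidate list only grows from step to step, so the final selection in S400 compares the basic, first additional, and second additional candidates together.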

Depending on the implementation, the method may further include, before the step of determining the second additional candidate pipelines (S300), a step of determining the optimal selection order of the unit steps used in the optimality determination (S280). Depending on the implementation, step S280 may instead be performed before step S200 or before step S100.

In step S300, the optimal combination for each pipeline template of the basic candidate pipelines and the first additional candidate pipelines is searched for in the optimal selection order determined in step S280, and the combinations found to be optimal may be added as second additional candidate pipelines.

More specifically, the step of determining the second additional candidate pipelines (S300) can be described as including: selecting an optimal task for the step that comes first in the optimal selection order, among the series of steps constituting each candidate pipeline template; and, with the previously selected optimal tasks kept as they are, selecting an optimal task for the next step in the optimal selection order, repeating this through the final step.
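The step-by-step selection described above can be sketched as a greedy loop: steps are visited in the optimal selection order, and each choice is fixed before the next step is considered. The step names, task names, and scoring function below are hypothetical; in particular, the assumption that a partially filled pipeline can be scored is made only for this sketch.

```python
def select_tasks_in_order(template, options, order, evaluate):
    """Pick one task per step, visiting steps in `order` and fixing each choice.

    template: list of step names in execution order
    options:  dict mapping step name -> list of candidate tasks
    order:    list of step names in optimal selection order
    evaluate: function scoring a dict {step: task} (higher is better)
    """
    chosen = {}
    for step in order:
        # Earlier choices stay as they are; only the current step varies.
        best_task = max(
            options[step],
            key=lambda task: evaluate({**chosen, step: task}),
        )
        chosen[step] = best_task
    return [chosen[step] for step in template]


# Hypothetical per-task scores with one interaction between two steps.
BASE = {"mean": 0.1, "median": 0.2, "standard": 0.1,
        "minmax": 0.05, "ridge": 0.5, "gbm": 0.6}


def toy_evaluate(assignment):
    score = sum(BASE[task] for task in assignment.values())
    if assignment.get("model") == "gbm" and assignment.get("scale") == "minmax":
        score += 0.1  # pretend this model benefits from min-max scaling
    return score


template = ["impute", "scale", "model"]
options = {"impute": ["mean", "median"],
           "scale": ["standard", "minmax"],
           "model": ["ridge", "gbm"]}
order = ["model", "impute", "scale"]  # selection order differs from execution order
pipeline = select_tasks_in_order(template, options, order, toy_evaluate)
```

Because the model is selected first in this example, the later choice of scaling method can take the model into account, which is why the selection order matters.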

In step S100, the basic candidate pipelines may be determined in either of two ways: they may be extracted from pipelines for which results of previously performed automatic generation or evaluation runs exist, or candidate pipeline templates may be selected first and an optimal combination then assigned to each selected template. Here, a pipeline template is a sequence of unit steps, and a pipeline is the sequence of specific tasks selected for each of those unit steps.
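The distinction between a pipeline template and a pipeline can be made concrete with two small data classes. This is one possible representation, offered only as a sketch; the class names, field names, and example step and task names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PipelineTemplate:
    """A sequence of unit steps, with no task chosen yet."""
    steps: Tuple[str, ...]


@dataclass(frozen=True)
class Pipeline:
    """A template plus one concrete task selected for each of its steps."""
    template: PipelineTemplate
    tasks: Tuple[str, ...]

    def __post_init__(self):
        if len(self.tasks) != len(self.template.steps):
            raise ValueError("one task is required per template step")

    def as_dict(self) -> Dict[str, str]:
        """Map each step name to the task selected for it."""
        return dict(zip(self.template.steps, self.tasks))


# Hypothetical step and task names.
template = PipelineTemplate(("missing_values", "encoding", "model"))
pipeline = Pipeline(template, ("mean_imputation", "one_hot", "lstm"))
```

Several pipelines can share one template, which is what allows variant pipelines to be derived from the template of an existing candidate.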

In the former case, step S100 may include: identifying a first number of pipelines from the characteristics of the data to be processed by the artificial intelligence and from the problem definition; and selecting a second number of pipelines as the basic candidate pipelines according to the performance of the first number of pipelines measured on a randomly sampled evaluation data set.

In the latter case, step S100 may include: determining candidate pipeline templates for the artificial intelligence algorithm; and generating each basic candidate pipeline by selecting an optimal combination of the unit steps constituting each determined candidate pipeline template.

In the latter case, in determining the second additional candidate pipelines, at least one of the optimal selection order of the unit steps and the evaluation data set used for optimal selection may differ from that applied in determining the basic candidate pipelines. Alternatively, in determining the second additional candidate pipelines, the optimal combination of unit steps may be selected only for the candidate pipeline templates that were added in the step of adding to the candidate pipeline templates.

Table 1 below lists the steps (types) from which pipeline templates useful in the power sector can be composed, and the tasks that can be assigned to each step. In other words, it represents the custom module of feature generation.

FIG. 2 illustrates how candidates are added to the candidate pipeline list as the method of FIG. 1 proceeds.

The process of FIG. 2 is first described concretely for the former case.

In the part of step S100 that identifies the first number of pipelines, a list of N artificial intelligence pipelines may be proposed by combining, as conditions, the data characteristics (for example, the data source system, the characteristics of each column, the number of rows, the number of columns, the column types, distribution, and skewness) and the problem definition (classification, regression, text classification, image classification, and so on).

In the part of step S100 that selects the second number of pipelines as the basic candidate pipelines, the data set may be randomly sampled to reduce the time required. Using the sampled data, the performance of the N artificial intelligence pipelines initially proposed by the system is measured (evaluated) on the basis of a beam search, and k pipelines are selected.
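The sampling and top-k selection can be sketched as follows. For simplicity, this sketch scores each proposed pipeline directly on the sample instead of using the beam-search-based evaluation mentioned in the text; the sampling fraction, pipeline names, and scores are hypothetical.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Draw a random sample of the data set without replacement."""
    rng = random.Random(seed)
    size = max(1, int(len(rows) * fraction))
    return rng.sample(rows, size)


def top_k(pipelines, evaluate, sample, k):
    """Score each proposed pipeline on the sample and keep the k best."""
    ranked = sorted(pipelines, key=lambda p: evaluate(p, sample), reverse=True)
    return ranked[:k]


# Hypothetical data set and scores; a real evaluator would train and
# validate each pipeline on `sample`.
rows = list(range(100))
sample = sample_rows(rows, fraction=0.1)

toy_scores = {"arima": 0.70, "gbm": 0.75, "lstm": 0.78, "ridge": 0.60}


def toy_evaluate(pipeline, sample):
    return toy_scores[pipeline]


basic_candidates = top_k(list(toy_scores), toy_evaluate, sample, k=2)
```

Here N is 4 and k is 2, so the two best-scoring pipelines on the sample become the basic candidate pipelines.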

Next, in step S200, a total of j pipelines, namely pipelines similar to the selected top k pipelines and pipelines that showed better performance in the database log (that is, pipelines that performed well in past cases), are added to the list as first additional pipelines.
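One way to read step S200 is sketched below: logged pipelines for the same problem type whose recorded score exceeds that of every current candidate are added, up to j of them. The log record layout (`pipeline`, `problem`, `score`) is an assumption made for this sketch, and the similarity-based part of the step is not modeled because the text does not define the similarity measure.

```python
def add_from_logs(candidates, candidate_scores, log, problem, j):
    """Return up to j logged pipelines that outperformed every current candidate."""
    threshold = max(candidate_scores[c] for c in candidates)
    better = [
        record for record in log
        if record["problem"] == problem
        and record["pipeline"] not in candidates
        and record["score"] > threshold
    ]
    better.sort(key=lambda record: record["score"], reverse=True)
    added = []
    for record in better:
        if len(added) >= j:
            break
        if record["pipeline"] not in added:  # skip duplicate log entries
            added.append(record["pipeline"])
    return added


# Hypothetical candidates and log records.
candidates = [("impute", "arima"), ("impute", "gbm")]
candidate_scores = {("impute", "arima"): 0.70, ("impute", "gbm"): 0.75}
log = [
    {"pipeline": ("impute", "lstm"), "problem": "regression", "score": 0.80},
    {"pipeline": ("log", "lstm"), "problem": "regression", "score": 0.78},
    {"pipeline": ("impute", "svm"), "problem": "classification", "score": 0.95},
    {"pipeline": ("impute", "ridge"), "problem": "regression", "score": 0.60},
]
first_additional = add_from_logs(candidates, candidate_scores, log, "regression", j=2)
```

The classification record is ignored despite its high score because it was logged for a different problem type.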

At this point, a total of (k + j) candidate pipelines have been selected.

Next, in step S300, for each of the (k + j) templates (pipelines) proposed on the basis of the artificial intelligence template (pipeline) database, tasks that serve the same purpose as an applied task but use a different method may also be tried ("task" here refers to the step or object listed as a "task" in the table on page 7). Using the evaluation module, the composition of a template (pipeline) may be changed to the better-performing task. That is, each pipeline template is searched starting from the (k + j) database-derived candidate pipelines, and the resulting a pipelines judged optimal are added to the candidate pipeline list. As a result, a total of (k + j + a) candidate pipelines have been selected.

In step S400, the pipeline showing the best performance among the (k + j + a) pipelines in the list is selected as the final AI pipeline.
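
Steps S200 to S400 can be summarized in a short sketch. All names here (`build_candidates`, `log_scores`, `expand`, `evaluate`) are hypothetical, the shape of the database log is assumed to be a simple mapping from pipeline to past score, and the similarity-based additions mentioned for S200 are left out for brevity.

```python
def build_candidates(basic, log_scores, expand, evaluate, j, a):
    """Assemble the (k + j + a) candidate list and return it with the best pipeline.

    basic      : the k basic candidate pipelines from S100
    log_scores : {pipeline: past score} read from the database log (assumed shape)
    expand     : callable returning variant pipelines for one candidate (S300)
    evaluate   : callable returning a score, higher is better
    """
    # S200: add up to j logged pipelines that scored better than every basic one.
    threshold = max(log_scores.get(p, float("-inf")) for p in basic)
    logged_better = sorted(
        (p for p, s in log_scores.items() if s > threshold and p not in basic),
        key=log_scores.get,
        reverse=True,
    )
    candidates = list(basic) + logged_better[:j]

    # S300: search step variants of every candidate and keep the a best new ones.
    variants = {v for p in candidates for v in expand(p)} - set(candidates)
    candidates += sorted(variants, key=evaluate, reverse=True)[:a]

    # S400: the best-performing candidate becomes the final pipeline.
    return candidates, max(candidates, key=evaluate)
```

Pipelines are assumed to be hashable values such as tuples of task names, which lets the sketch deduplicate variants with a set.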

Next, the beam search algorithm is described as a method for finding the optimal values that may be used in steps S100 and S300.

FIG. 3 shows the execution structure of the beam search algorithm.

In the illustrated case, the optimal selection order is step1, step2, step3. When the optimal task is selected for each step, the choice is made among the tasks selectable at that step and the option of selecting no task.

As illustrated, among the series of steps (step1, step2, step3, ...), an optimal task is first selected for the first step in the optimal selection order (step1; Level 1). Then, with the previously selected optimal tasks kept as they are, an optimal task is selected for the next step in the order (Level 2, 3, ...), and this is repeated up to the final step (Level N).
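
The level-by-level selection can be sketched as a small beam search. The function name, the `evaluate` scorer, and the `NO_TASK` marker are assumptions made for illustration; the source does not state a beam width, so it is left as a parameter here.

```python
NO_TASK = None  # marker meaning "apply nothing at this step"

def beam_search(step_options, evaluate, beam_width=1):
    """Choose one task per step, in order, keeping the best partial pipelines.

    step_options : list of task lists, already in the optimal selection order
    evaluate     : hypothetical scorer for a (partial) pipeline tuple, higher is better
    """
    beams = [()]  # Level 0: a single empty partial pipeline
    for options in step_options:  # Level 1, 2, ..., N
        expanded = [b + (task,) for b in beams for task in list(options) + [NO_TASK]]
        expanded.sort(key=evaluate, reverse=True)
        beams = expanded[:beam_width]  # earlier choices stay fixed inside each beam
    return beams[0]
```

With `beam_width=1` the search reduces to the greedy procedure described above, in which each level keeps only the single best choice; a larger width keeps several partial pipelines alive at the cost of more evaluations.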

The method for automatically generating an AI algorithm to which the beam search algorithm described above is added further includes, before the step of determining the basic candidate pipeline (S100), a step of determining the optimal selection order of the unit steps used in the optimality determination.

In addition, the step of determining the basic candidate pipeline includes: a process of selecting an optimal task for the first step in the optimal selection order among the series of steps constituting each candidate pipeline template; and a process of selecting, with the previously selected optimal tasks kept as they are, an optimal task for the next step in the optimal selection order, repeated up to the final step.

FIG. 4 is a block diagram illustrating an AI algorithm generation system capable of performing the automatic AI algorithm generation method of the present invention.

The illustrated AI algorithm generation system may include: an AI pipeline DB 100 that stores information on pipelines and the steps constituting each pipeline, evaluation data for evaluating each pipeline, and logs of previously performed evaluation and optimization (or information for performing feature engineering); a feature engineering module 200 that selects evaluation data (specifically, the feature generation module 220) and generates a pipeline by selecting an optimal task for each step constituting a pipeline template (specifically, the feature selection module 240); and a model optimization module 300 that manages a plurality of candidate pipeline templates, manages a plurality of candidate pipelines, and determines the final AI pipeline.

In performing the AI algorithm generation method according to the flowchart of FIG. 1, the model optimization module 300 may determine the candidate pipeline templates; add, as first additional candidate pipelines, pipelines that according to the database log perform better than the basic candidate pipelines; add the templates to which the added first additional candidate pipelines belong to the candidate pipeline templates; and confirm as the final AI pipeline the best-performing pipeline among the basic candidate pipelines, the first additional candidate pipelines, and the second additional candidate pipelines.

Likewise, in performing the AI algorithm generation method according to the flowchart of FIG. 1, the feature engineering module 200 may determine the basic candidate pipelines and/or the second additional candidate pipelines by selecting an optimal combination of the unit steps constituting each candidate pipeline template.

In addition, the feature engineering module 200 may determine the optimal selection order of the unit steps used in the optimality determination for generating the pipelines.

More specifically, the feature engineering module 200 may select an optimal task for the first step in the optimal selection order among the series of steps constituting each candidate pipeline template, and then, keeping the previously selected optimal tasks as they are, select an optimal task for the next step in the order, repeating this up to the final step.

The illustrated system is an automatic AI algorithm generation system specialized for the power industry, covering the whole sequence of feature engineering, feature selection, model selection, and hyper-parameter tuning.

Models generated with this technology were applied to three power-industry fields in which AI models are already in use, and they proved superior to the existing models. Peak power demand forecasting and contract-violation detection for Bitcoin mining sites improved by 3.42% and 3.23%, respectively, over the existing models. In addition, the question intent classification model for a legal expert service chatbot, generated automatically by this technology, achieved 95.23% accuracy, showing that it could replace existing commercial software. Furthermore, by combining the system with infrastructure such as OpenStack and Kubernetes, a stable and scalable system applicable to real industry was built.

First, the AI pipeline DB 100 is described in detail.

The AI pipeline database 100 accumulates AI-pipeline-based data by analyzing and reverse engineering the AI algorithms used in various previously developed power systems, so that they can be applied to similar cases. In addition, outstanding power-sector cases from sources such as Kaggle may be analyzed and added to the AI pipelines.

The database also stores the results of training and testing, on power data, data preprocessing techniques and AI algorithms already known to perform well. Rather than storing only the settings that gave an algorithm its best performance, it also logs the results obtained during the search for that best performance, and these logs are likewise used as base data for generating the list of optimal AI pipelines. In other words, previously developed algorithms, best practices in data analysis, and general-purpose AI algorithms can all be tested on power data, and the test results can be converted into data and stored in the database.

Based on this log data, the AI pipeline database 100 can suggest a suitable algorithm according to the characteristics of the data and the definition of the problem to be solved.

The AI pipeline database 100 provides the most suitable AI pipeline according to the data actually used and the definition of the problem actually to be solved. An optimal AI pipeline consists of a series of processes running from feature engineering elements, including data preprocessing, through to the optimal algorithm. For example, the database may be used to propose a certain number of AI pipelines based on the data and logs accumulated in it.
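
One way such a log-driven proposal could work is sketched here. The `LogEntry` fields and the ranking rule (same problem type, then closest dataset size, then highest score) are assumptions chosen for illustration; the source does not specify how data characteristics are matched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    problem: str     # e.g. "regression" or "classification"
    n_rows: int      # size of the dataset the pipeline was run on
    pipeline: tuple  # ordered task names
    score: float     # logged validation score, higher is better

def propose_pipelines(log, problem, n_rows, n):
    """Return up to n distinct pipelines logged for the same problem type,
    preferring similar dataset sizes and then higher scores."""
    matches = [e for e in log if e.problem == problem]
    matches.sort(key=lambda e: (abs(e.n_rows - n_rows), -e.score))
    proposals = []
    for entry in matches:
        if entry.pipeline not in proposals:
            proposals.append(entry.pipeline)
    return proposals[:n]
```

A richer similarity measure (column types, basic statistics) would slot into the sort key without changing the rest of the function.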

Next, the automatic feature engineering module 200 is described in detail.

The feature engineering module consists of a feature generation module 220, a feature selection module 240, and an evaluation (scoring) module 260.

The feature generation and feature selection modules are separated only by function; during execution they may run without distinction in order to produce the optimal dataset. The modules automatically infer the attributes of each data column and perform the feature engineering tasks suited to each type. The evaluation module computes a cross-validation score for the dataset to which each task has been applied, and the search for the optimal dataset proceeds on the basis of these scores.
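
Automatic inference of column attributes might look like the following sketch. The three type labels, the `max_categories` threshold, and the `TASKS_BY_TYPE` mapping are illustrative assumptions rather than details from the source.

```python
def infer_column_type(values, max_categories=10):
    """Guess a column's type from its non-missing values.

    Returns "numeric", "categorical", or "text". The threshold is an
    illustrative assumption, not a value from the source.
    """
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "categorical"
    try:
        for v in present:
            float(v)  # succeeds for numbers and numeric strings
        return "numeric"
    except (TypeError, ValueError):
        pass
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"

# Hypothetical mapping from inferred type to candidate feature engineering tasks.
TASKS_BY_TYPE = {
    "numeric": ["impute", "scale", "bin"],
    "categorical": ["impute", "one_hot"],
    "text": ["embed"],
}
```

Each task chosen this way would then be scored by cross-validation, as the paragraph above describes, before it is kept in the pipeline.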

The feature generation module 220 may consist of a custom module 222 and a non-linear feature generation module 224.

The custom module 222 provides feature engineering techniques that existing basic AI automation services do not offer. Existing services fail to run when the data contains missing values. They also generate features only through predefined feature generation techniques, and their generation targets are limited to numeric data. In contrast, the present system provides a variety of variable generation functions: it embeds text data automatically, discretizes continuous variables into categories, and generates embedding variables by several methods such as clustering.
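
Two of the simpler operations mentioned here, handling missing values and categorizing a continuous variable, can be sketched as follows. Mean imputation and equal-width binning are assumed choices; the source does not say which imputation or binning rules the custom module uses.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values.

    Assumes at least one value is observed.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Map each value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Running imputation before binning means a column with gaps still yields a complete categorical feature instead of stopping the pipeline.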

As shown in Table 1 above, the custom module 222 currently applies more than 30 data preprocessing tasks of 10 types to create candidate datasets capable of the best performance in the power field. In addition, unlike existing AI automation services, it may further refine the dataset so that the system runs without errors across varied cases, by minimizing user intervention and handling exceptions robustly.

The non-linear feature generation module 224 generates non-linear features such as log(x), √x, 1/x, x^2, x^3, |x|, exp(x), sin(x), and cos(x), and can generate further, second-order features through arithmetic operations between the generated features. The module can thus automatically generate an exponentially growing number of non-linear features of various forms, helping to uncover additional features that carry hidden insight in data with non-linear properties.
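
The listed transforms and the second-order combination can be sketched in plain Python. The naming scheme, the use of `None` for undefined results, and the choice of products as the arithmetic operation are assumptions for illustration; `math.exp` will also raise `OverflowError` for very large inputs, which this sketch does not guard against.

```python
import math

# Each transform returns None where it is undefined for the input value.
TRANSFORMS = {
    "log": lambda x: math.log(x) if x > 0 else None,
    "sqrt": lambda x: math.sqrt(x) if x >= 0 else None,
    "inv": lambda x: 1.0 / x if x != 0 else None,
    "sq": lambda x: x ** 2,
    "cube": lambda x: x ** 3,
    "abs": abs,
    "exp": math.exp,
    "sin": math.sin,
    "cos": math.cos,
}

def nonlinear_features(column_name, values):
    """Return {new_feature_name: values} for every transform in TRANSFORMS."""
    return {
        f"{name}({column_name})": [fn(v) for v in values]
        for name, fn in TRANSFORMS.items()
    }

def pairwise_products(features):
    """Second-order features: element-wise products of feature pairs (None propagates)."""
    names = sorted(features)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            out[f"{a}*{b}"] = [
                None if x is None or y is None else x * y
                for x, y in zip(features[a], features[b])
            ]
    return out
```

One input column already yields 9 first-order and 36 second-order features, which shows how quickly the feature count grows and why selection steps are interleaved with generation.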

The feature selection module 240 may consist of a custom module 242, a Boruta module 244, and a linear-model-based module 246. Feature selection steps may be applied in between feature generation steps, in the order custom module, Boruta module, linear-model-based module. Applying feature selection steps between feature generation steps addresses memory shortages, and removing in advance those newly generated features that do not contribute to performance improves performance further.

The custom feature selection module 242 applies feature selection rules based on domain expertise data accumulated through analysis of previously developed models and systems. When the same source data, or data from a similar system, is used, a preprocessing policy similar to that of the previously developed system is applied. As more logs accumulate in the system's log database, the number of rules in the custom feature selection module 242 will grow. This is expected to shorten the search for the optimal feature engineering tasks and to improve model performance.

The Boruta module 244 runs before the non-linear feature generation module 224, removing noise features and freeing memory. The Boruta module 244 trains a random forest model on the original features together with noise features sampled at random from a normal distribution. Only the original features whose values exceed the absolute value of the noise features' model coefficients are kept. In other words, only the original features that influence the result more than the noise features do are extracted for later use.
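
The thresholding idea, keeping only features that outrank random noise, can be illustrated without a random forest. The sketch below deliberately substitutes absolute Pearson correlation for the random forest scores described in the text so that it stays dependency-free; with scikit-learn, the `feature_importances_` attribute of a fitted random forest would be compared in the same way. The function names and the number of noise columns are assumptions.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0

def drop_noise_like(features, target, n_noise=20, seed=0):
    """Keep features that score above every normally distributed noise column."""
    rng = random.Random(seed)
    noise_scores = [
        abs_corr([rng.gauss(0.0, 1.0) for _ in target], target)
        for _ in range(n_noise)
    ]
    threshold = max(noise_scores)  # the strongest purely random column
    return {
        name: col for name, col in features.items()
        if abs_corr(col, target) > threshold
    }
```

Using the maximum over several noise columns makes the cut-off stricter than a single noise draw would, which reduces the chance of keeping a feature by luck.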

The linear-model-based module 246 uses an L1-regularized linear model to extract only the significant features, based on the absolute value of each feature's coefficient. Because a larger absolute coefficient means a larger influence on the prediction, features with large coefficients are selected within a certain range. An L1-regularized linear model is used for feature selection because it drives the coefficients of unhelpful features to 0 during training, and a feature whose coefficient is 0 has no effect on the model's prediction whatever its value.
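
Why L1 regularization produces exact zeros can be seen in a small coordinate descent solver for the lasso objective (1/2n)·||y − Xw||² + α·||w||₁. This is a teaching sketch under assumptions (centred columns, no intercept, fixed iteration count, hypothetical function names), not the module's actual solver.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    X is a list of rows; columns are assumed centred and there is no intercept.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z else 0.0
    return w

def select_features(names, w, tol=1e-12):
    """Keep the names whose coefficient is non-zero, largest magnitude first."""
    kept = [(abs(c), name) for name, c in zip(names, w) if abs(c) > tol]
    return [name for _, name in sorted(kept, reverse=True)]
```

The soft-threshold step is what sets weak coefficients to exactly 0 rather than to a small value, so selection reduces to keeping the non-zero entries.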

Next, the model selection and optimization module 300 is described in detail.

The model optimization module 300 is responsible for finding and training the optimal algorithm for the optimally preprocessed dataset, and may consist of a custom machine learning module 320, a custom deep learning module 340, and an ensemble module 360.

The custom machine learning module 320 can add to the optimal-algorithm search space not only recent ensemble models such as XGBoost, CatBoost, and LightGBM, but also well-performing algorithms previously developed in fields such as transmission and substation, distribution, and generation. It supports finding and training the optimal machine learning algorithm using a genetic algorithm.

When, according to the domain knowledge database, the task relates to a specific field or a specific system, the custom deep learning module 340 modifies a previously developed deep learning model to find and train the optimal deep learning model.

The ensemble module 360 applies several kinds of voting, stacking, and bagging ensemble algorithms to the models generated by the custom machine learning module and the custom deep learning module, and measures their performance. The best-performing model among the results is selected and provided as the optimal template (pipeline).
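
The compare-and-select behaviour can be sketched for the simplest of the three ensemble types, hard voting. The function names are hypothetical, accuracy is an assumed metric, and stacking and bagging are omitted.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one list by majority (hard) voting."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

def accuracy(pred, truth):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def best_model(model_preds, truth):
    """Score each model and the voting ensemble; return (name, accuracy) of the best."""
    candidates = dict(model_preds)
    candidates["voting"] = majority_vote(list(model_preds.values()))
    scores = {name: accuracy(p, truth) for name, p in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

When the base models make different mistakes, the vote can outperform every individual model, which is the case the ensemble module is looking for.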

Depending on the implementation, the automatic AI algorithm generation system according to the present invention may form a cloud-based infrastructure system. FIG. 5 is a block diagram illustrating an example in which the AI algorithm generation system of FIG. 4 is implemented on cloud virtual servers.

To support efficient and scalable use of system resources, the system infrastructure may be built by creating virtual servers based on the OpenStack Train release. In this case, the web server and web application server (WAS) are hosted on a single virtual server, through which users upload data and check the results. The message queue schedules the jobs requested by users and assigns them to the virtual servers. Depending on the nature of a job, a GPU-based virtual server is created to perform the automatic AI algorithm generation.
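
The scheduling role of the message queue can be pictured with an in-process stand-in. The source does not name the message broker or the assignment policy, so the FIFO order, the round-robin rule, and every identifier below are assumptions.

```python
from collections import deque

def dispatch(jobs, servers):
    """Assign queued jobs to virtual servers in FIFO order.

    jobs    : iterable of (job_id, needs_gpu) tuples
    servers : {server_name: has_gpu}
    Jobs that need a GPU go only to GPU servers; each pool is used round-robin.
    """
    queue = deque(jobs)
    pools = {
        True: deque(name for name, gpu in servers.items() if gpu),
        False: deque(name for name, gpu in servers.items() if not gpu),
    }
    assignment = {}
    while queue:
        job_id, needs_gpu = queue.popleft()
        pool = pools[needs_gpu] or pools[True]  # CPU jobs may fall back to GPU servers
        if not pool:
            raise RuntimeError(f"no server available for job {job_id}")
        server = pool[0]
        pool.rotate(-1)  # round-robin within the pool
        assignment[job_id] = server
    return assignment
```

In a real deployment the queue would live in a separate broker and GPU servers would be created on demand, but the routing decision per job has the same shape.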

A log of the task performed at each stage is stored in the database together with the data attributes and the job request attributes. For example, for a power demand forecasting job, the number of data rows, the characteristics of the features, basic statistics, the tasks applied, and their results are stored. The purpose is to reduce the search space of later automatic AI algorithm generation runs and to build better-performing models that reflect domain expertise.
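
A log record of the kind described might be assembled as in the sketch below. The field names and the choice of mean, minimum, and maximum as the basic statistics are illustrative assumptions, not a schema taken from the source.

```python
import json
import statistics

def make_log_record(job_type, rows, applied_tasks, score):
    """Build one JSON-serialisable log record for a finished job.

    rows is a list of dicts with numeric values that share the same keys.
    """
    columns = sorted(rows[0]) if rows else []
    stats = {
        c: {
            "mean": statistics.fmean(r[c] for r in rows),
            "min": min(r[c] for r in rows),
            "max": max(r[c] for r in rows),
        }
        for c in columns
    }
    return {
        "job_type": job_type,
        "n_rows": len(rows),
        "columns": columns,
        "basic_stats": stats,
        "applied_tasks": list(applied_tasks),
        "score": score,
    }
```

Because the record contains only plain types, `json.dumps(record)` produces a string that can be stored directly in a database text or JSON column.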

Those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features, and that the embodiments described above are therefore illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: AI pipeline DB
200: feature engineering module
220: feature generation module
240: feature selection module
300: model optimization module

Claims

1. A method for automatically generating an artificial intelligence algorithm, comprising:
determining a basic candidate pipeline according to required artificial intelligence conditions;
adding, based on a database log, pipelines having higher performance than the basic candidate pipelines as first additional candidate pipelines;
determining a second additional candidate pipeline by selecting an optimal combination of the unit steps constituting the pipeline template of each candidate pipeline; and
determining the best-performing pipeline among the candidate pipelines as the final artificial intelligence pipeline.

2. The method of claim 1, further comprising,
prior to the step of determining the second additional candidate pipeline,
determining an optimal selection order of the unit steps used in the optimality determination.

3. The method of claim 2,
wherein the step of determining the second additional candidate pipeline comprises:
selecting an optimal task for the first step in the optimal selection order among a series of steps constituting each candidate pipeline template; and
selecting, with the previously selected optimal task kept as it is, an optimal task for the next step in the optimal selection order, repeated up to the final step.

4. The method of claim 1,
wherein, in the step of determining the second additional candidate pipeline,
the optimal combination of unit steps is selected only for the candidate pipeline templates added in the step of adding to the candidate pipeline templates.

5. The method of claim 1,
wherein the step of determining the basic candidate pipeline comprises:
identifying a first number of pipelines from the characteristics of the data to be processed by the artificial intelligence and from the problem definition; and
selecting a second number of pipelines as the basic candidate pipelines based on the result of measuring the performance of the first number of pipelines using a randomly sampled evaluation dataset.

6. The method of claim 1,
wherein the step of determining the basic candidate pipeline comprises:
determining candidate pipeline templates of an artificial intelligence algorithm; and
generating each basic candidate pipeline by selecting an optimal combination of the unit steps constituting each of the determined candidate pipeline templates.

7. The method of claim 6,
wherein, in the step of determining the second additional candidate pipeline,
at least one of the optimal selection order of the unit steps applied in the step of determining the basic candidate pipeline and the evaluation dataset used for optimal selection is applied differently.

8. A system for automatically generating an artificial intelligence algorithm, comprising:
an artificial intelligence DB in which information on pipelines and the steps constituting each pipeline, evaluation data for evaluating each pipeline, and logs of previously performed evaluation and optimization are stored;
a feature engineering module that selects evaluation data and generates a pipeline by selecting an optimal task for each step constituting a pipeline template; and
a model optimization module that manages a plurality of candidate pipeline templates, manages a plurality of candidate pipelines, and determines the final artificial intelligence pipeline.

9. The system of claim 8,
wherein the model optimization module:
determines the candidate pipeline templates;
adds, based on the database log, pipelines having higher performance than the basic candidate pipelines as first additional candidate pipelines;
adds the templates to which the added first additional candidate pipelines belong to the candidate pipeline templates; and
determines, as the final artificial intelligence pipeline, the best-performing pipeline among the basic candidate pipelines, the first additional candidate pipelines, and the second additional candidate pipelines.

10. The system of claim 9,
wherein the feature engineering module
determines the basic candidate pipelines and the second additional candidate pipelines by selecting an optimal combination of the unit steps constituting each of the candidate pipeline templates.

11. The system of claim 8,
wherein the feature engineering module
determines the optimal selection order of the unit steps used in the optimality determination for generating the pipeline.

12. The system of claim 8,
wherein the feature engineering module
selects an optimal task for the first step in the optimal selection order among a series of steps constituting each candidate pipeline template, and
selects, with the previously selected optimal task kept as it is, an optimal task for the next step in the optimal selection order, repeating this up to the final step.