KR20210157302A

KR20210157302A - Method and Apparatus for Automatic Predictive Modeling Based on Workflow

Info

Publication number: KR20210157302A
Application number: KR1020210023785A
Authority: KR
Inventors: 성현중; 이중필; 이원석
Original assignee: (주)브릭
Priority date: 2020-06-19
Filing date: 2021-02-23
Publication date: 2021-12-28
Also published as: KR102605481B1; KR20210157303A; KR102605482B1

Abstract

Disclosed are a workflow-based automatic predictive modeling method and an apparatus therefor. The automatic predictive modeling method according to an embodiment of the present invention may comprise: an input step of obtaining initial input data for source data and prediction work information; a data analysis step of generating at least one node based on the initial input data, constructing a prediction workflow based on the at least one generated node, and converting the prediction workflow into workflow output data; and a workflow output step of outputting the converted workflow output data so that automatic predictive modeling is performed. The present invention thereby makes it possible to process predictions automatically.

Description

Method and Apparatus for Automatic Predictive Modeling Based on Workflow

The present invention relates to an automatic predictive modeling method based on a workflow composed of at least one node, and to an apparatus therefor.

The content described in this section merely provides background information on embodiments of the present invention and does not constitute prior art.

Data analysis using machine learning requires a variety of steps, such as preprocessing, feature extraction, classification, prediction, post-processing, and visualization. The nodes for the operations and algorithms required in each step may be implemented by different analysts or in different programming languages.

Nodes for data analysis differ in syntax, implementation method, operating environment, and so on, depending on the analyst or programming language, and each analyst performing data analysis chooses and learns a preferred programming language and then works mainly in it.

When multiple analysts must collaborate in an environment that requires complex system analysis, collaboration becomes difficult because the analysts use different programming languages and different node environments. In addition, when multiple analysts collaborate, connecting the many algorithms configured for each step into an integrated workflow becomes increasingly complex.

A main object of the present invention is to provide a workflow-based automatic predictive modeling method, and an apparatus therefor, that analyzes data patterns based on initial input data for source data and prediction work information entered by a user, generates algorithm nodes optimized for each step required for prediction, and generates and provides a workflow for automatic predictive modeling.

According to one aspect of the present invention, an automatic predictive modeling method for achieving the above object may include: an input step of obtaining initial input data for source data and prediction work information; a data analysis step of generating at least one node based on the initial input data, configuring a prediction workflow based on the at least one generated node, and converting the prediction workflow into workflow output data; and a workflow output step of outputting the converted workflow output data so that automatic predictive modeling is performed.

According to another aspect of the present invention, an automatic predictive modeling apparatus for achieving the above object may include: an input unit that obtains initial input data for source data and prediction work information; a data analysis unit that generates at least one node, configures a prediction workflow based on the at least one generated node, and converts the prediction workflow into workflow output data; and a workflow output unit that outputs the converted workflow output data so that automatic predictive modeling is performed.

As described above, the present invention has the effect of providing a user with an optimal workflow for performing automatic prediction on a given task and data.

In addition, the present invention has the effect of processing automatic prediction by integrating nodes implemented by multiple analysts or in heterogeneous programming languages.

FIG. 1 is a block diagram schematically illustrating an automatic predictive modeling apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically illustrating the configuration of a data analysis unit according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an automatic predictive modeling method according to an embodiment of the present invention.
FIG. 4 is a block diagram schematically illustrating the configuration of a prediction node configuration unit according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating an operation of configuring prediction nodes according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram illustrating a workflow for automatic predictive modeling according to an embodiment of the present invention.
FIG. 7 is a diagram showing the configuration of an automatic predictive modeling apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they might obscure the gist of the present invention. Although preferred embodiments of the present invention are described below, the technical idea of the present invention is not limited or restricted to them and may be modified and implemented in various ways by those skilled in the art. The workflow-based automatic predictive modeling method proposed by the present invention, and the apparatus therefor, are described in detail below with reference to the drawings.

FIG. 1 is a block diagram schematically illustrating an automatic predictive modeling apparatus according to an embodiment of the present invention.

The automatic predictive modeling apparatus 100 according to the present embodiment includes an input unit 110, a data analysis unit 120, and a workflow output unit 130. The automatic predictive modeling apparatus 100 of FIG. 1 is one embodiment; not all blocks shown in FIG. 1 are essential components, and in other embodiments some blocks of the automatic predictive modeling apparatus 100 may be added, changed, or deleted. The automatic predictive modeling apparatus 100 may be implemented as a computing device, and each of its components may be implemented as a separate software program or as a separate hardware device combined with software.

The automatic predictive modeling apparatus 100 analyzes data patterns based on the prediction work information and source data selected by the user's manipulation or input, selects an algorithm optimized for each step required for prediction to generate nodes, and configures a workflow for automatic predictive modeling based on the generated nodes. Here, the nodes of the workflow are implemented with various algorithms so that they can perform the functions required for prediction (data preprocessing, model training, etc.), and because they share the same format for input and output values, multiple nodes can be connected to one another to carry out a desired analysis flow. Each node may be implemented in any of various programming languages (R, Python, Java, C, Go, Julia, etc.), and the automatic predictive modeling apparatus 100 may configure an optimal workflow in consideration of the speed and function of each node. The automatic predictive modeling apparatus 100 visualizes the final configured workflow and provides it to the user so that it can be executed for automatic predictive analysis.
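The idea that nodes sharing one input/output format can be chained freely can be illustrated with a minimal Python sketch. The `Node` class, the `Record` dictionary format, and the two example nodes are hypothetical names chosen for illustration; the specification does not define a concrete node interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Assumption: every node consumes and produces the same dictionary-shaped record.
Record = Dict[str, Any]


@dataclass
class Node:
    name: str                        # hypothetical node name
    language: str                    # implementation language (informational only here)
    run: Callable[[Record], Record]  # the node's function


def run_workflow(nodes: List[Node], data: Record) -> Record:
    """Pass the record through each node in order."""
    for node in nodes:
        data = node.run(data)
    return data


def fill_missing(rec: Record) -> Record:
    # Preprocessing node: replace missing values with 0.0.
    values = [v if v is not None else 0.0 for v in rec["values"]]
    return {**rec, "values": values}


def mean_model(rec: Record) -> Record:
    # Toy "model" node: predict the mean of the values.
    values = rec["values"]
    return {**rec, "prediction": sum(values) / len(values)}


workflow = [
    Node("fill_missing", "Python", fill_missing),
    Node("mean_model", "Python", mean_model),
]
result = run_workflow(workflow, {"values": [1.0, None, 5.0]})
```

Because each node only depends on the shared record format, nodes can be reordered or swapped without changing `run_workflow`.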

Each component of the automatic predictive modeling apparatus 100 is described below.

The input unit 110 obtains initial input data including source data and prediction work information. Here, the prediction work information and the source data may be obtained by upload after being selected by the user's manipulation or input, but are not necessarily limited thereto, and may also be information and data retrieved from an external device.

The input unit 110 may obtain initial input data that includes source data comprising at least one of continuous numbers, character variables, image data, and the like, together with prediction work information on the prediction variable and prediction type determined by the user's manipulation or input.

The input unit 110 may include a task selection unit 112 and an input data acquisition unit 114.

The input data acquisition unit 114 obtains source data comprising at least one of continuous numbers, character variables, image data, and the like. Here, the source data is the data on which the user wants to make predictions by means of an algorithm.

The source data may be data composed of continuous numeric or character variables of a predetermined bit width (e.g., 64 bits), image data such as JPEG, and the like, and may take the form of structured data in tables that conform to database rules or the form of image files.

For example, the source data obtained by the input data acquisition unit 114 may include numeric and character data in tabular form, as shown in [Table 1].

The source data obtained by the input data acquisition unit 114 may also include image data in file form, such as File 1: image_1.jpeg (1080x720), File 2: image_2.jpeg (1080x720), and File 3: image_3.jpeg (1080x720).

The task selection unit 112 obtains prediction work information on the prediction variable and prediction type determined by the user's manipulation or input.

The task selection unit 112 may obtain a prediction variable and a prediction type for the source data.

The prediction variable of the task selection unit 112, used for tabular source data, may be nominal data, ordinal data, interval data, ratio data, time series data, and the like. Here, nominal data are values that denote categories, such as gender, food menu items, and postal codes, and ordinal data are values that carry only an ordering (greater or lesser), such as rank, education level, and grades. Interval data are values that are quantitatively meaningful but have no absolute zero, such as temperature, time, and weight; ratio data are values that are quantitatively meaningful and have an absolute zero, such as age and defect count; and time series data are data with numeric values ordered in time, such as daily sales volume.

The prediction type of the task selection unit 112 may be classification prediction, regression prediction, time series prediction (forecast), clustering prediction, and the like.

Here, classification prediction predicts a variable in the form of nominal data: it identifies the relationship between explanatory variables and a response variable in order to determine the value the response variable can take when new explanatory variables are given. For example, classification prediction may learn from many images to predict whether a given image shows a lion or a tiger.

Regression prediction predicts a variable that is not nominal data: it identifies the relationship between explanatory variables and a response variable in order to determine the value the response variable can take when new explanatory variables are given. For example, regression prediction may predict a car's fuel efficiency from explanatory variables such as manufacturer, maximum speed, weight, and size.

Time series prediction identifies patterns in past data to infer the patterns that future data, which has not yet occurred, may have, and it uses time series data. For example, time series prediction may predict tomorrow's rainfall from past rainfall data.

Clustering prediction is independent of the variable type and forms clusters based on the similarity between observations. For example, clustering prediction may form personality-type clusters from a personality questionnaire and predict the cluster to which a particular person belongs.
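The pairing between prediction types and variable types described above can be written down as a small lookup, sketched here in Python. The enum and function names are hypothetical, and the rules are a literal reading of the text (for instance, regression is treated as accepting any non-nominal variable).

```python
from enum import Enum


class VariableType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"
    TIME_SERIES = "time_series"


class PredictionType(Enum):
    CLASSIFICATION = "classification"
    REGRESSION = "regression"
    FORECAST = "forecast"
    CLUSTERING = "clustering"


def is_compatible(prediction: PredictionType, variable: VariableType) -> bool:
    """Return True if the prediction type applies to the variable type."""
    if prediction is PredictionType.CLASSIFICATION:
        # Classification predicts nominal variables.
        return variable is VariableType.NOMINAL
    if prediction is PredictionType.REGRESSION:
        # Regression predicts variables that are not nominal.
        return variable is not VariableType.NOMINAL
    if prediction is PredictionType.FORECAST:
        # Time series prediction uses time series data.
        return variable is VariableType.TIME_SERIES
    # Clustering is independent of the variable type.
    return True
```

A check like this could be applied when the user selects a prediction variable and prediction type, before any nodes are generated.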

The data analysis unit 120 configures a prediction workflow for automatic predictive modeling based on the initial input data. Specifically, the data analysis unit 120 analyzes the initial input data to generate at least one node and configures a prediction workflow based on the at least one generated node. The data analysis unit 120 also converts the prediction workflow into workflow output data.

The configuration and specific operation of the data analysis unit 120 are described in detail with reference to FIGS. 2 and 4.

The workflow output unit 130 outputs the workflow output data into which the prediction workflow has been converted, so that automatic predictive modeling is performed.

The workflow output unit 130 may output the workflow output data on a built-in display, but is not necessarily limited thereto, and may also transmit the workflow output data so that automatic predictive modeling is performed on a separate external device or user terminal.

By outputting the workflow output data, the workflow output unit 130 shows the user the node configuration of the workflow and allows the workflow to be modified or the workflow-based automatic predictive modeling to be executed.

FIG. 2 is a block diagram schematically illustrating the configuration of a data analysis unit according to an embodiment of the present invention.

The data analysis unit 120 according to the present embodiment may include a prediction node configuration unit 210 and a workflow visualization processing unit 220. The data analysis unit 120 of FIG. 2 is one embodiment; not all blocks shown in FIG. 2 are essential components, and in other embodiments some blocks of the data analysis unit 120 may be added, changed, or deleted. Each component of the data analysis unit 120 may be implemented as a separate software program or as a separate hardware device combined with software.

The prediction node configuration unit 210 generates at least one node for each step of automatic prediction and configures a prediction workflow for automatic prediction based on the at least one node.

The prediction node configuration unit 210 generates the nodes required for automatic prediction based on the initial input data, which includes the source data and prediction work information, then selects all or some of the generated nodes and connects them to configure the prediction workflow.

Because testing the connectivity of every node that could be generated is inefficient, the prediction node configuration unit 210 may configure the prediction workflow by selecting nodes through an optimization algorithm. That is, the prediction node configuration unit 210 may configure the prediction workflow through a four-step process: verification method configuration, preprocessing, model training, and verification. Each step in the prediction node configuration unit 210 is described in detail with reference to FIG. 4.

The workflow visualization processing unit 220 converts the prediction workflow into workflow output data so that it can be presented to the user in visual form.

Although the workflow visualization processing unit 220 is described as converting a prediction workflow containing the finally selected final nodes into workflow output data, it is not necessarily limited thereto; it may also convert the prediction workflow into workflow output data in a form that includes at least one candidate node associated with each final node, so that the user can change the node at each step.
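One way to picture workflow output data that carries either only the final nodes or the final nodes plus their candidates is the following Python sketch. The JSON layout, the field names, and the convention that each stage lists its nodes best-first are all assumptions for illustration; the specification does not fix an output format.

```python
import json
from typing import Dict, List


def to_workflow_output(stages: Dict[str, List[str]],
                       include_candidates: bool = False) -> str:
    """Serialize a workflow as JSON.

    `stages` maps a stage name to its node names, with the selected
    (final) node first and any alternatives after it (assumed ordering).
    """
    output = []
    for stage, nodes in stages.items():
        entry = {"stage": stage, "final_node": nodes[0]}
        if include_candidates:
            # Keep the alternatives so a user could swap the node for this stage.
            entry["candidate_nodes"] = nodes[1:]
        output.append(entry)
    return json.dumps({"workflow": output})


# Hypothetical stage and node names.
example = to_workflow_output(
    {"preprocessing": ["scale", "normalize"], "model_training": ["random_forest"]},
    include_candidates=True,
)
```

With `include_candidates=False` the output lists only the final node per stage, which corresponds to the simpler form described first.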

The data analysis unit 120 according to the present embodiment may further include an optimization language selection unit (not shown).

When a node exists that performs the same function as a given node in the prediction workflow output from the prediction node configuration unit 210 but is written in a different programming language, the optimization language selection unit (not shown) selects the optimal node, written in the optimal language for that function, and configures the final prediction workflow.

When a prediction workflow is being configured and there are several nodes that perform the same function but are written in different languages, the optimization language selection unit (not shown) selects the optimal node so that the processing speed of the workflow is optimized, taking into account the connections to the preceding and following nodes. Here, the criterion for selecting the optimal node written in the optimal language may be determined through statistical testing.

The prediction node configuration unit 210 is configured on the basis of prediction performance, regardless of the language used in each node. Prediction performance, however, varies with the dataset and the initial model values. Therefore, when the prediction node configuration unit 210 configures and recommends multiple prediction workflows and prediction performance differs by language across those workflows, the optimization language selection unit (not shown) may select an optimal language and thereby select the final prediction workflow.

When the prediction node configuration unit 210 configures and recommends multiple prediction workflows and there is no significant difference in prediction performance among them, the optimization language selection unit (not shown) may additionally consider memory usage, execution time, and the like, and select the single final prediction workflow that reduces system load.

For example, for a given node (step), the same method of generating derived variables may be implemented in R or in Python. Random forest, a classification algorithm, likewise has versions in both R and Python, which differ in prediction performance and running speed. Accordingly, when the prediction node configuration unit 210 selects an algorithm for model training, the choice may cover not only the type of algorithm but also the language in which it is written.

In general, the same algorithm behaves similarly across languages, so when there is no difference in performance, the optimization language selection unit (not shown) may select the language that uses less memory or runs faster.

The optimization language selection unit (not shown) uses statistical testing as the criterion for deciding whether the prediction workflows configured by the prediction node configuration unit 210 differ in prediction performance.

For example, when k prediction performance values have been measured for each prediction workflow by the verification method configuration unit 410 of the prediction node configuration unit 210, each workflow has its own prediction performance indicators, so differences in mean prediction performance can be tested through ANOVA. The nodes used in the workflows may overlap with one another; under the principle of the genetic algorithm, a node that contributed to good performance in an earlier stage is retained while nodes in other stages are changed, so such overlap is common.
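As a minimal sketch of the ANOVA comparison described above, the following Python function computes the one-way ANOVA F statistic from the k performance scores of each workflow, using only the standard library. The function name and the sample scores are illustrative assumptions. Turning F into a p-value requires the F distribution, which the standard library does not provide; a library routine such as `scipy.stats.f_oneway` returns both values.

```python
from statistics import mean
from typing import Sequence


def anova_f(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic; each group holds one workflow's scores.

    Raises ZeroDivisionError if all scores within every group are identical.
    """
    num_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (num_groups - 1)
    ms_within = ss_within / (n_total - num_groups)
    return ms_between / ms_within


# Hypothetical scores: three workflows, k = 3 measurements each.
scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_value = anova_f(scores)
```

A large F relative to the critical value for the chosen significance level would indicate that the workflows differ in mean performance; otherwise they can be treated as statistically equivalent for the resource-based selection.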

For the top-performing workflows whose prediction performance does not differ statistically in this way, the optimization language selection unit (not shown) determines memory usage and execution time.

The memory usage and execution time of the prediction workflows may be stored in the optimization language selection unit (not shown), with one value of each per prediction workflow.

The priority between memory and time is, in principle, left to the user's judgment, but some conditions may also be built into the optimization language selection unit (not shown).

Because the stability of the automatic predictive modeling apparatus 100 comes first, if the current memory load of the system is at or above a certain level, the optimization language selection unit (not shown) gives top priority to the prediction workflow with the lowest memory usage. If, on the other hand, the memory load of the automatic predictive modeling apparatus 100 is below that level, the optimization language selection unit (not shown) gives top priority to execution time.
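The memory-versus-time priority can be sketched in Python as follows. This assumes the low-memory workflow is preferred when the current memory load is high and the fastest workflow is preferred otherwise; the `WorkflowStats` class, the `memory_load` fraction, and the 0.8 threshold are hypothetical and are not values given in the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowStats:
    name: str
    memory_mb: float  # measured memory usage of the workflow
    seconds: float    # measured execution time of the workflow


def pick_workflow(candidates: List[WorkflowStats],
                  memory_load: float,
                  threshold: float = 0.8) -> WorkflowStats:
    """Choose among workflows with statistically equivalent performance.

    `memory_load` is the current system memory load as a fraction (0..1).
    The threshold value is an assumption for illustration.
    """
    if memory_load >= threshold:
        # System is under memory pressure: protect stability first.
        return min(candidates, key=lambda w: w.memory_mb)
    # Memory is available: prefer the fastest workflow.
    return min(candidates, key=lambda w: w.seconds)


candidates = [
    WorkflowStats("r_workflow", memory_mb=900.0, seconds=12.0),
    WorkflowStats("python_workflow", memory_mb=400.0, seconds=20.0),
]
```

For instance, `pick_workflow(candidates, memory_load=0.9)` returns the lower-memory workflow, while a load of 0.3 returns the faster one.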

Otherwise, when the conditions are unclear, the optimization language selection unit (not shown) tests the memory usage and execution time of all prediction workflows for the candidate languages, builds probability distributions from the test results, and calculates the probability density values of the memory usage and execution time of the prediction workflow under consideration.

If the memory usage or the execution time of a prediction workflow falls within the top 10% of the corresponding distribution, the optimization language selection unit (not shown) excludes that prediction workflow and selects the best prediction workflow on the basis of time. For example, a workflow's memory usage may lie in the top 5% of the memory distribution (very heavy memory use) while its execution time lies at the 50% point of the time distribution (average time).

In this case, because the memory usage is extreme even if the automatic predictive modeling apparatus 100 has resources to spare, the optimization language selection unit (not shown) selects a prediction workflow on the basis of execution time.
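The exclusion rule above can be sketched in Python under a simplifying assumption: instead of a fitted probability distribution, the sketch uses the empirical distribution of the measured values and treats a workflow as extreme when its value lies in the top 10% of the observations. The function names and the sample measurements are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def upper_tail_fraction(value: float, population: List[float]) -> float:
    """Fraction of observed values that are >= value (small means extreme)."""
    return sum(1 for v in population if v >= value) / len(population)


def select_by_time(stats: Dict[str, Tuple[float, float]],
                   cutoff: float = 0.10) -> Optional[str]:
    """`stats` maps a workflow name to (memory usage, execution seconds).

    Workflows whose memory or time lies in the top `cutoff` share of the
    observed values are excluded; the fastest remaining workflow is returned.
    """
    memories = [m for m, _ in stats.values()]
    times = [t for _, t in stats.values()]
    kept = {
        name: (m, t)
        for name, (m, t) in stats.items()
        if upper_tail_fraction(m, memories) > cutoff
        and upper_tail_fraction(t, times) > cutoff
    }
    if not kept:
        return None
    return min(kept, key=lambda name: kept[name][1])


# Hypothetical measurements: w10 is the fastest but uses the most memory.
measurements = {f"w{i}": (100.0 * i, float(i + 1)) for i in range(1, 10)}
measurements["w10"] = (1000.0, 1.0)
chosen = select_by_time(measurements)
```

Here the fastest workflow overall is rejected for its extreme memory usage, and the choice falls to the fastest of the remaining workflows.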

When connected nodes are implemented in different languages, the optimization language selection unit (not shown) provides a way for those nodes to interoperate.

For example, nodes written in different languages, such as R and Python, have different data structures, so data cannot be passed to the next node.

To allow nodes implemented in different languages to interoperate, the optimization language selection unit (not shown) uses meta information to convert data into a common format that both R and Python can understand. The optimization language selection unit (not shown) unifies the input and output of every node in JSON format, and when a node is written, the step of converting the JSON data is included in the R script or Python script.
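On the Python side, a node whose input and output are both JSON could look like the following sketch. The `json_node` decorator and the `add_ratio` node are hypothetical names; an R node would perform the equivalent parse and serialize steps in its own script.

```python
import json
from typing import Any, Callable, Dict


def json_node(func: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Callable[[str], str]:
    """Wrap a node so that it accepts a JSON string and returns a JSON string."""
    def wrapper(payload: str) -> str:
        record = json.loads(payload)   # JSON text -> Python dict
        result = func(record)          # node logic on native structures
        return json.dumps(result)      # Python dict -> JSON text
    return wrapper


@json_node
def add_ratio(record: Dict[str, Any]) -> Dict[str, Any]:
    # Example derived-variable node: add a ratio column.
    record["ratio"] = record["sold"] / record["stock"]
    return record


output = add_ratio('{"sold": 5, "stock": 20}')
```

Because only JSON text crosses the node boundary, the next node can be written in any language with a JSON parser.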

Each workflow may reside in its own independent Docker environment in the automatic predictive modeling apparatus 100. Execution speed is determined from the time between the input value being passed to the first node in the Docker environment and the output value emerging from the last node. Memory usage can also be determined, because a distributed processing system automatically allocates the resources required by each independent environment.

FIG. 3 is a flowchart illustrating an automatic predictive modeling method according to an embodiment of the present invention.

The automatic predictive modeling apparatus 100 acquires initial input data including source data and prediction work information (S310, S320). Here, the automatic predictive modeling apparatus 100 may acquire initial input data that includes source data of at least one type, such as continuous numbers, character variables, or image data, together with prediction work information on the prediction variable and the prediction type determined by a user's operation or input.

The automatic predictive modeling apparatus 100 analyzes the initial input data to generate at least one node (S330), and configures a prediction workflow based on the at least one generated node (S340).

The automatic predictive modeling apparatus 100 generates the nodes required for automatic prediction based on the initial input data including the source data and the prediction work information, selects all or some of the generated nodes, and connects them to configure a prediction workflow. The automatic predictive modeling apparatus 100 then converts the configured prediction workflow into workflow output data.

The automatic predictive modeling apparatus 100 outputs the workflow output data into which the prediction workflow has been converted, so that automatic predictive modeling is performed (S350). The automatic predictive modeling apparatus 100 may output the workflow output data on a display provided in the apparatus, but is not necessarily limited thereto, and may instead transmit the workflow output data so that automatic predictive modeling is performed on a separate external device or user terminal.

Although FIG. 3 describes the steps as being executed sequentially, the method is not necessarily limited thereto. In other words, the steps described in FIG. 3 may be reordered, or one or more steps may be executed in parallel, so FIG. 3 is not limited to a chronological order.

The automatic predictive modeling method according to the present embodiment described in FIG. 3 may be implemented as an application (or program) and recorded on a recording medium readable by a terminal device (or computer). The recording medium on which the application (or program) for implementing the automatic predictive modeling method according to the present embodiment is recorded, and which the terminal device (or computer) can read, includes any type of recording device or medium that stores data readable by a computing system.

FIG. 4 is a block diagram schematically illustrating the configuration of a prediction node configuration unit according to an embodiment of the present invention.

The prediction node configuration unit 210 according to the present embodiment includes a verification method configuration unit 410, a preprocessing unit 420, a model learning processing unit 430, and a verification processing unit 440. The prediction node configuration unit 210 of FIG. 4 is according to one embodiment; not all blocks shown in FIG. 4 are essential components, and in other embodiments some blocks included in the prediction node configuration unit 210 may be added, changed, or deleted. Each component included in the prediction node configuration unit 210 may be implemented as a separate software program, or as a separate hardware device combined with software.

The verification method configuration unit 410 divides the initial input data into training data and verification data.

To verify the prediction performance of the prediction model, the verification method configuration unit 410 separates the training data used to generate the prediction model from the verification data used for verification.

There are various ways for the verification method configuration unit 410 to divide the acquired initial input data into training data and verification data, and the splitting method may be selected according to the size of the data. For example, the verification method configuration unit 410 may split the initial input data by selecting one of Monte Carlo validation, k-fold cross-validation, leave-p-out cross-validation, and the like.

The Monte Carlo validation method uses data randomly sampled from the entire data set on each iteration as the verification data; because it is fast, it is suited to large data sets. The k-fold cross-validation method divides the data into k data sets according to a predetermined condition, uses k-1 of them as training data and the remaining one as verification data, and repeats this k times to calculate the average prediction performance; because it is slow, it is suited to small data sets.

In addition, the leave-p-out cross-validation method selects p observations as verification data and uses the rest for training. It works in the same way as k-fold but can measure performance by a stricter criterion; because it is very slow, it is suited to small data sets below a preset size threshold.
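The three splitting schemes can be sketched with index generators as below. This is a minimal standard-library illustration; the function names and the fixed random seed are assumptions, and a production system would more likely rely on an existing library.

```python
import random
from itertools import combinations


def monte_carlo_splits(n, test_size, repeats, seed=0):
    """Each repeat draws a fresh random verification set (sets may overlap across repeats)."""
    rng = random.Random(seed)
    indices = list(range(n))
    for _ in range(repeats):
        test = set(rng.sample(indices, test_size))
        train = [i for i in indices if i not in test]
        yield train, sorted(test)


def k_fold_splits(n, k):
    """Partition indices into k folds; each fold is the verification set exactly once."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(train), folds[held_out]


def leave_p_out_splits(n, p):
    """Enumerate every possible verification set of size p: C(n, p) splits in total."""
    indices = list(range(n))
    for test in combinations(indices, p):
        held = set(test)
        yield [i for i in indices if i not in held], list(test)


folds = list(k_fold_splits(10, 5))
lpo = list(leave_p_out_splits(5, 2))
mc = list(monte_carlo_splits(10, 3, 4))
```

The number of splits explains the speed ordering given above: Monte Carlo runs as many repeats as requested, k-fold runs k, and leave-p-out runs one per combination, which grows very quickly with the data size.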

The preprocessing unit 420 converts the training data and the verification data so that each can be used in a prediction model.

Although the usable forms and types of data have been defined, the data cannot be used in a prediction model as it is. Because a prediction model generally learns the relationships among variables, the preprocessing unit 420 must perform preprocessing to convert the data into a format suited to that purpose.

For each of the training data and the verification data, the preprocessing unit 420 performs preprocessing that includes all or some of the following processes: checking and converting the data type of each variable, handling missing values, identifying and removing outliers, and generating derived variables.

The process of checking and converting the data type of each variable confirms whether the data fits the prediction purpose selected by the user. For example, if classification prediction has been selected but the response variable is of a ratio data type, the preprocessing unit 420 converts it to a nominal data type.

The missing value handling process removes missing values or replaces them with other values, because missing values may prevent learning or may affect the prediction. For example, when predicting vehicle fuel efficiency, if some observations contain NA in the weight variable, the preprocessing unit 420 replaces those values with the mean of the weight variable.

The outlier identification and removal process removes a value, or replaces it with another value, when it lies far outside the range that the variable can take or differs greatly from the other values. For example, if the values of the vehicle weight variable range from 0.1 to 2 tons but also include a value of 1,000 tons, the preprocessing unit 420 removes the 1,000-ton observation.

The derived variable generation process creates a new explanatory variable by transforming an existing explanatory variable in order to improve prediction performance. For example, if vehicle fuel efficiency and weight are related in an exponentially increasing manner, the preprocessing unit 420 generates a derived weight variable by squaring the weight variable so that prediction performance improves.
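The three examples above (mean imputation, outlier removal, and a squared derived variable) can be sketched as small functions. The function names, the use of `None` for a missing value, and the explicit valid range are assumptions for illustration; the source does not state how the valid range of a variable is determined.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def drop_out_of_range(rows, key, low, high):
    """Remove observations whose value lies outside an assumed valid range."""
    return [r for r in rows if low <= r[key] <= high]


def add_squared(rows, key):
    """Add a derived variable holding the square of an existing variable."""
    return [{**r, f"{key}_squared": r[key] ** 2} for r in rows]


# Illustrative vehicle weights in tons.
weights = impute_mean([1.0, None, 2.0])
cars = [{"weight": 0.5}, {"weight": 2.0}, {"weight": 1000.0}]
cleaned = add_squared(drop_out_of_range(cars, "weight", 0.1, 2.0), "weight")
```

Outliers are removed before the derived variable is computed so that an implausible value such as 1,000 tons is not amplified by the squaring step.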

The model learning processing unit 430 takes the training data as input and selects specific algorithms from among at least one algorithm, analyzes the patterns of the training data with those algorithms to select candidate algorithms, and generates a prediction model based on the final algorithm selected by applying the verification data to the candidate algorithms.

The model learning processing unit 430 may generate a prediction model using various statistical models, machine learning models, deep learning models, user-defined models, and the like, and each type of model may include various algorithms.

The prediction model generated by the model learning processing unit 430 is produced by a process comprising step i, in which the training data is received; step ii, in which an algorithm analyzes the patterns of the training data to maximize prediction performance; and step iii, in which the verification data is applied to the model generated in step ii to check its performance.

For example, when the model learning processing unit 430 processes a classification prediction model, the configuration and operation of the prediction model by function, and the representative algorithms, are as follows.

Step i) The model learning processing unit 430 feeds the explanatory variables and response variable specified by the user into the algorithm as input values, and the system adjusts the detailed parameters required for algorithm training. Because prediction performance varies with the parameter values, this step also configures how the parameters are searched and adjusted.

Step ii-a) The model learning processing unit 430 prepares a plurality of algorithms that can be used for classification prediction. Here, the classification algorithms may be a classification random forest, a logistic linear regression model, a classification neural network, a support vector machine, and the like.

Step ii-b) The model learning processing unit 430 searches for the parameters to be used in the algorithm. Here, the parameters to be used in the algorithm are options that affect the complexity, speed, accuracy, and so on of the algorithm.

The parameter search refers to an operation in which some of all the parameter conditions that the algorithm can take for the given data are tested at random, conditions whose initial test performance is below a preset threshold are discarded, conditions at or above the threshold are retained, and the other parameters are then adjusted. Here, the algorithm used for the parameter search is preferably a genetic algorithm, but is not necessarily limited thereto.
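The search loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: it mutates one parameter of each surviving condition per round rather than implementing a full genetic algorithm with crossover, and the function name, the `score` callback, and the example grid are hypothetical.

```python
import itertools
import random


def search_parameters(grid, score, threshold, sample_size, rounds, seed=0):
    """Randomly test some conditions, keep those at or above threshold, then vary them."""
    rng = random.Random(seed)
    names = list(grid)
    all_conditions = [
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    ]
    tested = {}  # parameter tuple -> score, so no condition is evaluated twice
    pool = rng.sample(all_conditions, min(sample_size, len(all_conditions)))

    for _ in range(rounds):
        survivors = []
        for cond in pool:
            key = tuple(cond[n] for n in names)
            if key not in tested:
                tested[key] = score(cond)
            if tested[key] >= threshold:   # discard conditions below the threshold
                survivors.append(cond)
        if not survivors:
            break
        # Adjust one other parameter of each surviving condition for the next round.
        pool = []
        for cond in survivors:
            name = rng.choice(names)
            pool.append({**cond, name: rng.choice(grid[name])})

    best_key = max(tested, key=tested.get)
    return dict(zip(names, best_key)), tested[best_key]


# Hypothetical grid and scoring function, for illustration only.
grid = {"depth": [1, 2, 3, 4], "trees": [10, 50, 100]}
best_params, best_score = search_parameters(
    grid, lambda c: c["depth"] * c["trees"] / 400,
    threshold=0.5, sample_size=12, rounds=3,
)
```

With `sample_size` equal to the size of the grid the first round degenerates into an exhaustive search; the random sampling pays off only when the grid is much larger than the number of tests that can be afforded.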

Step iii) The model learning processing unit 430 selects and stores the algorithm and parameters that performed best in the test on the verification data.

When the model learning processing unit 430 performs model learning for a regression prediction model, a time series prediction model, a cluster prediction model, or the like, the process does not differ significantly from the classification prediction model process described above, although the algorithms may differ. For example, a regression prediction model may include algorithms such as a regression random forest, a linear regression model, a principal component regression model, and a partial least squares regression model, and a time series prediction model may include algorithms such as an ARIMA model and a GARCH model. A cluster prediction model may include algorithms such as K-means and a nearest-neighbor selection model.

The verification processing unit 440 verifies the prediction model using a preset verification metric and thereby configures the prediction workflow.

The verification processing unit 440 selects the metric by which the trained prediction model is to be evaluated. Here, various metrics may be used for verification.

Although the verification processing unit 440 is described as operating separately from the model learning processing unit 430, it is not necessarily limited thereto, and may be included in the model learning processing unit 430 to verify the prediction model.

The verification processing unit 440 verifies whether the generated prediction model also performs well on the verification data. The verification processing unit 440 performs a relative evaluation between models rather than an absolute performance evaluation, and uses a metric that compares the values predicted by the prediction model with the actual values.

For example, the verification metric for a classification prediction model may be the F1 score, the area under the curve, or the like, and the verification metric for a regression prediction model or a time series prediction model may be the root mean square error, R squared, or the like. The verification metric for a cluster prediction model may be the Dunn index, the Jaccard index, or the like.
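Three of the metrics named above can be computed directly from predicted and actual values, as in the following sketch. The function names are chosen for the example; the formulas are the standard definitions of the F1 score, the root mean square error, and R squared.

```python
import math


def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(actual, predicted):
    """Root mean square error; lower is better."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Note that the direction differs by metric: a higher F1 score or R squared is better, whereas a lower root mean square error is better, which matters when models are ranked against each other.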

The verification processing unit 440 preferably applies one metric to each prediction model, but is not necessarily limited thereto. For example, the verification processing unit 440 may perform a primary verification of each prediction model using the first of the listed metrics as the evaluation criterion and, if the primary verification yields identical performance, perform a secondary verification using the second of the listed metrics as the evaluation criterion.
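The primary and secondary verification can be expressed as an ordered sort key, as in this sketch. The function name, the result layout, and the scores shown are hypothetical values for illustration, not measurements.

```python
def rank_models(results, metrics):
    """Rank models by the first metric, breaking ties with the following ones.

    results: {model_name: {metric_name: value}}
    metrics: ordered list of (metric_name, higher_is_better) pairs.
    """
    def key(name):
        # Negate lower-is-better metrics so that a larger key is always better.
        return tuple(
            results[name][m] if higher else -results[name][m]
            for m, higher in metrics
        )
    return sorted(results, key=key, reverse=True)


# Hypothetical scores: two models tie on F1 and are separated by AUC.
results = {
    "random_forest": {"f1": 0.90, "auc": 0.92},
    "svm": {"f1": 0.90, "auc": 0.95},
    "logistic": {"f1": 0.85, "auc": 0.99},
}
ranking = rank_models(results, [("f1", True), ("auc", True)])
```

The second metric is consulted only for models that are exactly tied on the first, so a model with a lower primary score never overtakes one with a higher primary score.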

The prediction node configuration unit 210 according to the present embodiment may generate a prediction flow that proceeds in a plurality of steps. For example, when the prediction flow proceeds in four steps, each step may offer tens to hundreds of options. If each step has n options, the number of possible configurations is up to n⁴, so the time complexity and space complexity become enormous.
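As a worked illustration of this growth, let n_i be the number of options at step i. The symbol n_i and the value n = 100 are assumptions chosen for the example, not figures from the embodiment.

```latex
\[
  N \;=\; \prod_{i=1}^{4} n_i \;\le\; n^{4},
  \qquad
  n = 100 \;\Rightarrow\; N \le 100^{4} = 10^{8}.
\]
```

Even at the lower end of the stated range, n = 10 already gives up to 10,000 candidate flows, each of which would require training and verifying a model.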

To solve this problem, the prediction node configuration unit 210 according to the present embodiment may configure the prediction workflow using a genetic algorithm. Here, a genetic algorithm is an optimization technique that, rather than checking every one of a near-infinite number of options, selects the best value at each step and remembers the characteristics that produced it, thereby gradually evolving toward an optimal solution.

For automatic configuration, the prediction node configuration unit 210 first selects several nodes at random to configure workflows, and keeps the nodes used in the workflows with excellent prediction performance among the available options. The prediction node configuration unit 210 continues this process through the genetic algorithm, crossing the nodes of workflows with excellent prediction performance to configure new workflows and occasionally adding nodes at random along the way. During this process, the prediction node configuration unit 210 excludes nodes that show poor prediction performance and repeats the process until the prediction performance no longer improves, ultimately selecting only the nodes that had a positive effect on prediction performance to configure the final prediction workflow.

Hereinafter, the operation of the prediction node configuration unit 210 for selecting optimal nodes in a prediction flow composed of four steps will be described.

Of steps 1 to 4, the initial step (step 1) and the final step (step 4) are selected by the operations of the verification method configuration unit 410, the preprocessing unit 420, the model learning processing unit 430, and the verification processing unit 440. To select the optimal prediction flow, the prediction node configuration unit 210 additionally verifies the combinations of connections between step 2 and step 3.

For example, if there are 10 methods in the derived variable generation of step 2 and 10 algorithms available in step 3, a total of 100 combinations arise, and testing all of them would consume too many resources. Accordingly, the prediction node configuration unit 210 according to the present embodiment searches for the optimal combination based on a genetic algorithm.

The prediction node configuration unit 210 randomly selects two to three methods in each of step 2 and step 3, and tests the combinations of the selected methods. For example, the prediction node configuration unit 210 selects three of the various derived variable generation methods of step 2 and three of the algorithms available in step 3, and performs only nine tests in total.

Thereafter, the prediction node configuration unit 210 measures, through step 4, the performance of each model produced by steps 2 and 3. The prediction node configuration unit 210 then selects the two to three best-performing models and stores the derived variable generation method (the method selected in step 2) and the algorithm (the method selected in step 3) used by each.

For example, the prediction node configuration unit 210 selects the four best-performing combinations from among a near-infinite number of combinations.

For example, suppose the measured results are workflow 1: node (2.1) -> node (3.2) = 90% performance; workflow 2: node (2.2) -> node (3.6) = 88% performance; workflow 3: node (2.1) -> node (3.1) = 85% performance; and workflow 4: node (2.4) -> node (3.7) = 80% performance. The prediction node configuration unit 210 may then select and store methods 1, 2, and 4 of step 2 and methods 2, 6, 1, and 7 of step 3 as the top-performing results.
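The selection in this example can be reproduced with a short sketch. The four scored workflows are the ones given above; the fifth entry, the function name, and the record layout are assumptions added to show that lower-ranked workflows are cut.

```python
def retained_methods(results, top_k=4):
    """Return the step-2 and step-3 methods used by the top_k workflows, in rank order."""
    top = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    step2, step3 = [], []
    for r in top:
        if r["step2"] not in step2:   # keep each method once, preserving rank order
            step2.append(r["step2"])
        if r["step3"] not in step3:
            step3.append(r["step3"])
    return step2, step3


results = [
    {"step2": 1, "step3": 2, "score": 0.90},  # workflow 1
    {"step2": 2, "step3": 6, "score": 0.88},  # workflow 2
    {"step2": 1, "step3": 1, "score": 0.85},  # workflow 3
    {"step2": 4, "step3": 7, "score": 0.80},  # workflow 4
    {"step2": 3, "step3": 5, "score": 0.60},  # hypothetical extra result, not in the source
]
step2_kept, step3_kept = retained_methods(results)
```

Here `step2_kept` is `[1, 2, 4]` and `step3_kept` is `[2, 6, 1, 7]`, matching the methods listed in the example.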

Based on the selected results, the prediction node configuration unit 210 either fixes the derived variable method and chooses the step-3 algorithm so as to form a combination that has not yet been tried, or fixes the selected algorithm and chooses the step-2 derived variable method so as to form a combination that has not yet been tried.

For example, the prediction node configuration unit 210 may run tests with workflows such as workflow 1: node (2.1) -> node (3.11), workflow 2: node (2.2) -> node (3.3), and workflow 3: node (2.1) -> node (3.4), in which methods 1, 2, and 4 of the step-2 node are fixed and the step-3 node is chosen to form combinations that have not been tried before. Alternatively, methods 2, 6, 1, and 7 of the step-3 node may be fixed and the step-2 node chosen to form combinations that have not been tried before.

The prediction node configuration unit 210 may repeat the above process until the performance of the prediction model measured in step 4 no longer improves. That is, the prediction node configuration unit 210 may search for the optimal nodes by retaining the well-performing methods and algorithms at each step while applying untried methods to see whether a better one exists.
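The repeated search over step-2 and step-3 combinations can be sketched as below. This is a simplified greedy variant written for illustration, not a full genetic algorithm and not the embodiment's implementation; the function name, the `evaluate` callback standing in for the step-4 measurement, and the batch size are assumptions.

```python
import random


def search_combinations(step2_methods, step3_methods, evaluate, sample=3, keep=4, seed=0):
    """Test a few random pairs, then extend the best ones with untried partners."""
    rng = random.Random(seed)
    tested = {}  # (step2 method, step3 method) -> measured performance

    def run(pairs):
        for pair in pairs:
            if pair not in tested:
                tested[pair] = evaluate(*pair)

    # Initial round: e.g. 3 x 3 = 9 tests instead of the full grid.
    first2 = rng.sample(step2_methods, min(sample, len(step2_methods)))
    first3 = rng.sample(step3_methods, min(sample, len(step3_methods)))
    run([(a, b) for a in first2 for b in first3])
    best = max(tested.values())

    while True:
        top = sorted(tested, key=tested.get, reverse=True)[:keep]
        kept2 = {a for a, _ in top}
        kept3 = {b for _, b in top}
        # Fix a retained step-2 method and vary step 3, or the other way round.
        untried = [(a, b) for a in kept2 for b in step3_methods if (a, b) not in tested]
        untried += [(a, b) for a in step2_methods for b in kept3 if (a, b) not in tested]
        untried = list(dict.fromkeys(untried))  # drop duplicates, keep order
        if not untried:
            break
        rng.shuffle(untried)
        run(untried[: sample * sample])
        new_best = max(tested.values())
        if new_best <= best:   # stop once performance no longer improves
            break
        best = new_best

    winner = max(tested, key=tested.get)
    return winner, tested[winner], len(tested)


# Hypothetical performance surface used only to exercise the search.
def toy_evaluate(a, b):
    return 1.0 - 0.05 * abs(a - 7) - 0.05 * abs(b - 3)


methods = list(range(1, 11))
winner, score, n_tests = search_combinations(methods, methods, toy_evaluate)
```

Because the loop stops at the first round without improvement, it may settle on a good rather than the best combination; that is the trade-off accepted in exchange for testing far fewer than all 100 pairs.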

FIG. 5 is a flowchart illustrating an operation of configuring prediction nodes according to an embodiment of the present invention.

FIG. 5 shows steps S330 and S340 of FIG. 3 in more detail.

The automatic predictive modeling apparatus 100 divides the initial input data into training data and verification data (S510). To verify the prediction performance of the prediction model, the automatic predictive modeling apparatus 100 separates the training data used to generate the prediction model from the verification data used for verification.

The automatic predictive modeling apparatus 100 converts the training data and the verification data so that each can be used in the prediction model (S520). For each of the training data and the verification data, the automatic predictive modeling apparatus 100 performs preprocessing that includes all or some of the following processes: checking and converting the data type of each variable, handling missing values, identifying and removing outliers, and generating derived variables.

The automatic predictive modeling apparatus 100 takes the training data as input and selects specific algorithms from among at least one algorithm, analyzes the patterns of the training data with those algorithms to select candidate algorithms, and generates a prediction model based on the final algorithm selected by applying the verification data to the candidate algorithms (S530).

The automatic predictive modeling apparatus 100 verifies the prediction model using a preset verification metric and thereby configures the prediction workflow (S540).

Although FIG. 5 describes the steps as being executed sequentially, the method is not necessarily limited thereto. In other words, the steps described in FIG. 5 may be reordered, or one or more steps may be executed in parallel, so FIG. 5 is not limited to a chronological order.

The prediction node configuration method according to the present embodiment described in FIG. 5 may be implemented as an application (or program) and recorded on a recording medium readable by a terminal device (or computer). The recording medium on which the application (or program) for implementing the prediction node configuration method according to the present embodiment is recorded, and which the terminal device (or computer) can read, includes any type of recording device or medium that stores data readable by a computing system.

FIG. 6 is an exemplary diagram illustrating a workflow for automatic predictive modeling according to an embodiment of the present invention.

FIG. 6 shows a lifespan prediction workflow 600 for predicting a patient's remaining lifespan from the results of a physical examination.

The lifespan prediction workflow 600 may consist of five nodes 610, 620, 630, 640, and 650. The first node 610 reads the data to be used for analysis from the DB (Read CSV file), and the second node 620 divides the data into training data and verification data (Partitioner). The third node 630 trains on the training data with a random forest algorithm (Random Forest Regression Learner), and the fourth node 640 uses the trained random forest algorithm to make predictions for the verification data. The fifth node 650 evaluates the performance of the predictions made for the verification data (Regression Score).
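The five-node chain can be sketched as plain functions passed one to the next. To keep the example self-contained, the learner below is a deliberately trivial stand-in that predicts the training mean; it is not a random forest, and the function names, the column names, and the 75% split are assumptions for illustration.

```python
import csv
import io
import math


def read_csv(text):                      # node 610: Read CSV file
    return [
        {k: float(v) for k, v in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


def partition(rows, train_fraction=0.75):  # node 620: Partitioner
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def learn(train, target):                # node 630: learner (stand-in, NOT a random forest)
    return sum(r[target] for r in train) / len(train)


def predict(model, rows):                # node 640: predictor
    return [model for _ in rows]


def score(rows, predictions, target):    # node 650: Regression Score (RMSE here)
    return math.sqrt(
        sum((r[target] - p) ** 2 for r, p in zip(rows, predictions)) / len(rows)
    )


def run_workflow(csv_text, target):
    rows = read_csv(csv_text)
    train, valid = partition(rows)
    model = learn(train, target)
    return score(valid, predict(model, valid), target)


# Hypothetical data: age and remaining years for four patients.
sample = "age,years_left\n50,30\n60,20\n70,10\n80,20\n"
error = run_workflow(sample, "years_left")
```

Each function consumes only the output of the previous one, so any single node can be swapped for another implementation without touching the rest of the chain.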

The automatic predictive modeling apparatus 100 may perform the following process to configure the lifespan prediction workflow 600.

The automatic predictive modeling apparatus 100 performs the process of uploading the analysis data and selecting the task. For example, a user wants to predict the remaining lifespan of each patient from the results of a physical examination; the automatic predictive modeling apparatus 100 acquires source data related to lifespan prediction, and acquires prediction work information in which 'regression prediction' is selected as the prediction task and a lifespan-related variable is the target.

Thereafter, the automatic predictive modeling apparatus 100 performs a prediction node configuration and recommendation process.

The automatic predictive modeling apparatus 100 inserts a node that splits the data into training data and verification data, and automatically determines whether to add nodes required for preprocessing by analyzing whether missing-value handling is needed and whether character-type variables are included.

When data is uploaded to the automatic predictive modeling apparatus 100, the DB system may analyze missing values and variable types. Using this information, when the second-stage node 620 is configured, the selectable nodes are offered as options in a predefined manner. For example, if a missing value is found in a variable when data is uploaded to the automatic predictive modeling apparatus 100, a missing-value processing node may be additionally selected in the detailed process of the second stage. Thereafter, the automatic predictive modeling apparatus 100 automatically selects the several algorithms best suited to regression prediction.
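
The rule described above, in which the result of scanning the uploaded data decides which preprocessing nodes are offered, might look roughly like the following. This is a sketch under stated assumptions: the function name `recommend_preprocessing_nodes`, the representation of the data as a list of dictionaries with `None` marking a missing value, and the node labels "Missing Value Handler" and "Category Encoder" are all hypothetical and do not come from the disclosure.

```python
def recommend_preprocessing_nodes(rows):
    """Suggest nodes for the partitioning/preprocessing stage.

    rows: list of dicts mapping variable name to value; None marks a missing value.
    Node labels other than "Partitioner" are hypothetical.
    """
    values = [v for row in rows for v in row.values()]
    has_missing = any(v is None for v in values)
    has_text = any(isinstance(v, str) for v in values)

    nodes = ["Partitioner"]  # the train/verification split is always inserted
    if has_missing:
        nodes.append("Missing Value Handler")
    if has_text:
        nodes.append("Category Encoder")
    return nodes


if __name__ == "__main__":
    sample = [
        {"age": 61, "sex": "F", "bmi": None},
        {"age": 47, "sex": "M", "bmi": 24.1},
    ]
    print(recommend_preprocessing_nodes(sample))
```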

Thereafter, the automatic predictive modeling apparatus 100 performs a process of selecting an optimization language.

Among nodes that implement the same algorithm in different languages, the automatic predictive modeling apparatus 100 selects the one with the best prediction performance or the fastest speed. For example, if both an R version and a Python version of random forest are available for the third node 630, the automatic predictive modeling apparatus 100 may construct a prediction workflow with each of the R version and the Python version and select the version with the better prediction performance.
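
The selection between language-specific implementations of the same node can be sketched as a comparison over measured candidates. The `Candidate` class, the `select_implementation` function, the `prefer` parameter, and the tie-breaking rule are assumptions made for illustration, and the numbers in the usage example are invented placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    language: str    # implementation language of the node, e.g. "R" or "Python"
    score: float     # prediction performance on verification data, higher is better
    seconds: float   # measured run time of the workflow built with this node


def select_implementation(candidates, prefer="score"):
    """Pick one implementation of a node.

    prefer="score": best prediction performance, ties broken by shorter run time.
    prefer="speed": shortest run time, ties broken by better performance.
    """
    if prefer == "score":
        return max(candidates, key=lambda c: (c.score, -c.seconds))
    if prefer == "speed":
        return min(candidates, key=lambda c: (c.seconds, -c.score))
    raise ValueError("prefer must be 'score' or 'speed'")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    options = [Candidate("R", 0.86, 12.0), Candidate("Python", 0.84, 3.0)]
    print(select_implementation(options).language)
    print(select_implementation(options, prefer="speed").language)
```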

Thereafter, the automatic predictive modeling apparatus 100 performs a workflow visualization process.

The automatic predictive modeling apparatus 100 connects the finally selected nodes to construct a single prediction workflow, converts the constructed prediction workflow into workflow output data, and outputs the workflow output data so that it can be executed.
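
The disclosure does not specify the format of the workflow output data. As one possible illustration, the connected nodes could be serialized to JSON so that a separate engine or a visualization front end can consume them. The function name `to_workflow_output` and the `nodes`/`edges` schema below are assumptions, not part of the described apparatus.

```python
import json


def to_workflow_output(nodes, edges):
    """Serialize a workflow to a JSON string (hypothetical schema).

    nodes: list of (node_id, name) tuples.
    edges: list of (source_id, target_id) tuples connecting the nodes.
    """
    payload = {
        "nodes": [{"id": node_id, "name": name} for node_id, name in nodes],
        "edges": [{"from": src, "to": dst} for src, dst in edges],
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    nodes = [(610, "Read CSV file"), (620, "Partitioner")]
    edges = [(610, 620)]
    print(to_workflow_output(nodes, edges))
```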

FIG. 7 is a diagram showing the configuration of an automatic predictive modeling apparatus according to an embodiment of the present invention.

The automatic predictive modeling apparatus 700 illustrated in FIG. 7 may be implemented as a computing device, and includes at least one processor 710, a computer-readable storage medium 720, and a communication bus 760.

The input unit 110 and the workflow output unit 130 of the automatic predictive modeling apparatus 700 may correspond to the input/output interface 740 or the communication interface 750, and the data analysis unit 120 may correspond to the processor 710.

The processor 710 may perform control so that the computing device operates as the automatic predictive modeling apparatus 700. For example, the processor 710 may execute one or more programs stored in the computer-readable storage medium 720. The one or more programs may include one or more computer-executable instructions, and the computer-executable instructions may be configured, when executed by the processor 710, to cause the automatic predictive modeling apparatus 700 to perform operations according to the exemplary embodiment.

The computer-readable storage medium 720 is configured to store computer-executable instructions, program code, program data, and/or other suitable forms of information. The program 730 stored in the computer-readable storage medium 720 includes a set of instructions executable by the processor 710. In one embodiment, the computer-readable storage medium 720 may be a memory (a volatile memory such as a random access memory, a non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the automatic predictive modeling apparatus 700 and can store the desired information, or a suitable combination thereof.

The communication bus 760 interconnects the various components of the automatic predictive modeling apparatus 700, including the processor 710 and the computer-readable storage medium 720.

The automatic predictive modeling apparatus 700 may also include one or more input/output interfaces 740, which provide an interface for one or more input/output devices, and one or more communication interfaces 750. The input/output interface 740 and the communication interface 750 are connected to the communication bus 760. An input/output device may be connected to the other components of the automatic predictive modeling apparatus 700 through the input/output interface 740.

The above description merely illustrates the technical idea of the embodiments of the present invention, and those of ordinary skill in the art to which the embodiments pertain will be able to make various modifications and variations without departing from the essential characteristics of the embodiments. Accordingly, the embodiments of the present invention are intended to explain, not to limit, the technical idea of the embodiments, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the embodiments of the present invention should be construed according to the following claims, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the embodiments of the present invention.

100: automatic predictive modeling apparatus
110: input unit 120: data analysis unit
130: workflow output unit
210: prediction node configuration unit 220: workflow visualization processing unit

Claims

A method for performing workflow-based automatic predictive modeling in an automatic predictive modeling apparatus, the method comprising:
an input step of obtaining initial input data for source data and prediction work information;
a data analysis step of generating at least one node based on the initial input data, constructing a prediction workflow based on the generated at least one node, and converting the prediction workflow into workflow output data; and
a workflow output step of outputting the converted workflow output data so that automatic predictive modeling is performed.

The method of claim 1,
wherein the input step comprises
obtaining the initial input data including the source data for at least one of continuous numeric variables, character-type variables, and image data, and the prediction work information on a prediction variable and a prediction type determined by a user's manipulation or input.

The method of claim 1,
wherein the data analysis step comprises:
a prediction node configuration step of generating the at least one node and constructing a prediction workflow for automatic prediction based on the at least one node; and
a workflow visualization processing step of converting the prediction workflow into the workflow output data in order to provide it to a user in a visualized form.

The method of claim 3,
wherein the prediction node configuration step comprises:
a verification method configuration step of dividing the initial input data into training data and verification data;
a preprocessing step of converting each of the training data and the verification data into a form usable by a predictive model;
a model learning processing step of selecting a specific algorithm from among at least one algorithm using the training data as input, wherein a candidate algorithm is selected by analyzing a pattern of the training data based on the at least one algorithm, and a predictive model is generated based on a final algorithm selected by applying the verification data to the candidate algorithm; and
a verification processing step of constructing the prediction workflow by verifying the predictive model using a preset verification index.

The method of claim 4,
wherein the preprocessing step comprises
performing, on each of the training data and the verification data, preprocessing that includes all or part of: checking and converting the data type of each variable, handling missing values, identifying and removing outliers, and generating derived variables.

The method of claim 4,
wherein the model learning processing step is performed
by a process comprising: i) receiving the training data as input; ii) maximizing prediction performance by analyzing the pattern of the training data through an algorithm; and iii) applying the verification data to the model generated in the maximizing step to check its performance.

The method of claim 3,
further comprising an optimization language selection step of selecting, in the prediction workflow, an optimal node implemented in an optimization language and thereby selecting a final prediction workflow, so that the final prediction workflow is converted into the workflow output data,
wherein the optimization language selection step comprises:
when there is a node that performs the same function as a given node included in the prediction workflow but is written in a different language, selecting the optimal node implemented in the optimization language in consideration of the connection state of the nodes for performing the function, and selecting the final prediction workflow.

An apparatus for performing workflow-based automatic predictive modeling, the apparatus comprising:
an input unit that obtains initial input data for source data and prediction work information;
a data analysis unit that generates at least one node, constructs a prediction workflow based on the generated at least one node, and converts the prediction workflow into workflow output data; and
a workflow output unit that outputs the converted workflow output data so that automatic predictive modeling is performed.

The apparatus of claim 8,
wherein the data analysis unit comprises:
a prediction node configuration unit that generates the at least one node and constructs a prediction workflow for automatic prediction based on the at least one node; and
a workflow visualization processing unit that converts the prediction workflow into the workflow output data in order to provide it to a user in a visualized form.

The apparatus of claim 9,
wherein the prediction node configuration unit comprises:
a verification method configuration unit that divides the initial input data into training data and verification data;
a preprocessing unit that converts each of the training data and the verification data into a form usable by a predictive model;
a model learning processing unit that selects a specific algorithm from among at least one algorithm using the training data as input, wherein a candidate algorithm is selected by analyzing a pattern of the training data based on the at least one algorithm, and a predictive model is generated based on a final algorithm selected by applying the verification data to the candidate algorithm; and
a verification processing unit that constructs the prediction workflow by verifying the predictive model using a preset verification index.

The apparatus of claim 9,
further comprising an optimization language selection unit that selects, in the prediction workflow, an optimal node implemented in an optimization language and thereby selects a final prediction workflow, so that the final prediction workflow is converted into the workflow output data.

A computer program stored in a recording medium for executing, on a computer, the automatic predictive modeling method according to any one of claims 1 to 7.