KR102469610B1

KR102469610B1 - Data preprocessing system

Info

Publication number: KR102469610B1
Application number: KR1020190179931A
Authority: KR
Inventors: 황윤선; 김성민
Original assignee: 주식회사 포스코아이씨티
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2022-11-21
Also published as: KR20210086175A

Abstract

According to one aspect of the present invention, a middleware system for collecting a large amount of data, capable of performing data preprocessing regardless of an analyst's knowledge and experience, includes: a data storage unit that stores collected data collected by a data collection device; a data preparation unit that loads, into a project serving as a preprocessing workspace, the collected data to be preprocessed from among the collected data stored in the data storage unit; a data refinement unit that refines the loaded collected data; and a derived variable generation unit that includes a plurality of sub derived variable generation units for generating derived variables from the data refined by the data refinement unit, selects at least one of the plurality of sub derived variable generation units, and recommends at least one of a plurality of derived variables generated on the basis of the selected at least one sub derived variable generation unit.

Description

Data preprocessing system {DATA PREPROCESSING SYSTEM}

The present invention relates to a data preprocessing system, and more particularly to a data preprocessing system for data analysis.

Data analysis generally proceeds in the order of data collection, data exploration, data preprocessing, modeling, model verification, and model deployment. Of these steps, data preprocessing consumes the most time and cost.

Conventional data preprocessing systems require the analyst to input programs coded by hand on the basis of the analyst's own knowledge and experience. As a result, preprocessing results can vary greatly depending on that knowledge and experience. Moreover, even a knowledgeable and experienced analyst needs considerable time to perform preprocessing suited to the purpose of the data analysis.

The present invention is intended to solve the above problems, and one technical object is to provide a data preprocessing system capable of reducing data preprocessing time.

Another technical object of the present invention is to provide a data preprocessing system capable of performing data preprocessing regardless of an analyst's knowledge and experience.

To achieve the above objects, a middleware system for collecting a large amount of data according to one aspect of the present invention includes: a data storage unit that stores collected data collected by a data collection device; a data preparation unit that loads, into a project serving as a preprocessing workspace, the collected data to be preprocessed from among the collected data stored in the data storage unit; a data refinement unit that refines the loaded collected data; and a derived variable generation unit that includes a plurality of sub derived variable generation units for generating derived variables from the data refined by the data refinement unit, selects at least one of the plurality of sub derived variable generation units, and recommends at least one of a plurality of derived variables generated on the basis of the selected at least one sub derived variable generation unit.

According to the present invention, derived variables that are highly correlated with the dependent variable are recommended and, furthermore, the order of a plurality of derived variable generation methods for deriving an optimal derived variable is also recommended. This reduces the time a user spends finding the optimal reference variable and the optimal derived variable generation method in order to generate an optimal derived variable.

In addition, according to the present invention, preprocessing is performed automatically, the quality of the preprocessed data is diagnosed, and the diagnosis result is output on a screen, so that the user can easily check the quality of the preprocessed data. If the quality of the preprocessed data is not satisfactory, the user can search for the optimal preprocessed data by changing only the variables. Accordingly, the user can obtain highly reliable preprocessed data even without analysis experience.

FIG. 1 is a diagram showing a data preprocessing system according to the present invention.
FIG. 2 is a block diagram showing the configuration of the project creation unit of FIG. 1.
FIG. 3 is a diagram showing an example of a data quality diagnosis report screen output by the automatic quality diagnosis unit of FIG. 2.
FIGS. 4A and 4B are diagrams showing an example of a selective quality diagnosis report screen output by the selective quality diagnosis unit of FIG. 2.
FIGS. 5A to 5F are diagrams showing a process in which data is refined by the automatic data refinement unit of FIG. 2.
FIG. 6 is a diagram showing an example of a screen output by the normalization conversion unit of FIG. 2.
FIG. 7 is a diagram showing an example of a derived variable generated by the interaction variable generation unit of FIG. 2.
FIG. 8 is a diagram showing an example of a derived variable generated by the maximum correlation lag variable generation unit of FIG. 2.
FIG. 9 is a flowchart showing an operating method of the preprocessing automation unit of FIG. 1.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The meaning of the terms used in this specification should be understood as follows.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms.

Terms such as "include" or "have" should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The term "at least one" should be understood to include all combinations that can be presented from one or more related items. For example, "at least one of a first item, a second item, and a third item" means not only each of the first item, the second item, or the third item, but also any combination of items that can be presented from two or more of the first item, the second item, and the third item.

FIG. 1 is a diagram showing a data preprocessing system according to the present invention.

As shown in FIG. 1, the data preprocessing system 1000 according to the present invention includes a data management unit 100 and a project management unit 200.

The data management unit 100 stores and manages the data to be preprocessed. Here, the data to be preprocessed may be data collected by a data collection device (not shown). In one embodiment, the data collection device may collect micro data generated in the course of various processes. Here, micro data means raw data, that is, the data itself as collected through various sensors and the like. Hereinafter, for convenience of description, micro data is referred to as collected data.

The data collection device includes various measuring instruments, sensors, actuators, and the like for collecting micro data. The data collection device may further include a P/C, a PLC (Programmable Logic Controller), a DCS (Distributed Control System), and the like that integrate or control the data collected by the measuring instruments, sensors, and actuators.

As an example, the data collection device may collect data generated in a continuous process. A continuous process is a process in which a plurality of processes for producing a finished product from raw materials are performed in succession, and the outputs of the individual processes are mixed with one another, or the output of a specific process changes state, before being supplied to a subsequent process. The steel process is a representative example of such a continuous process. The steel process may consist of various processes such as ironmaking, steelmaking, continuous casting, and rolling. When the data collection device is applied to the steel process, it may collect micro data generated in the course of these ironmaking, steelmaking, continuous casting, and rolling processes.

The data management unit 100 includes a data storage unit 110 and a data information editing unit 120. The data storage unit 110 stores the data to be preprocessed, for example, the collected data collected by the data collection device.

The data information editing unit 120 edits information such as the format and permissions of the collected data stored in the data storage unit 110. The data information editing unit 120 may edit this information in response to user input.

The project management unit 200 creates a project, which is a preprocessing workspace, and manages the at least one project that has been created. The project management unit 200 includes a project creation unit 210 and a project information editing unit 220.

The project creation unit 210 creates a project, which is a preprocessing workspace. The project creation unit 210 may create a new project at the user's request. The project creation unit 210 may provide various functions so that the user can easily perform data preprocessing in the created project.

To this end, the project creation unit 210 may include at least one of a data preparation unit 211, a data diagnosis unit 212, a data merging unit 213, a data refinement unit 214, a derived variable generation unit 215, and a preprocessing automation unit 216. The components of the project creation unit 210 are described in detail with reference to FIG. 2.

The project information editing unit 220 edits information such as the name and permissions of the project created by the project creation unit 210. The project information editing unit 220 may edit this information in response to user input.

Hereinafter, the components of the project creation unit 210 are described in detail with reference to FIG. 2.

FIG. 2 is a block diagram showing the configuration of the project creation unit of FIG. 1.

Referring to FIG. 2, the project creation unit 210 includes a data preparation unit 211, a data diagnosis unit 212, a data merging unit 213, a data refinement unit 214, a derived variable generation unit 215, and a preprocessing automation unit 216.

The data preparation unit 211 prepares the collected data to be preprocessed in the corresponding project. To this end, the data preparation unit 211 may include a data loading unit 311 and a data changing unit 312.

The data loading unit 311 reads the collected data to be preprocessed, from among the collected data stored in the data storage unit 110, into the corresponding project. The data loading unit 311 may do so at the user's request. In this case, the collected data to be preprocessed may be selected by the user.

The data changing unit 312 may change the collected data loaded by the data loading unit 311. The data changing unit 312 may change the variable names, types, and the like of the collected data in response to user input.

Next, the data diagnosis unit 212 automatically diagnoses the quality of the collected data to be preprocessed that was loaded by the data loading unit 311, and performs primary preprocessing based on the diagnosis result.

To this end, the data diagnosis unit 212 may include an automatic quality diagnosis unit 321, a selective quality diagnosis unit 322, a character ratio determination unit 323, a blank ratio determination unit 324, a data duplication checking unit 325, and a correlation coefficient reporting unit 326.

The automatic quality diagnosis unit 321 automatically diagnoses the quality of the collected data to be preprocessed that was loaded by the data loading unit 311. As shown in FIG. 3, the automatic quality diagnosis unit 321 may output a data quality diagnosis report screen containing a plurality of items so that the user can check the diagnosis result.

The plurality of items may include at least two of a basic data information item 301, a basic statistics information item 302, a variable list by type item 303, a character ratio in variable item 304, a blank ratio in variable item 305, and a duplicate rows and columns item 306.

The basic data information item 301 may provide the number of rows and columns of the collected data to be preprocessed. From this information, the user can tell the size of the collected data to be preprocessed.

The basic statistics information item 302 may provide, for each variable, the number of missing values, the mean, the standard deviation, the minimum, the values at the 25%, 50%, and 75% positions, and the maximum. From this information, the user can check the distribution of the collected data to be preprocessed numerically.
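The statistics listed for item 302 can be sketched for a single numeric variable as follows. This is a minimal illustration, not the patented implementation: the function name `basic_statistics`, the use of `None` to mark a missing value, and the choice of the sample standard deviation and the "inclusive" quantile method are all assumptions.

```python
import statistics

def basic_statistics(values):
    """Summarize one numeric variable; None marks a missing value (assumption)."""
    present = sorted(v for v in values if v is not None)
    # Quartile method is an assumption; the source only names the 25/50/75% positions.
    q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (assumption)
        "min": present[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": present[-1],
    }

summary = basic_statistics([1, 2, None, 3, 4, 5])
```

The sketch needs at least two non-missing values, because neither the standard deviation nor the quartiles are defined for fewer.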

The variable list by type item 303 may provide separate lists of numeric-type variables, character-type variables, and single variables. Using the lists of numeric-type and character-type variables, the user can check whether each variable has the type the user expects. The user can also review the single variables listed in item 303 and decide whether to delete them. A single variable is a variable that holds only one value. Such a variable has very low importance and reliability, so it is very unlikely to be useful. Through the data quality diagnosis report screen provided by the automatic quality diagnosis unit 321, the user can easily see which variables are single variables and can delete them during preprocessing.

The character ratio in variable item 304 may provide the ratio of actual character values in each variable recognized as a character type. From this information, the user can check whether a variable recognized as a character type has been given the correct type.

The blank ratio in variable item 305 may provide information on the number and ratio of blanks contained in each variable. From this information, the user can tell whether blanks are embedded in the values without inspecting the collected data one value at a time.
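A minimal sketch of the blank count and ratio reported by item 305 is given below. The source does not say whether blanks are counted per character or per value; this sketch assumes the latter, and `blank_report` is a hypothetical name.

```python
def blank_report(values):
    """Count the values of one variable that contain at least one space."""
    with_blank = [v for v in values if isinstance(v, str) and " " in v]
    ratio = len(with_blank) / len(values) if values else 0.0
    return {"count": len(with_blank), "ratio": ratio}

report = blank_report(["One", "T wo", "Three", "Four"])
```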

The duplicate rows and columns item 306 may provide a list of duplicate rows and a list of duplicate columns. From this information, the user can determine whether entire rows are duplicated, or whether columns with different variable names actually hold the same values.
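One way to build the two lists of item 306 is sketched below, assuming rows are given as tuples and columns as a name-to-values mapping. Both function names and the data layout are illustrative assumptions.

```python
def duplicate_rows(rows):
    """Return the indices of rows that repeat an earlier row."""
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

def duplicate_columns(columns):
    """Return (first_name, duplicate_name) pairs for columns with identical values."""
    first_seen, pairs = {}, []
    for name, values in columns.items():
        key = tuple(values)
        if key in first_seen:
            pairs.append((first_seen[key], name))
        else:
            first_seen[key] = name
    return pairs

rows = [(1, "a"), (2, "b"), (1, "a")]
columns = {"X1": [1, 2, 3], "X2": [4, 5, 6], "X1_copy": [1, 2, 3]}
```

Here `duplicate_columns` compares values only, so it catches the case described above in which the variable names differ but the contents are the same.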

The selective quality diagnosis unit 322 selectively diagnoses the quality of the collected data to be preprocessed that was loaded by the data loading unit 311. The selective quality diagnosis unit 322 may output a selective quality diagnosis screen containing only some of the plurality of items provided by the automatic quality diagnosis unit 321.

In one embodiment, the selective quality diagnosis unit 322 may provide only the items selected by the user from among the plurality of items provided by the automatic quality diagnosis unit 321. As shown in FIG. 4A, the selective quality diagnosis unit 322 may output a diagnosis function selection screen on which the user can select at least one of a plurality of items such as basic data information, basic statistics information, variable list by type, character ratio in variable, blank ratio in variable, and duplicate rows and columns. When the user selects at least one of the items, the selective quality diagnosis unit 322 may output a selective quality diagnosis report screen containing the selected items, as shown in FIG. 4B.

When the collected data to be preprocessed is large, the automatic quality diagnosis unit 321 may take a long time to process the information for all of the items. Because the selective quality diagnosis unit 322 processes only the items selected by the user, it can complete faster.
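The time saving comes from computing only what was asked for. A rough sketch of that idea, assuming a hypothetical registry `DIAGNOSTICS` that maps each report item to the function computing it (the item names here are illustrative, not taken from the source):

```python
# Hypothetical registry: report item name -> function that computes it.
DIAGNOSTICS = {
    "basic_info": lambda cols: {
        "rows": len(next(iter(cols.values()), [])),
        "columns": len(cols),
    },
    "missing_counts": lambda cols: {
        name: sum(v is None for v in values) for name, values in cols.items()
    },
}

def selective_report(columns, selected):
    """Compute only the report items the user selected."""
    return {item: DIAGNOSTICS[item](columns) for item in selected}

columns = {"X1": [1, None, 3], "X2": ["a", "b", "c"]}
report = selective_report(columns, ["basic_info"])
```

An automatic report in this scheme would simply pass every key of the registry as `selected`.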

The character ratio determination unit 323 determines whether to change the type of a variable based on the information provided in the character ratio in variable item 304. If the character ratio in a variable is low, the character ratio determination unit 323 may delete the character values present in that variable and change the variable's type from character to numeric.

In one embodiment, if the character ratio in a variable is smaller than a preset threshold, the character ratio determination unit 323 may delete the character values present in that variable and change the variable's type from character to numeric.
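The threshold rule can be sketched as below. The threshold value of 0.5, the helper names, and the treatment of any value that `float()` cannot parse as a "character value" are assumptions made for illustration.

```python
def is_number(value):
    """True when the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def character_ratio(values):
    """Share of values in one variable that are not numeric."""
    return sum(1 for v in values if not is_number(v)) / len(values)

def convert_if_mostly_numeric(values, threshold=0.5):
    """Drop character values and cast to float when the ratio is below threshold."""
    if character_ratio(values) >= threshold:
        return values  # remains a character-type variable
    return [float(v) for v in values if is_number(v)]

x4 = ["1.5", "2.0", "Error", "3.5"]
converted = convert_if_mostly_numeric(x4)
```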

In another embodiment, the character ratio determination unit 323 may, at the user's request, delete the character values present in a variable and change the variable's type from character to numeric.

The blank ratio determination unit 324 determines whether to delete the blanks present in a variable based on the information provided in the blank ratio in variable item 305. The blank ratio determination unit 324 may delete blanks found in a variable that should not contain blanks.

The data duplication checking unit 325 determines whether to delete duplicate rows or duplicate columns based on the information provided in the duplicate rows and columns item 306. If a duplicate column exists, the data duplication checking unit 325 may delete it. If a duplicate row exists, the data duplication checking unit 325 may delete it. In one embodiment, the data duplication checking unit 325 may delete duplicate rows at the user's request.

The correlation coefficient reporting unit 326 provides the correlation coefficients between the independent variables and the dependent variables contained in the collected data to be preprocessed that was loaded by the data loading unit 311.
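The source does not name the correlation measure. As an illustration only, the sketch below assumes the Pearson coefficient and reports it for each independent variable against one dependent variable; `pearson` and `correlation_report` are hypothetical names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_report(independent, dependent):
    """Map each independent variable name to its correlation with the dependent variable."""
    return {name: pearson(values, dependent) for name, values in independent.items()}

independent = {"X1": [1, 2, 3, 4], "X2": [4, 3, 2, 1]}
y = [2, 4, 6, 8]
report = correlation_report(independent, y)
```

A constant variable makes the denominator zero, so single variables would need to be removed or skipped before calling `pearson`.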

Next, the data merging unit 213 merges two or more sets of collected data from among the collected data to be preprocessed that was loaded by the data loading unit 311. The data merging unit 213 may merge two or more sets of collected data selected by the user.

Next, the data refinement unit 214 refines the collected data to be preprocessed. The data refinement unit 214 may remove data noise from the collected data and correct inconsistent data.

Data preprocessing generally proceeds in the order of data preparation, data diagnosis, data merging, data refinement, and derived variable generation. In the data preprocessing system 1000 according to an embodiment of the present invention, this order may be changed by the user. The user can freely decide the order of data preparation, data diagnosis, data merging, data refinement, and derived variable generation. In one embodiment, the data preprocessing system 1000 may carry out preprocessing in a user-defined order of data preparation, data diagnosis, data refinement, data merging, and derived variable generation. Accordingly, the collected data refined by the data refinement unit 214 may differ depending on the order of processing.

In one embodiment, the data refinement unit 214 may refine the collected data to be preprocessed that was loaded by the data loading unit 311. In another embodiment, the data refinement unit 214 may refine collected data that has undergone primary preprocessing by the data diagnosis unit 212. In yet another embodiment, the data refinement unit 214 may refine collected data merged by the data merging unit 213.

The data refinement unit 214 may include an automatic data refinement unit 341.

The automatic data refinement unit 341 refines the collected data to be preprocessed according to a preset processing order. The automatic data refinement unit 341 may carry out data refinement by executing a plurality of sub data refinement units in the preset processing order. The plurality of sub data refinement units may include at least one of a first sub data refinement unit that recognizes the type of each variable, a second sub data refinement unit that handles missing values, a third sub data refinement unit that removes single variables, a fourth sub data refinement unit that handles outliers, a fifth sub data refinement unit that removes blanks, and a sixth sub data refinement unit that removes duplicate rows and duplicate columns.

In one embodiment, the automatic data refinement unit 341 may refine the collected data by executing the first, second, third, fourth, fifth, and sixth sub data refinement units in that order, but the order is not necessarily limited thereto.
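Executing refinement steps in a configurable order can be sketched as a list of named functions applied one after another. The step functions below are simplified stand-ins that work on a single list of values, and all names are hypothetical; the point is only that reordering the list changes the order of execution.

```python
def run_pipeline(data, steps):
    """Apply each (name, function) step in order and record the order used."""
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)
    return data, log

def drop_missing(values):
    """Stand-in for a missing-value step: remove None entries."""
    return [v for v in values if v is not None]

def strip_blanks(values):
    """Stand-in for a blank-removal step: remove spaces inside strings."""
    return [v.replace(" ", "") if isinstance(v, str) else v for v in values]

steps = [("drop_missing", drop_missing), ("strip_blanks", strip_blanks)]
cleaned, log = run_pipeline(["O K", None, "N G"], steps)
```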

As an example, the automatic data refinement unit 341 may refine collected data such as that shown in FIG. 5A.

First, the automatic data refinement unit 341 may execute the first sub data refinement unit. The first sub data refinement unit may determine the type of each variable to be one of a date type, a character type, and a numeric type. For each variable, the first sub data refinement unit may classify a variable collected in a date format as the date type, a variable in which the rows holding character values make up 50% or more of all rows as the character type, and a variable in which such rows make up less than 50% of all rows as the numeric type.

If the determined type of a variable is numeric, the first sub data refinement unit may delete the character values present in that numeric-type variable. In doing so, the first sub data refinement unit may delete the rows that contain the character values. As shown in FIG. 5B, the first sub data refinement unit may delete the row containing the character value 'Error' in the numeric-type variable 'X4'.
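The type rule and the row deletion of the first sub data refinement unit can be sketched as follows. The date format `%Y-%m-%d`, the helper names, and the sample rows are assumptions; only the 50% cutoff and the 'Error' value in 'X4' come from the description.

```python
from datetime import datetime

def is_date(value, fmt="%Y-%m-%d"):
    """True when the value matches the assumed date format."""
    try:
        datetime.strptime(str(value), fmt)
        return True
    except ValueError:
        return False

def is_number(value):
    """True when the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def infer_type(values):
    """Classify one variable as 'date', 'character', or 'numeric'."""
    if all(is_date(v) for v in values):
        return "date"
    char_rows = sum(1 for v in values if not is_number(v))
    return "character" if char_rows / len(values) >= 0.5 else "numeric"

def drop_character_rows(rows, column):
    """Delete rows whose value in a numeric-type column is not a number."""
    return [row for row in rows if is_number(row[column])]

rows = [
    {"X4": "1.2", "X6": "One"},
    {"X4": "Error", "X6": "Two"},
    {"X4": "3.4", "X6": "Three"},
]
x4_type = infer_type([r["X4"] for r in rows])
cleaned = drop_character_rows(rows, "X4") if x4_type == "numeric" else rows
```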

Next, the automatic data refinement unit 341 may execute the second sub-data refinement unit. The second sub-data refinement unit may check the ratio of missing values in each variable and delete the variable if the ratio is greater than a preset value, for example 60%. As shown in FIG. 5C, the second sub-data refinement unit may delete the variable 'X3', whose ratio of missing values is 60% or more. Furthermore, the second sub-data refinement unit may also delete rows in which missing values exist.
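A rough sketch of the two missing-value steps is given below. It assumes a table stored as a dictionary mapping variable names to equal-length lists, with `None` standing for a missing value; both the representation and the function names are illustrative only.

```python
def drop_sparse_columns(table, max_missing_ratio=0.6):
    """Delete variables whose missing-value ratio exceeds the preset value."""
    kept = {}
    for name, values in table.items():
        ratio = sum(v is None for v in values) / max(len(values), 1)
        if ratio <= max_missing_ratio:
            kept[name] = values
    return kept


def drop_rows_with_missing(table):
    """Delete every row that still contains a missing value."""
    columns = list(table.values())
    n_rows = len(columns[0]) if columns else 0
    keep = [i for i in range(n_rows) if all(c[i] is not None for c in columns)]
    return {name: [col[i] for i in keep] for name, col in table.items()}
```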

Next, the automatic data refinement unit 341 may execute the third sub-data refinement unit. The third sub-data refinement unit may delete a single variable, that is, a variable holding only one value. As shown in FIG. 5D, the third sub-data refinement unit may delete the variable 'X5', which holds only the value '3'.

Next, the automatic data refinement unit 341 may execute the fourth sub-data refinement unit. The fourth sub-data refinement unit may remove outliers. The fourth sub-data refinement unit may delete values determined to be outliers by a specific method, for example the Z-score method.
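As an illustration of the Z-score method mentioned above, the sketch below flags values whose absolute Z-score exceeds a cutoff. The cutoff of 3.0 and the use of the population standard deviation are assumptions; the source does not state either.

```python
from statistics import mean, pstdev


def zscore_outlier_mask(values, threshold=3.0):
    """Return True for each value whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable has no outliers under this definition.
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


def drop_outliers(values, threshold=3.0):
    mask = zscore_outlier_mask(values, threshold)
    return [v for v, is_outlier in zip(values, mask) if not is_outlier]
```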

Next, the automatic data refinement unit 341 may execute the fifth sub-data refinement unit. The fifth sub-data refinement unit may remove whitespace from character-type variables. As shown in FIG. 5E, the fifth sub-data refinement unit may delete the space in 'T wo' among the values of the character-type variable 'X6'.

Next, the automatic data refinement unit 341 may execute the sixth sub-data refinement unit. The sixth sub-data refinement unit may delete duplicated columns and duplicated rows. As shown in FIG. 5F, the sixth sub-data refinement unit may delete 'X7', which is one of the duplicated columns 'X6' and 'X7'.
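Duplicate removal can be sketched as below, again assuming a table held as a dictionary of equal-length lists with hashable values. Keeping the first occurrence and dropping later copies is an assumption that happens to match the 'X6' and 'X7' example.

```python
def drop_duplicate_columns(table):
    """Keep the first of any set of columns with identical contents."""
    seen, kept = set(), {}
    for name, values in table.items():
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            kept[name] = values
    return kept


def drop_duplicate_rows(table):
    """Keep the first occurrence of each distinct row."""
    names = list(table)
    seen, rows = set(), []
    for row in zip(*table.values()):
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}
```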

The automatic data refinement unit 341 may refine the collected data to be preprocessed according to a preset processing order and provide the refined data to the user. Accordingly, the data preprocessing system 100 according to an embodiment of the present invention allows a user to obtain refined data even without analysis experience. In addition, the data preprocessing system 100 according to an embodiment of the present invention can obtain highly reliable refined data regardless of the user's experience.

Meanwhile, the data preprocessing system 100 according to an embodiment of the present invention not only performs data refinement automatically but also allows the user to perform data refinement directly. To this end, the data refinement unit 214 may further include at least one of a single variable deletion unit 342, a missing value deletion unit 343, a missing value replacement unit 344, an outlier deletion unit 345, and an outlier replacement unit 346.

The single variable deletion unit 342 may delete a single variable, that is, a variable holding only one value. The single variable deletion unit 342 may perform this deletion when a user request is input.

The missing value deletion unit 343 may check the ratio of missing values in a variable and delete the variable if the ratio is greater than a preset value, for example 60%. The missing value deletion unit 343 may also delete rows in which missing values exist. The missing value deletion unit 343 may delete missing values in response to a user request.

The missing value replacement unit 344 may replace missing values with a predetermined value in response to a user request. In one embodiment, the missing value replacement unit 344 may replace a missing value with a value specified by the user. In another embodiment, the missing value replacement unit 344 may replace a missing value with the most frequent value in the corresponding variable. In yet another embodiment, the missing value replacement unit 344 may replace a missing value with the average of the values in the corresponding variable. In yet another embodiment, the missing value replacement unit 344 may replace a missing value with a direct value within the corresponding variable.
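Three of the replacement embodiments above (user-specified value, most frequent value, and average) can be sketched in a few lines. The strategy names and the use of `None` for a missing value are assumptions of this example, and the "direct value" embodiment is left out because the source does not define it further.

```python
from statistics import mean, mode


def fill_missing(values, strategy, fill_value=None):
    """Replace None entries according to the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        replacement = fill_value      # value specified by the user
    elif strategy == "mode":
        replacement = mode(present)   # most frequent value in the variable
    elif strategy == "mean":
        replacement = mean(present)   # average of the observed values
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]
```

The same function shape would serve the outlier replacement unit 346 if outliers were first marked as `None`.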

The outlier deletion unit 345 may remove outliers in response to a user request. The outlier deletion unit 345 may delete values determined to be outliers by a specific method, for example the Z-score method.

The outlier replacement unit 346 may replace outliers with a predetermined value in response to a user request. In one embodiment, the outlier replacement unit 346 may replace an outlier with a value specified by the user. In another embodiment, the outlier replacement unit 346 may replace an outlier with the most frequent value in the corresponding variable. In yet another embodiment, the outlier replacement unit 346 may replace an outlier with the average of the values in the corresponding variable. In yet another embodiment, the outlier replacement unit 346 may replace an outlier with a direct value within the corresponding variable.

Next, the derived variable generation unit 216 generates new derived variables based on the variables included in the data and recommends, among the generated derived variables, those having a strong correlation with the dependent variable. The data may be data refined by the data refinement unit 214, but is not necessarily limited thereto. The data may instead be the collected data to be preprocessed as loaded by the data loading unit 311. The data preprocessing system 100 according to an embodiment of the present invention may perform the preprocessing steps in the order requested by the user. If the user wishes to generate derived variables first and refine the data afterwards, the derived variable generation unit 216 may generate derived variables based on the collected data loaded by the data loading unit 311. However, derived variables are more reliable when generated from data refined by the data refinement unit 214. Accordingly, it is preferable that the derived variable generation unit 216 generate derived variables based on the data refined by the data refinement unit 214 (hereinafter referred to as 'refined data'). For convenience of description, the following assumes that the derived variable generation unit 216 generates derived variables based on the refined data.

Meanwhile, the derived variable generation unit 216 may include a plurality of sub derived variable generation units 350 and a derived variable recommendation unit 370.

Each of the plurality of sub derived variable generation units 350 may generate derived variables in a predetermined manner based on the collected data or the refined data. The plurality of sub derived variable generation units 350 may include at least two of a normalization transformation unit 351, a cumulative sum variable generation unit 352, a variable division unit 353, a variable combination unit 354, a variable categorization unit 355, a dummy variable generation unit 356, an interaction variable generation unit 357, a maximum correlation lag variable generation unit 358, an influence factor detection unit 359, a large category clustering unit 361, and a stratification variable proposal unit 362.

The normalization transformation unit 351 generates derived variables by performing a normalization transformation on the refined data. The normalization transformation unit 351 may transform the distribution of a numeric-type variable in the refined data toward a normal distribution. Before performing the transformation, the normalization transformation unit 351 may output to the screen graphs of the normalization transformation under a plurality of methods. For example, as shown in FIG. 6, the normalization transformation unit 351 may output a log transformation graph, a square root transformation graph, a square transformation graph, an exponential transformation graph, and a reciprocal transformation graph on the screen together with the graph before transformation.

In one embodiment, the normalization transformation unit 351 may perform on the refined data the normalization transformation selected by the user. After checking the graphs output by the normalization transformation unit 351 for the plurality of methods, the user may select one of the plurality of normalization transformations. The normalization transformation unit 351 may then transform the distribution of the numeric-type variable in the refined data according to the normalization transformation selected by the user.

In another embodiment, the normalization transformation unit 351 may recommend one of the plurality of normalization transformations for the refined data. The normalization transformation unit 351 may perform a normality test on the variables of the refined data. For example, the normalization transformation unit 351 may perform a Lilliefors test or a Kolmogorov-Smirnov test on the variables of the refined data.

The normalization transformation unit 351 may check whether the applicability condition of each of the plurality of normalization transformations is satisfied. For example, the normalization transformation unit 351 may skip the log transformation and the square root transformation when the data contain negative numbers. As another example, the normalization transformation unit 351 may skip the reciprocal transformation when the data contain 0.

The normalization transformation unit 351 may apply to the refined data each normalization transformation whose condition is satisfied and perform the normality test on the result. After performing the normality test for each of these transformations, the normalization transformation unit 351 may select the transformation for which the significance level of the test is 0.05 or greater and the test statistic is the smallest. The normalization transformation unit 351 may recommend the selected normalization transformation to the user. The normalization transformation unit 351 may also generate a derived variable according to the recommended normalization transformation.
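The applicability check and the "smallest statistic" ranking can be sketched as follows. This is a simplified illustration under several assumptions: it computes only the Kolmogorov-Smirnov distance against a normal distribution fitted to the sample (the statistic used by the Lilliefors test) and omits the 0.05 significance filter, which would normally come from a statistics library. The extra guard that skips the log transformation when a value is 0 is also an addition of this example.

```python
import math
from statistics import mean, stdev

# name -> (transformation, applicability condition on the raw data)
TRANSFORMS = {
    "log": (math.log, lambda xs: min(xs) > 0),
    "sqrt": (math.sqrt, lambda xs: min(xs) >= 0),
    "square": (lambda x: x * x, lambda xs: True),
    "exp": (math.exp, lambda xs: True),
    "reciprocal": (lambda x: 1 / x, lambda xs: 0 not in xs),
}


def ks_statistic(xs):
    """KS distance between the sample and a normal fitted to it."""
    mu, sigma = mean(xs), stdev(xs)
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def rank_transforms(xs):
    """Return applicable transformation names, smallest statistic first."""
    scores = {}
    for name, (fn, applicable) in TRANSFORMS.items():
        if not applicable(xs):
            continue
        try:
            scores[name] = ks_statistic([fn(x) for x in xs])
        except (OverflowError, ZeroDivisionError):
            continue  # transformation not usable on this data
    return sorted(scores, key=scores.get)
```

The first name returned by `rank_transforms` plays the role of the recommended transformation in this sketch.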

The cumulative sum variable generation unit 352 generates a derived variable holding accumulated values based on the refined data. Depending on the purpose of the analysis, accumulated values may be meaningful. For example, when analyzing electric energy, if the energy variable represents the energy collected at a given point in time, the sum of the energy over a certain period may be meaningful.

The cumulative sum variable generation unit 352 may generate a derived variable whose value in each row is the running total of an existing variable up to that row.
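A row-wise running total of this kind can be illustrated with the standard library. The function name and the sample energy readings are made up for the example.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Return a new variable whose i-th entry is the sum of values[0..i]."""
    return list(accumulate(values))


# Hypothetical per-interval energy readings and their running total.
energy = [3, 5, 2, 4]
energy_cumulative = cumulative_sum(energy)
```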

The variable division unit 353 generates a derived variable by dividing an existing variable in the refined data by a delimiter or by a specific number of digits. For example, when a specific month is meaningful, the variable division unit 353 may generate a 'month' derived variable from a date-type variable having the form 'year/month/day hour:minute:second'.
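The month example and the two splitting criteria can be sketched as below. The format string is an assumption based on the 'year/month/day hour:minute:second' form, and the function names are illustrative.

```python
from datetime import datetime


def split_month(values, fmt="%Y/%m/%d %H:%M:%S"):
    """Derive a 'month' variable from date-type values."""
    return [datetime.strptime(v, fmt).month for v in values]


def split_by_delimiter(value, delimiter):
    """Split one value on a delimiter."""
    return value.split(delimiter)


def split_by_position(value, n_chars):
    """Split one value after a fixed number of characters."""
    return value[:n_chars], value[n_chars:]
```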

The variable combination unit 354 may generate a derived variable by combining two variables in the refined data. The variable combination unit 354 may convert the values of each of the two variables to the character type and then concatenate them to generate the derived variable. In one embodiment, the variable combination unit 354 may generate a derived variable by combining character-type variables. This is because combining character-type variables is more likely to produce a meaningful derived variable than combining numeric-type variables.

The variable categorization unit 355 may generate a derived variable of a categorical type based on the refined data. Among numeric-type variables, the variable categorization unit 355 may generate a categorical derived variable for a variable whose numeric values carry no meaning in themselves and which has a small number of unique values.

The dummy variable generation unit 356 generates derived variables in the form of variables having a value of 0 or 1 based on the refined data.
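One common way to obtain 0/1 variables is to create one indicator per category, sketched below. The source only states that the derived variable takes the value 0 or 1, so the one-indicator-per-category scheme and the `prefix_category` naming are assumptions.

```python
def make_dummies(values, prefix):
    """Create one 0/1 indicator variable per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }
```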

The interaction variable generation unit 357 generates derived variables by performing an operation on two or more independent variables in the refined data. The interaction variable generation unit 357 may generate a derived variable by multiplying two or more numeric-type independent variables, or by taking their average or their median.

The interaction variable generation unit 357 may newly generate a plurality of independent variables by multiplying two or more independent variables, or by taking their average or median. The interaction variable generation unit 357 may store the generated independent variables in a new data frame. The interaction variable generation unit 357 may calculate, for each of the generated independent variables, its correlation coefficient with the dependent variable. The interaction variable generation unit 357 may determine as derived variables those independent variables whose correlation coefficient with the dependent variable is greater than or equal to a preset threshold.

For example, referring to FIG. 7, when the existing data are as shown in (a), the interaction variable generation unit 357 may multiply the independent variable 'A' by each of the independent variables 'B' and 'C' to generate new independent variables 'A*B' and 'A*C' as shown in (b). The interaction variable generation unit 357 may calculate the correlation coefficient with the dependent variable 'Y' for each of the independent variables 'A*B' and 'A*C'. When the correlation coefficient between 'A*B' and 'Y' is 0.6 and the correlation coefficient between 'A*C' and 'Y' is 0.85, the interaction variable generation unit 357 may determine the independent variable 'A*B' as a derived variable.

In addition, the interaction variable generation unit 357 may multiply the independent variable 'B' by the independent variable 'C' to generate a new independent variable 'B*C' as shown in (d). The interaction variable generation unit 357 may calculate the correlation coefficient between the independent variable 'B*C' and the dependent variable 'Y'. When the correlation coefficient between 'B*C' and 'Y' is 0.9, the interaction variable generation unit 357 may determine the independent variable 'B*C' as a derived variable.
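The pairwise product and threshold rule can be sketched as below. Only the multiplication variant is shown, the Pearson coefficient is assumed as the correlation measure, and the threshold of 0.8 is an arbitrary illustrative value; none of these choices is fixed by the source. Constant variables are assumed to have been removed during refinement.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def interaction_variables(independents, dependent, threshold=0.8):
    """Keep pairwise products whose correlation with Y meets the threshold."""
    selected = {}
    for (a, xs), (b, ys) in combinations(independents.items(), 2):
        product = [x * y for x, y in zip(xs, ys)]
        if pearson(product, dependent) >= threshold:
            selected[f"{a}*{b}"] = product
    return selected
```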

The maximum correlation lag variable generation unit 358 generates a derived variable to which a time lag is applied based on the refined data. The maximum correlation lag variable generation unit 358 may calculate the correlation coefficient with the dependent variable while time-shifting the entire independent variable, and generate as a derived variable the independent variable at the lag with the highest correlation coefficient.

For example, referring to FIG. 8, when the existing data are as shown in (a), the maximum correlation lag variable generation unit 358 may calculate the correlation coefficient between the independent variable 'X1' and the dependent variable 'Y'. Here, the correlation coefficient between 'X1' and 'Y' may be 0.3. The maximum correlation lag variable generation unit 358 may then time-shift the entire independent variable 'X1' by 0:01 as shown in (b) and calculate the correlation coefficient between 'X1' at lag 0:01 and 'Y'. The maximum correlation lag variable generation unit 358 may further time-shift the entire independent variable 'X1' by another 0:01 as shown in (c) and calculate the correlation coefficient between 'X1' at lag 0:02 and 'Y'. If the correlation coefficient between 'X1' and 'Y' is 0.4 at lag 0:01 and 0.6 at lag 0:02, the maximum correlation lag variable generation unit 358 may generate 'X1' at lag 0:02 as a derived variable.
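The lag search can be sketched as follows, treating the lag as a whole number of rows. The shift direction (the independent variable leading the dependent variable), the maximum lag, and the use of the Pearson coefficient are assumptions of this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def best_lag(x, y, max_lag):
    """Return (lag, correlation) for the row shift of x that best matches y."""
    best = (None, float("-inf"))
    for lag in range(max_lag + 1):
        shifted_x = x[: len(x) - lag]  # x observed `lag` rows earlier
        aligned_y = y[lag:]            # y rows that have a shifted partner
        r = pearson(shifted_x, aligned_y)
        if r > best[1]:
            best = (lag, r)
    return best
```

The derived variable would then be `x` shifted by the returned lag, with the unmatched leading rows left empty or dropped.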

The influence factor detection unit 359 detects and recommends the set of independent variables in the refined data that affect the dependent variable. In one embodiment, the influence factor detection unit 359 may use a Markov blanket algorithm to extract the set of independent variables that affect the dependent variable.

The large category clustering unit 361 re-clusters a categorical variable having a large number of unique values by dividing its categories into large categories and small categories. The large category clustering unit 361 may distinguish large categories from small categories in consideration of the number of entities in each category. The large category clustering unit 361 may then gather the data entities belonging to each category into an initial cluster and calculate its center coordinates. Next, the large category clustering unit 361 may calculate a distance matrix between the center coordinates of the small categories and the center coordinates of the large categories. Finally, the large category clustering unit 361 may merge each small category into the nearest large category.
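The four steps above can be sketched as below. The example assumes each entity has a numeric feature vector, uses a count cutoff (`min_count`) to separate large from small categories, and uses Euclidean distance between centroids; the source specifies none of these details. Distances are computed on demand rather than stored as a matrix.

```python
from collections import defaultdict
from math import dist


def merge_small_categories(labels, points, min_count):
    """Relabel entities of small categories with the nearest large category."""
    groups = defaultdict(list)
    for label, point in zip(labels, points):
        groups[label].append(point)
    # Center coordinates of the initial cluster formed by each category.
    centroid = {
        c: [sum(col) / len(pts) for col in zip(*pts)]
        for c, pts in groups.items()
    }
    large = [c for c, pts in groups.items() if len(pts) >= min_count]
    if not large:
        raise ValueError("no category reaches min_count")
    mapping = {}
    for c in groups:
        if c in large:
            mapping[c] = c
        else:
            mapping[c] = min(
                large, key=lambda big: dist(centroid[c], centroid[big])
            )
    return [mapping[label] for label in labels]
```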

The stratification variable proposal unit 362 stratifies the data according to the analysis method, regression or classification, and proposes a stratification variable and an optimal stratification criterion.

In one embodiment, the stratification variable proposal unit 362 may determine the stratification variable and the optimal stratification criterion according to regression analysis. In the case of regression analysis, the stratification variable proposal unit 362 may search over the stratification variable and select the optimal boundary value. The stratification variable proposal unit 362 may train a multivariate linear model on the stratified data to derive result values in advance. Here, the stratification variable proposal unit 362 may determine the optimal boundary value based on the mean squared error (MSE) and the heterogeneity ratio.

In another embodiment, the stratification variable proposal unit 362 may determine the stratification variable and the optimal stratification criterion according to classification analysis. In the case of classification analysis, the stratification variable proposal unit 362 may search for boundary values over all variables and then select the optimal stratification criterion.

The derived variable recommendation unit 370 recommends at least one of the plurality of derived variables generated by at least one of the plurality of sub derived variable generation units 350. The derived variable recommendation unit 370 may generate a plurality of derived variables using at least one sub derived variable generation unit 350 and recommend at least one of the generated derived variables based on its correlation with the dependent variable.

In one embodiment, the derived variable recommendation unit 370 may recommend, among the plurality of derived variables generated using at least one sub derived variable generation unit 350, at least one derived variable whose correlation with the dependent variable is greater than or equal to a preset value. In another embodiment, the derived variable recommendation unit 370 may recommend at least one of the generated derived variables in descending order of correlation with the dependent variable.
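Both recommendation rules reduce to simple operations on a table of correlation values. The sketch below assumes the correlations have already been computed and are held in a dictionary keyed by derived variable name; the function names and the sample figures are illustrative.

```python
def recommend_by_threshold(correlations, threshold):
    """Recommend derived variables whose correlation meets a preset value."""
    return [name for name, r in correlations.items() if r >= threshold]


def recommend_top_k(correlations, k):
    """Recommend the k derived variables most correlated with Y."""
    ranked = sorted(correlations, key=correlations.get, reverse=True)
    return ranked[:k]
```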

In one embodiment, at least one of the plurality of sub derived variable generation units 350 may be selected by the user through the derived variable recommendation unit 370. The derived variable recommendation unit 370 may output a screen including a plurality of derived variable transformation selection items corresponding respectively to the plurality of sub derived variable generation units 350. For example, a plurality of derived variable transformation selection items corresponding respectively to the normalization transformation unit 351, the cumulative sum variable generation unit 352, the variable division unit 353, the variable combination unit 354, the variable categorization unit 355, the dummy variable generation unit 356, the interaction variable generation unit 357, the maximum correlation lag variable generation unit 358, the influence factor detection unit 359, the large category clustering unit 361, and the stratification variable proposal unit 362 may be output on the screen.

The user may select at least one of the plurality of derived variable transformation selection items through the screen. For example, the user may select the derived variable transformation selection items corresponding to the interaction variable generation unit 357 and the influence factor detection unit 359.

In another embodiment, the derived variable recommendation unit 370 may recommend at least one of the plurality of sub derived variable generation units 350. The derived variable recommendation unit 370 may make this recommendation based on the data type.

For example, when the data types of the independent variables included in the refined data include the numeric type, the derived variable recommendation unit 370 may recommend, among the plurality of sub derived variable generation units 350, the normalization transformation unit 351, the cumulative sum variable generation unit 352, and the interaction variable generation unit 357. Each of the normalization transformation unit 351, the cumulative sum variable generation unit 352, and the interaction variable generation unit 357 can generate a meaningful derived variable from numeric-type data.

As another example, when the data types of the independent variables included in the refined data include the date type, the derived variable recommendation unit 370 may recommend the variable division unit 353.

As yet another example, when the independent variables included in the refined data are categorical variables having a large number of unique values, the derived variable recommendation unit 370 may recommend the large category clustering unit 361.
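The three type-based examples above amount to a lookup from data type to generation units, which can be sketched as follows. The string identifiers, the `(type, unique_count)` representation, and the cutoff of 50 unique values for a "large" category count are assumptions made for illustration.

```python
NUMERIC_GENERATORS = [
    "normalization_transformation",  # unit 351
    "cumulative_sum",                # unit 352
    "interaction",                   # unit 357
]


def recommend_generators(variables, large_cardinality=50):
    """variables maps a variable name to (data_type, unique_count)."""
    recommended = []

    def add(names):
        for name in names:
            if name not in recommended:
                recommended.append(name)

    for data_type, unique_count in variables.values():
        if data_type == "numeric":
            add(NUMERIC_GENERATORS)
        elif data_type == "date":
            add(["variable_division"])  # unit 353
        elif data_type == "categorical" and unique_count >= large_cardinality:
            add(["large_category_clustering"])  # unit 361
    return recommended
```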

At least one of the sub derived variable generation units 350 recommended by the derived variable recommendation unit 370 may be selected by the user. The derived variable recommendation unit 370 may output a screen including derived variable transformation selection items corresponding respectively to the recommended sub derived variable generation units 350. The user may select at least one of the derived variable transformation selection items through the screen.

Meanwhile, when two or more sub derived variable generation units 350 are selected by the user, the derived variable recommendation unit 370 may determine the order of the two or more sub derived variable generation units 350.

In one embodiment, the order of the two or more selected sub derived variable generators 350 may be determined by the user. While selecting two or more of the plurality of sub derived variable generators 350, the user may also select the order in which the selected generators produce derived variables. For example, the user may select, through the screen, the derived variable conversion selection items corresponding to the interaction variable generation unit 357, the maximum correlation lag variable generation unit 358, and the influence factor detection unit 359, and may also select the derived variable generation order for these three units. The user may choose the order interaction variable generation unit 357, maximum correlation lag variable generation unit 358, influence factor detection unit 359, or alternatively the order maximum correlation lag variable generation unit 358, interaction variable generation unit 357, influence factor detection unit 359. A different generation order may yield different derived variables.

In another embodiment, the derived variable recommendation unit 370 may recommend a derived variable generation order for the two or more selected sub derived variable generators 350. To do so, it may determine a plurality of derived variable generation order groups that include the two or more sub derived variable generators 350. Here, a derived variable generation order group represents an execution order of the two or more sub derived variable generators 350. For example, when the interaction variable generation unit 357 and the influence factor detection unit 359 are selected from the plurality of sub derived variable generators 350, the derived variable recommendation unit 370 may determine a first group ordered {interaction variable generation unit 357, influence factor detection unit 359} and a second group ordered {influence factor detection unit 359, interaction variable generation unit 357}.

The derived variable recommendation unit 370 may recommend one of the plurality of derived variable generation order groups. For each group, it may generate a derived variable by running the sub derived variable generators in that group's order, and it may then calculate the correlation coefficient between the derived variable generated by each group and the dependent variable. The derived variable recommendation unit 370 may recommend the derived variable generation order group that produced the derived variable with the largest correlation coefficient with the dependent variable.
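The order recommendation described above can be sketched as an enumeration of every execution order, followed by a correlation comparison. This is an illustrative sketch under stated assumptions: `cumulative_sum` and `square` are toy stand-ins for the real sub derived variable generators, `recommend_order` and `pearson` are hypothetical names, and ranking by the absolute value of the Pearson coefficient is an assumption (the description only says "largest correlation coefficient").

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Toy stand-ins for sub derived variable generators (assumptions, not from the patent).
def cumulative_sum(xs):
    total, out = 0.0, []
    for x in xs:
        total += x
        out.append(total)
    return out


def square(xs):
    return [x * x for x in xs]


def recommend_order(selected, base, target):
    """Try every execution order of the selected generators.

    selected: dict mapping a generator name to a function list -> list.
    base: values of the reference (independent) variable.
    target: values of the dependent variable.
    Returns (best_order, scores), where scores maps each order to |r|.
    """
    scores = {}
    for order in permutations(selected):  # each permutation is one "order group"
        derived = base
        for name in order:
            derived = selected[name](derived)
        scores[order] = abs(pearson(derived, target))
    best = max(scores, key=scores.get)
    return best, scores
```

With k selected generators this evaluates k! orders, so the exhaustive search is only practical for a small selection.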

A derived variable may be generated by combining a plurality of derived variable generation methods, and its value may vary with the order in which those methods are applied. A user may spend considerable time finding a reference variable among a plurality of independent variables and searching for the combination of derived variable generation methods that yields an optimal derived variable from that reference variable. Because the order of the methods must also be considered, the number of cases the user has to examine grows further. For this reason, a user without knowledge and experience in data analysis may fail to derive appropriate derived variables, which lowers the reliability of the preprocessed data.

The data preprocessing system 1000 according to an embodiment of the present invention may recommend derived variables that are highly correlated with the dependent variable. Furthermore, the data preprocessing system 1000 may recommend the order of the plurality of derived variable generation methods used to derive an optimal derived variable. Accordingly, the user can reduce the time previously spent finding an optimal reference variable and an optimal derived variable generation method.

As a result, the data preprocessing system 1000 according to an embodiment of the present invention can generate highly reliable preprocessed data even if the user has no analysis experience.

Next, the preprocessing automation unit 216 preprocesses the preprocessing target collected data stored in the data storage unit 110 in a predetermined order. The preprocessing order may be set in advance by an administrator.

In one embodiment, as shown in FIG. 9, the preprocessing automation unit 216 may proceed in the order of data preparation, data refinement, normalization transformation, and derived variable generation.
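The fixed stage order can be pictured as a list of named stages executed one after another. The sketch below assumes toy stage functions (`prepare`, `clean`, `normalize`, `derive`) that merely stand in for the units described in this document; only the ordering of the four stages is taken from the text.

```python
import math


# Toy stage implementations (assumptions for illustration only).
def prepare(rows):
    return list(rows)                       # load a working copy into the "project"


def clean(rows):
    return [r for r in rows if r is not None]   # drop missing values


def normalize(rows):
    return [math.sqrt(r) for r in rows]     # one possible normalization transform


def derive(rows):
    return [(r, r * r) for r in rows]       # attach a toy derived variable


# Stage order taken from the description; an administrator could preset another.
DEFAULT_ORDER = [
    ("data preparation", prepare),
    ("data refinement", clean),
    ("normalization transformation", normalize),
    ("derived variable generation", derive),
]


def run_preprocessing(rows, order=DEFAULT_ORDER):
    """Run each stage in the preset order and record which stages ran."""
    executed = []
    for name, stage in order:
        rows = stage(rows)
        executed.append(name)
    return rows, executed
```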

Specifically, when a project, which is a preprocessing workspace, has been created and an automatic preprocessing request is input by the user, the preprocessing automation unit 216 may read the preprocessing target collected data, from among the collected data stored in the data storage unit 110, into that project (S901).

Then, the preprocessing automation unit 216 may select, from the preprocessing target collected data, the independent variables to be preprocessed and the dependent variable to be predicted (S902).

At this time, if only some of the independent variables are to be preprocessed, the preprocessing automation unit 216 may remove the remaining independent variables from the set of independent variables to be preprocessed (S903).

Then, the preprocessing automation unit 216 may designate the type of the date variable (S904). Once the type of the date variable is designated (S905), the preprocessing automation unit 216 may automatically diagnose, through the automatic quality diagnosis unit 322, the quality of the data containing the independent variables to be preprocessed and the dependent variable, and may output a data quality diagnosis report screen containing a plurality of items so that the user can check the diagnosis result (S906).

Then, the preprocessing automation unit 216 may refine, through the automatic data refinement unit 341 and according to a preset processing order, the data containing the independent variables to be preprocessed and the dependent variable (S907).

Then, the preprocessing automation unit 216 may, through the normalization conversion unit 351, transform the distribution of the numeric variables in the refined data toward a normal distribution (S908). The normalization conversion unit 351 may select one of a plurality of normalization transformations for the refined data. The normalization conversion unit 351 may perform a normality test on the variables of the refined data, for example the Lilliefors test or the Kolmogorov-Smirnov test.

The normalization conversion unit 351 may check, for each of the plurality of normalization transformations, whether its precondition is satisfied. For example, the normalization conversion unit 351 may skip the log transformation and the square root transformation when the data contains negative numbers. As another example, it may skip the reciprocal transformation when the data contains 0.
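The precondition checks above, combined with a normality statistic, can be sketched as follows. The names `eligible_transforms`, `ks_distance_to_normal`, and `pick_transform` are hypothetical. Two points go beyond the text and are assumptions: the sketch also skips the log transformation when the data contains 0 (since log 0 is undefined), and it picks the candidate with the smallest Kolmogorov-Smirnov distance to a normal distribution fitted with the sample mean and standard deviation (the statistic used by the Lilliefors test). The description does not state the selection criterion.

```python
import math


def eligible_transforms(values):
    """Return {name: function} for transforms whose preconditions hold."""
    has_negative = any(v < 0 for v in values)
    has_zero = any(v == 0 for v in values)
    transforms = {}
    if not has_negative:
        transforms["sqrt"] = math.sqrt
        if not has_zero:
            # Assumption beyond the text: log also needs strictly positive data.
            transforms["log"] = math.log
    if not has_zero:
        transforms["reciprocal"] = lambda v: 1.0 / v
    return transforms


def ks_distance_to_normal(values):
    """KS distance between the empirical CDF and a normal CDF fitted to the
    sample mean and standard deviation. Requires at least two values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    if var == 0:
        return 1.0  # a constant column cannot look normal
    std = math.sqrt(var)
    d = 0.0
    for i, v in enumerate(sorted(values)):
        cdf = 0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0))))
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def pick_transform(values):
    """Choose the candidate (including no transform) closest to normal."""
    candidates = {"identity": lambda v: v}
    candidates.update(eligible_transforms(values))
    distances = {
        name: ks_distance_to_normal([fn(v) for v in values])
        for name, fn in candidates.items()
    }
    return min(distances, key=distances.get), distances
```

A production system would more likely compare p-values from a statistics library; the raw distance is used here only to keep the sketch dependency-free.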

Then, the preprocessing automation unit 216 may, through the interaction variable generation unit 357, generate derived variables by performing operations on two or more independent variables in the refined data (S909). The interaction variable generation unit 357 may create a plurality of new independent variables by multiplying two or more independent variables, averaging them, or taking their median, and may store the new independent variables in a new data frame. For each of the new independent variables, the interaction variable generation unit 357 may calculate the correlation coefficient with the dependent variable, and may designate as derived variables those whose correlation coefficient with the dependent variable is greater than or equal to a preset threshold.
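Step S909 can be illustrated with a short sketch that combines independent variables row by row and keeps the combinations that correlate strongly with the dependent variable. The function names, the default threshold of 0.8, the use of every combination of two or more variables, and the comparison on the absolute value of the coefficient are all assumptions; the description specifies only the three operations and a preset threshold.

```python
from itertools import combinations
from math import prod, sqrt
from statistics import mean, median

# The three operations named in the description.
OPERATIONS = {"product": prod, "mean": mean, "median": median}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def interaction_variables(independent, target, threshold=0.8):
    """Build candidate variables and keep those with |r| >= threshold.

    independent: dict mapping a variable name to a list of numbers.
    target: values of the dependent variable.
    Returns (selected, scored): the kept candidate columns and all coefficients.
    """
    candidates = {}
    names = sorted(independent)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            rows = list(zip(*(independent[name] for name in combo)))
            for op_name, op in OPERATIONS.items():
                key = f"{op_name}({', '.join(combo)})"
                candidates[key] = [op(row) for row in rows]
    scored = {key: pearson(col, target) for key, col in candidates.items()}
    selected = {key: candidates[key] for key, r in scored.items() if abs(r) >= threshold}
    return selected, scored
```

Note that for a pair of variables the mean and the median coincide, so the median only adds information for combinations of three or more variables.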

Finally, the preprocessing automation unit 216 may automatically diagnose the quality of the preprocessed data and output a data quality diagnosis report screen containing a plurality of items so that the user can check the diagnosis result (S906).

The data preprocessing system 1000 according to an embodiment of the present invention can perform preprocessing automatically, diagnose the quality of the preprocessed data, and output the diagnosis result to a screen. It can also preprocess only some of the plurality of independent variables in the collected data, diagnose the quality of the resulting data, and output that diagnosis result to the screen. The user can easily check the quality of the preprocessed data and, if the result is unsatisfactory, repeat the preprocessing with a different choice of independent variables. Accordingly, the user can easily grasp the correlations among the reference variable, the derived variables derived from it, and the dependent variable, and can obtain highly reliable preprocessed data without analysis experience.

Those skilled in the art to which the present invention pertains will understand that the invention described above may be embodied in other specific forms without changing its technical spirit or essential features.

Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

1000: data preprocessing system 100: data management unit
110: data storage unit 120: data information editing unit
200: project management unit 211: data preparation unit
212: data diagnosis unit 213: data merging unit
214: data refinement unit 215: derived variable generation unit
216: preprocessing automation unit 220: project information editing unit

Claims

A data preprocessing system comprising:
a data storage unit that stores collected data collected by a data collection device;
a data preparation unit that loads preprocessing target collected data, from among the collected data stored in the data storage unit, into a project, which is a preprocessing workspace;
a data refinement unit that refines the loaded preprocessing target collected data; and
a derived variable generation unit that includes a plurality of sub derived variable generators for generating derived variables from the data refined by the data refinement unit, selects at least one of the plurality of sub derived variable generators, and recommends at least one of a plurality of derived variables generated based on the selected at least one sub derived variable generator,
wherein the derived variable generation unit includes a derived variable recommendation unit that, when two or more of the plurality of sub derived variable generators are selected, recommends a derived variable generation order corresponding to an execution order of the two or more sub derived variable generators.

The data preprocessing system of claim 1,
wherein the derived variable recommendation unit
recommends at least one of the plurality of derived variables generated based on the selected at least one sub derived variable generator, in descending order of correlation coefficient with a dependent variable.

(Deleted)

The data preprocessing system of claim 1,
wherein the derived variable recommendation unit
determines a plurality of derived variable generation order groups, each including the two or more sub derived variable generators and indicating an execution order of the two or more sub derived variable generators, and recommends one of the plurality of derived variable generation order groups.

The data preprocessing system of claim 4,
wherein the derived variable recommendation unit
generates, for each of the plurality of derived variable generation order groups, a derived variable according to the execution order of the sub derived variable generators, and selects and recommends the derived variable generation order group whose generated derived variable has the highest correlation coefficient with a dependent variable.

The data preprocessing system of claim 1,
wherein the derived variable recommendation unit
recommends at least two of the plurality of sub derived variable generators based on at least one of a data type and data type, and selects at least one of the recommended sub derived variable generators based on a user input.

The data preprocessing system of claim 1,
wherein the plurality of sub derived variable generators generate derived variables in mutually different ways.

The data preprocessing system of claim 1,
wherein the data refinement unit comprises:
a plurality of sub data refinement units having different data refinement methods; and an automatic data refinement unit that refines the preprocessing target collected data by executing the plurality of sub data refinement units in a preset order.

The data preprocessing system of claim 1,
further comprising a data diagnosis unit that diagnoses the quality of the preprocessing target collected data loaded by the data preparation unit and outputs a diagnosis result to a screen.

The data preprocessing system of claim 1,
further comprising a preprocessing automation unit that selects independent variables to be preprocessed and a dependent variable from the preprocessing target collected data loaded by the data preparation unit, and preprocesses the selected independent variables and dependent variable in the order of data refinement, normalization transformation, and derived variable generation.

The data preprocessing system of claim 10,
wherein the preprocessing automation unit
generates a plurality of new independent variables by performing operations on two or more of the independent variables to be preprocessed, and generates, as the derived variable, at least one independent variable selected from the generated plurality of independent variables in descending order of correlation coefficient with the dependent variable.