KR20220090358A

KR20220090358A - Device and method for analyzing and visualizing big data

Info

Publication number: KR20220090358A
Application number: KR1020210031377A
Authority: KR
Inventors: 하광림; 강인지; 전혜경; 조용학; 강인호
Original assignee: 주식회사 씨에스리
Priority date: 2020-12-22
Filing date: 2021-03-10
Publication date: 2022-06-29

Abstract

The present invention relates to a big data analysis and visualization apparatus and method, and more particularly to an apparatus and method that analyze big data with various analysis models through a GUI-based block connection scheme and provide visualization information suited to the results. According to an embodiment of the present invention, the big data analysis and visualization apparatus and method can analyze big data using various analysis models without program coding.

Description

Big data analysis visualization device and method {DEVICE AND METHOD FOR ANALYZING AND VISUALIZING BIG DATA}

The present invention relates to a big data analysis and visualization apparatus and method, and more particularly to an apparatus and method that analyze big data with various analysis models, without coding, through a cloud-based block connection scheme and provide visualization information suited to the results.

Big data refers to large-scale data generated in digital environments. It is vast in scale and includes not only structured data such as numerical data but also unstructured data such as text and video. As cloud services have been commercialized with advances in information and communication technology, interest in technologies for efficiently processing large volumes of big data has grown. In particular, driven by the rise of the Internet of Things, an enormous volume and variety of data is being generated at every moment.

Such big data needs to be processed through new algorithms or paradigms that differ from conventional data processing methods, and processing and analysis tailored to users' needs make it possible to create a variety of value from big data.

With the recent emergence of high-performance portable devices such as tablets and smartphones in addition to PCs, the number of people who use mobile connections for shopping, searching, checking e-mail, and the like, in addition to accessing the Internet from desktop PCs, has grown considerably. With the spread of these portable devices and the development of mobile Internet technology, much of the data on the Internet is being collected by web robots, web crawlers, spiders, and the like, and the collected big data is analyzed and used for the desired purpose.

Existing data analysis systems analyze big data based on data analysis code written in programming languages such as Scala and Python. In other words, a user who has learned such a language can write data analysis code, but a user who has not finds it difficult to write such code and to intuitively understand a data analysis flow written by another user, which also makes maintenance difficult.

Accordingly, there is a need for a data analysis technology that helps even users who have not learned a programming language such as Python or Scala to easily create a data analysis flow that they wish to control or modify for a particular data file.

The background art of the present invention is disclosed in Korean Patent Laid-Open Publication No. 10-2013-0155808.

The present invention provides a big data analysis and visualization apparatus and method that analyze big data with various analysis models, without program coding, through a cloud-based block connection scheme.

The present invention provides a big data analysis and visualization apparatus and method that make it easy to create a workflow by connecting functional blocks in an appropriate position and order.

The present invention provides a big data analysis and visualization apparatus and method that offer, as templates, big data analysis workflows or analysis scenarios that can be reused for the same kind of big data analysis.

The present invention provides a big data analysis and visualization apparatus and method in which each process can be executed and checked block by block through a GUI.

The present invention provides a big data analysis and visualization apparatus and method that recommend and render a visualization method suited to various analysis results.

The present invention provides a big data analysis and visualization apparatus and method that check input data for errors and recommend replacement values so that the analysis model can be run.

The present invention provides a big data analysis and visualization apparatus and method that recommend hyperparameters so that the analysis model runs with optimal performance.

The present invention provides a big data analysis and visualization apparatus and method that improve the efficiency and accuracy of the analysis process by recommending the functional block for the next step when a workflow that sets out the steps of a big data analysis is being created.

The present invention provides a big data analysis and visualization apparatus and method that improve the efficiency of the analysis process and the accuracy of the analysis by recommending an analysis model, workflow, or template suited to a dataset.

The present invention provides a big data analysis and visualization apparatus and method that analyze an input dataset or a selected functional block, place the recommended blocks, and complete the workflow.

According to one aspect of the present invention, a big data analysis and visualization apparatus is provided.

A big data analysis and visualization apparatus according to an embodiment of the present invention may include a collection unit that collects the data to be analyzed, a preprocessing unit that preprocesses the data to fit an analysis model, an analysis unit that analyzes the data with the analysis model, a visualization unit that visualizes the result of running the analysis model as a suitable graph, and an execution unit that maps the data collection, preprocessing, analysis, and visualization processes to functional blocks and executes them.

According to another aspect of the present invention, a big data analysis and visualization method, and a computer-readable recording medium on which a computer program for executing the method is recorded, are provided.

A big data analysis and visualization method according to an embodiment of the present invention may include collecting the data to be analyzed, preprocessing the data to fit an analysis model, analyzing the data using the analysis model, visualizing the analyzed data, and mapping the data collection, preprocessing, analysis, and visualization steps to functional blocks and executing them.

According to an embodiment of the present invention, big data can be analyzed using various analysis models, without program coding, through a cloud-based block connection scheme.

According to an embodiment of the present invention, an analysis workflow can be created easily by connecting recommended blocks in an appropriate position and order.

According to an embodiment of the present invention, big data can be analyzed easily because big data analysis workflows or analysis scenarios that can be reused for the same kind of analysis are provided as templates.

According to an embodiment of the present invention, each process can be executed and checked block by block through a GUI.

According to an embodiment of the present invention, a visualization method suited to various analysis results can be recommended and rendered.

According to an embodiment of the present invention, the analysis model can be run after input data is checked for errors and replacement values are recommended.

According to an embodiment of the present invention, the analysis model can be run with optimal performance by recommending hyperparameters.

According to an embodiment of the present invention, the efficiency of the analysis process and the accuracy of the analysis can be improved by recommending an analysis model, workflow, or template suited to a dataset.

According to an embodiment of the present invention, the efficiency of the analysis process and the accuracy of the analysis can be improved by recommending the functional block for the next step when a workflow that sets out the steps of a big data analysis is being created.

According to an embodiment of the present invention, a workflow can be completed by analyzing the current stage of the work and recommending and placing a suitable set of blocks.

FIGS. 1 to 20 are diagrams for explaining a big data analysis and visualization apparatus according to an embodiment of the present invention.
FIGS. 21 to 31 are diagrams for explaining a big data analysis and visualization method according to an embodiment of the present invention.
FIGS. 32 to 36 are example screens of a big data analysis and visualization apparatus according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the invention. In addition, singular expressions used in this specification and the claims should generally be construed to mean "one or more" unless stated otherwise.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIGS. 1 to 20 are diagrams for explaining a big data analysis and visualization apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the big data analysis and visualization apparatus 10 lets data in various formats be used easily by dragging and dropping it with a mouse, without any separate conversion step. For example, the apparatus 10 can collect the data to be analyzed by connecting Excel files, CSV files, RDS files, TXT files, and databases via drag and drop. The apparatus 10 can also collect data directly using the provided OpenAPI.

The big data analysis and visualization apparatus 10 preprocesses the collected data, which may be in various formats. The apparatus 10 performs various kinds of preprocessing to bring the data into the format required by the analysis model. Preprocessing tasks such as renaming columns, setting types, merging datasets, sampling, and grouped operations (Group By), which account for 70% of big data analysis work, can be handled easily through web page operations and the results saved to a file. For example, the apparatus 10 can save data in various formats, such as CSV, XML, YML, JSON, TXT, and log files, or in the format of the input data.
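
The preprocessing operations named above (column renaming, type setting, dataset merging, sampling, and Group By) can be pictured in code. The following is a minimal sketch using only the Python standard library. The sample records and all function names are hypothetical and are not taken from the apparatus, which exposes these operations through its web interface rather than through code.

```python
import random
from collections import defaultdict

def rename_column(rows, old, new):
    """Return rows with column `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

def set_type(rows, column, caster):
    """Cast one column to a target type (for example, str -> float)."""
    return [{**row, column: caster(row[column])} for row in rows]

def merge(left, right, key):
    """Inner-join two datasets on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

def sample(rows, k, seed=0):
    """Draw k rows without replacement; the seed keeps runs repeatable."""
    return random.Random(seed).sample(rows, k)

def group_sum(rows, by, value):
    """Group By `by` and sum `value` within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[by]] += row[value]
    return dict(totals)

# Hypothetical input data.
sales = [
    {"store": "A", "amt": "10.5"},
    {"store": "B", "amt": "4.0"},
    {"store": "A", "amt": "2.5"},
]
regions = [{"store": "A", "region": "north"}, {"store": "B", "region": "south"}]

rows = rename_column(sales, "amt", "amount")
rows = set_type(rows, "amount", float)
rows = merge(rows, regions, "store")
totals = group_sum(rows, "region", "amount")
subset = sample(rows, 2)
```

Each helper returns a new list instead of modifying its input, which mirrors the idea that every preprocessing block hands a fresh dataset to the next block.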

The big data analysis and visualization apparatus 10 can visualize analysis data through various graphs. The apparatus 10 can visualize analysis data in terms of time, distribution, relationship, comparison, space, and so on. For example, the apparatus 10 can visualize analysis data in various forms using graphs such as line charts, pie charts, bar charts, histograms, bubble charts, scatter plots, box plots, and word clouds.

The big data analysis and visualization apparatus 10 can provide visualization results as files. For example, the apparatus 10 can provide analysis results in various formats such as web pages, PDF, MS-Word, and CSV.

The big data analysis and visualization apparatus 10 can analyze the preprocessed data using various analysis models. For example, the apparatus 10 can apply data to and run various analysis models, including supervised learning such as correlation analysis, regression analysis, decision trees, kNN, random forests, and MLP, and unsupervised learning such as association analysis, K-means, PCA, and hierarchical clustering, without writing program code in a language such as R or Python.

The big data analysis and visualization apparatus 10 creates a workflow by mapping functions to blocks. The desired functional blocks are added to the workflow by clicking or by drag and drop and are then connected to form an analysis and visualization workflow. Because the apparatus 10 is GUI-based, a workflow covering data collection, preprocessing, analysis, and visualization can be created easily by dragging, dropping, and connecting blocks. The apparatus 10 can run the workflow one block step at a time.
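
To make the block-connection idea concrete, the sketch below models a workflow as a small directed graph of blocks and runs the blocks in dependency order. It is an illustration only: the `Workflow` class, its method names, and the four sample blocks are assumptions introduced here, and the source does not describe how the apparatus represents workflows internally.

```python
from graphlib import TopologicalSorter

class Workflow:
    """A hypothetical workflow: named blocks plus connections between them."""

    def __init__(self):
        self.blocks = {}   # block name -> callable
        self.inputs = {}   # block name -> names of upstream blocks, in order

    def add_block(self, name, func):
        self.blocks[name] = func
        self.inputs[name] = []

    def connect(self, source, target):
        """Feed the output of `source` into `target`."""
        self.inputs[target].append(source)

    def run(self):
        """Run every block after its upstream blocks; return all outputs."""
        results = {}
        # TopologicalSorter takes {node: predecessors} and yields
        # predecessors before the nodes that depend on them.
        for name in TopologicalSorter(self.inputs).static_order():
            args = [results[src] for src in self.inputs[name]]
            results[name] = self.blocks[name](*args)
        return results

wf = Workflow()
wf.add_block("collect", lambda: [3, 1, None, 2])
wf.add_block("preprocess", lambda data: [x for x in data if x is not None])
wf.add_block("analyze", lambda data: sum(data) / len(data))
wf.add_block("visualize", lambda mean: f"mean = {mean:.1f}")
wf.connect("collect", "preprocess")
wf.connect("preprocess", "analyze")
wf.connect("analyze", "visualize")

results = wf.run()
```

Because `run` keeps every block's output in `results`, an intermediate step such as `results["preprocess"]` can be inspected on its own, which loosely corresponds to checking the workflow block by block.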

The big data analysis and visualization apparatus 10 can reuse a created workflow for analyses of the same kind.

The big data analysis and visualization apparatus 10 provides frequently used analysis scenarios as templates, so that big data can be analyzed without writing program code.

The big data analysis and visualization apparatus 10 can run in a cloud platform environment.

Referring to FIG. 2, the big data analysis and visualization apparatus 10 includes a collection unit 100, a preprocessing unit 200, an analysis unit 300, and a visualization unit 400.

Referring to FIG. 3, the collection unit 100 can collect the data to be analyzed from data in various forms. For example, the collection unit 100 can collect data by uploading files in Excel, TEXT, or CSV format. The collection unit 100 can also collect the data to be analyzed by connecting directly to a relational database. Data can also be created on the spot by typing or pasting it in directly. The collection unit 100 can also collect data directly using the provided OpenAPI.

Referring to FIG. 4, the preprocessing unit 200 preprocesses the collected data in various ways to fit the analysis model. The preprocessing unit 200 can increase the accuracy of the analysis by applying preprocessing suited to the input dataset. The preprocessing unit 200 can receive a recommendation of a preprocessing method based on analysis of the input dataset.

Referring to FIG. 5, the big data analysis and visualization apparatus 10 analyzes data using an analysis model. The analysis unit 300 runs various analysis models without coding. The analysis unit 300 can receive a recommendation of an analysis model suited to the input dataset.

Referring to FIG. 6, the visualization unit 400 visualizes analysis data using various visualization graphs. The visualization unit 400 can be run before or after data analysis.

The visualization unit 400 can be used before a data analysis model is trained, mainly to inspect the shape of the data. For example, before model training, the visualization unit 400 can use a histogram visualization functional block for checking changes in time-series data, a scatter plot or heat map visualization functional block for checking correlations between variables, a word cloud visualization functional block for checking data frequency, and a box plot visualization block for checking outliers in an input data column.

Referring to FIG. 7, after the analysis model is trained, the visualization unit 400 can present an analysis result report using visualization functional blocks. For example, when data is classified, the visualization unit 400 can use a pie chart visualization block to produce an analysis result report on the classification hit values obtained after applying test data.

The visualization unit 400 can preprocess the cases in which the actual value and the predicted value match into hit values, run them through the pie chart visualization functional block, and produce an analysis visualization report showing that the model was correct 88.9% of the time.
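
The hit-value calculation behind such a pie chart is simple arithmetic. The sketch below is a minimal illustration. The label lists are made up so that 8 of 9 predictions match, which reproduces an 88.9% figure, but the source does not state the actual counts behind its example. The function name `hit_rate` is also an assumption.

```python
def hit_rate(actual, predicted):
    """Share of test cases where the predicted value equals the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical test-set labels: 8 of the 9 predictions match.
actual    = ["a", "b", "a", "c", "b", "a", "c", "b", "a"]
predicted = ["a", "b", "a", "c", "b", "a", "c", "b", "c"]

rate = hit_rate(actual, predicted)
# The two slices a pie chart block would draw, as percentages.
pie_slices = {"hit": round(rate * 100, 1), "miss": round((1 - rate) * 100, 1)}
```

The two values in `pie_slices` are the proportions a pie chart block would render as its "hit" and "miss" segments.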

Referring to FIG. 8, the execution unit 500 maps each process of the collection unit 100, the preprocessing unit 200, the analysis unit 300, and the visualization unit 400 to blocks, so that the functional block mapped to each process can be selected. The execution unit 500 analyzes and visualizes big data without program coding by selecting, placing, and connecting functional blocks. Each functional block can be selected and connected by drag and drop, and a workflow is created by selecting, placing, and connecting functional blocks. When functional blocks are selected, the execution unit 500 uses connection points to show whether a block belongs in the middle, at the start, or at the end, which minimizes errors when a workflow is created. For example, a starting functional block has no connection point on its left side and has one only on its right side. A block with one preceding task has one connection point on the left, and a block with two preceding tasks shows two. Subsequent tasks are shown the same way, with the number of connection points matching the number of tasks. The last functional block has no connection point on its right side. The execution unit 500 distinguishes the stages by giving the functional blocks of each stage a different color.
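
The connection-point rule described above lends itself to a simple consistency check: a workflow is well formed when every block has exactly as many incoming connections as left-side points and as many outgoing connections as right-side points. The following sketch assumes a hypothetical `Block` record and `check_connections` helper. The source describes only the visual convention, not any validation code.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    left_points: int    # one per preceding task (0 for a starting block)
    right_points: int   # one per subsequent task (0 for the last block)

def check_connections(blocks, edges):
    """List every block whose connections do not match its connection points."""
    incoming = {b.name: 0 for b in blocks}
    outgoing = {b.name: 0 for b in blocks}
    for source, target in edges:
        outgoing[source] += 1
        incoming[target] += 1
    problems = []
    for b in blocks:
        if incoming[b.name] != b.left_points:
            problems.append(
                f"{b.name}: {incoming[b.name]} of {b.left_points} left points connected")
        if outgoing[b.name] != b.right_points:
            problems.append(
                f"{b.name}: {outgoing[b.name]} of {b.right_points} right points connected")
    return problems

# Hypothetical workflow: two starting blocks feed a merge block, then a chart.
blocks = [
    Block("load_sales", 0, 1),
    Block("load_regions", 0, 1),
    Block("merge", 2, 1),
    Block("chart", 1, 0),
]
complete = [("load_sales", "merge"), ("load_regions", "merge"), ("merge", "chart")]

ok = check_connections(blocks, complete)
# Drop the second connection to simulate an unfinished workflow.
broken = check_connections(blocks, [complete[0], complete[2]])
```

In the unfinished case, both the unconnected starting block and the merge block with a free left point are reported, which is the kind of mistake the visual connection points are meant to make obvious.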

The execution unit 500 can determine whether a dataset contains missing values or outliers and recommend a preprocessing functional block that corrects them to normal values or normal categories. It can also predict and provide the normal value or normal category for each missing value and outlier.
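
The source does not say how missing values and outliers are detected or how replacement values are chosen. As one possible illustration, the sketch below flags `None` entries as missing, uses the common interquartile-range (IQR) rule for outliers, and proposes the median of the remaining values as a replacement. The IQR rule, the median choice, and all names are assumptions made for this example.

```python
import statistics

def find_issues(values, k=1.5):
    """Return indices of missing values, indices of outliers, and the normal range."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    missing = [i for i, v in enumerate(values) if v is None]
    outliers = [i for i, v in enumerate(values)
                if v is not None and not (low <= v <= high)]
    return missing, outliers, (low, high)

def recommend_replacement(values, outliers):
    """Suggest the median of the values that are neither missing nor outliers."""
    clean = [v for i, v in enumerate(values)
             if v is not None and i not in outliers]
    return statistics.median(clean)

# Hypothetical column with one missing value and one extreme value.
column = [10, 12, None, 11, 13, 12, 95, 11]

missing, outliers, normal_range = find_issues(column)
replacement = recommend_replacement(column, outliers)
```

The returned `normal_range` plays the role of a "normal category" that could be shown to the user as a guideline, and `replacement` is a candidate value for both the missing entry and the outlier.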

The execution unit 500 analyzes and learns from the dataset, the trained analysis model, and the training results, and recommends hyperparameters suited to the dataset and the analysis model. For example, the execution unit 500 can recommend hyperparameters that improve the performance of the analysis model for the input dataset.
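
How the execution unit arrives at its hyperparameter recommendation is not specified in the source. One simple stand-in for the idea is an exhaustive search that scores each candidate value and recommends the best one. The sketch below does this for the `k` of a tiny one-dimensional kNN classifier, scored by leave-one-out accuracy. The classifier, the scoring method, the data, and all names are assumptions for illustration and should not be read as the patented method.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict a label from the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

def recommend_k(data, candidates):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical labelled measurements.
data = [(1.0, "low"), (1.3, "low"), (1.1, "low"), (2.0, "low"),
        (2.9, "high"), (3.2, "high"), (3.0, "high")]
candidates = [1, 3, 5]

best_k, scores = recommend_k(data, candidates)
```

Returning the full `scores` table alongside `best_k` lets a user see how much each candidate differs, instead of receiving a bare recommendation.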

The execution unit 500 can analyze the dataset collected by the collection unit 100 and recommend an analysis model functional block or the functional block for the next step, and it can recommend a template made up of several recommended functional blocks. For example, based on analysis of the independent variables, the execution unit 500 can recommend a regression analysis functional block for a univariate analysis and a cluster analysis functional block for a multivariate analysis. Alternatively, based on the type of the dependent variable, the execution unit 500 can recommend chi-square test and logistic regression functional blocks for a categorical variable, and Pearson correlation analysis and linear regression functional blocks for a continuous variable.
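
The recommendation rules in this paragraph can be written down as a small lookup. The sketch below encodes only the mappings stated in the text. The way a variable's type is inferred (all-numeric values count as continuous) and the function names are assumptions added for the example.

```python
def infer_variable_type(values):
    """Assumed heuristic: all-numeric columns are continuous, others categorical."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    return "continuous" if numeric else "categorical"

def recommend_by_dependent(values):
    """Recommend analysis blocks from the type of the dependent variable."""
    if infer_variable_type(values) == "categorical":
        return ["chi-square test", "logistic regression"]
    return ["Pearson correlation analysis", "linear regression"]

def recommend_by_independent(n_independent):
    """Recommend an analysis block from the number of independent variables."""
    if n_independent < 1:
        raise ValueError("at least one independent variable is required")
    return ["regression analysis"] if n_independent == 1 else ["cluster analysis"]

# Hypothetical dependent-variable columns.
churn = ["yes", "no", "no", "yes"]
price = [10.5, 12.0, 9.8, 11.1]

for_churn = recommend_by_dependent(churn)
for_price = recommend_by_dependent(price)
```

Keeping the rules in plain functions like these makes it easy to see which property of the dataset triggered each recommended block.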

When the data is time-series data of the date type, the execution unit 500 can recommend a histogram visualization functional block or a line graph visualization functional block for checking the amount of change.
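
A rule of this kind needs some way to recognise a date column. The sketch below assumes, purely for illustration, that dates arrive as ISO 8601 strings such as `2021-03-10` and uses `datetime.date.fromisoformat` to test them. The source does not say which date formats the apparatus accepts, and the function names are hypothetical.

```python
from datetime import date

def is_date_column(values):
    """True when the column is non-empty and every value is an ISO date string."""
    if not values:
        return False
    try:
        for v in values:
            date.fromisoformat(v)
    except (TypeError, ValueError):
        return False
    return True

def recommend_visualization(values):
    """Recommend change-over-time charts for date-typed columns only."""
    if is_date_column(values):
        return ["histogram", "line graph"]
    return []

# Hypothetical columns.
dates = ["2020-12-22", "2021-03-10", "2022-06-29"]
names = ["north", "south", "east"]

for_dates = recommend_visualization(dates)
for_names = recommend_visualization(names)
```

An empty list for non-date columns signals that this particular rule has nothing to recommend, leaving room for other rules to apply.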

The execution unit 500 can create a desired workflow by selecting functional blocks, or use a provided template. The execution unit 500 can also save a created workflow so that it can be reused for analyzing the same or similar data and consulted when creating a workflow or selecting functional blocks.

Referring to FIG. 9, the execution unit 500 includes an error correction unit 510, a performance improvement unit 520, a block recommendation unit 530, a model recommendation unit 540, and a workflow generation unit 550.

The error correction unit 510 determines whether the preprocessed data is suitable for the selected analysis model. For example, an error occurs when the preprocessed data has a different data type from the data used for the learning model, contains outliers beyond the range of learned values, or contains unhandled missing values. For an error that has occurred, the error correction unit 510 can recommend a matching normal value or normal-category value, or recommend a preprocessing functional block.

Referring to FIG. 10, the error correction unit 510 uses the metadata of the input values, the metadata of the block currently being executed, and an overall database containing the input-value metadata and block metadata to be used for building the model.

The input-value metadata fields used by the error correction unit 510 may include the input data size, the number of input fields, the data types, and the key features identified for each dataset by the unit's own analysis.

The block metadata fields used by the error correction unit 510 may include the block ID, the types and values of the parameters that were entered, and the types and values of the parameters in which an error occurred.

When an error occurs while a functional block is running, the error correction unit 510 collects the functional block metadata and the error that occurred.
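
The metadata fields listed above, and the step of collecting them when a block fails, can be sketched as plain records. The class names, field names, and the `run_block` helper below are hypothetical. They follow the fields named in the text but are not the apparatus's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InputMetadata:
    data_size: int                 # size of the input data
    field_count: int               # number of input fields
    data_types: dict               # column name -> type name
    key_features: dict = field(default_factory=dict)  # self-analyzed features

@dataclass
class BlockMetadata:
    block_id: str
    parameters: dict               # types and values of entered parameters
    # Parameters in which an error occurred; identifying them is
    # left out of this sketch, so the field stays empty here.
    failed_parameters: dict = field(default_factory=dict)

@dataclass
class ErrorRecord:
    input_meta: InputMetadata
    block_meta: BlockMetadata
    message: str

def run_block(block_id, func, params, input_meta, error_log):
    """Run one block; on failure, record its metadata and the error."""
    try:
        return func(**params)
    except Exception as exc:
        block_meta = BlockMetadata(block_id, dict(params))
        error_log.append(
            ErrorRecord(input_meta, block_meta, f"{type(exc).__name__}: {exc}"))
        return None

def scale(values, factor):
    """A hypothetical preprocessing block."""
    return [v * factor for v in values]

error_log = []
meta = InputMetadata(data_size=2, field_count=1, data_types={"x": "int"})
good = run_block("scale-1", scale, {"values": [1, 2], "factor": 3}, meta, error_log)
bad = run_block("scale-2", scale, {"values": [1, 2], "factor": None}, meta, error_log)
```

After the two calls, `error_log` holds a single record for the failed block, pairing the input metadata with the parameters that were in effect when the error was raised.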

The error correction unit 510 corrects an error by recommending a replacement value when an error occurs, for example when the input or preprocessed data differs from the data used for the learning model or exceeds the range of learned values. For example, when the criteria for handling missing values are uncertain and the user fills in missing values at their own discretion, the error correction unit 510 can determine whether the user-entered values are problematic, provide guidelines on an appropriate value range, and recommend replacement values for the missing values.

Referring to FIG. 11, the error correction unit 510 includes a determination unit 5101, a learning unit 5102, and a recommendation unit 5103.

판단부(5101)는 기능 블록 메타데이터와 발생 오류 및 파라미터의 종류와 값 등의 정보를 종합하여 오류가 발생한 지점을 분석 추적할 수 있다. The determination unit 5101 may analyze and track the point where the error occurred by synthesizing the function block metadata and information such as the error occurrence and the type and value of the parameter.

The determination unit 5101 determines whether the input data is suitable for the selected analysis model. The determination unit 5101 may also determine whether the preprocessed data is suitable for the selected analysis model.

The determination unit 5101 determines that an error exists when the preprocessed data has a data type different from that of the data used to train the learning model, is an outlier beyond the range of the learned values, or contains unprocessed missing values. For example, the determination unit 5101 may determine whether the input data is normal or abnormal by using a classification algorithm such as SVM or random forest.
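The three error conditions named above (type mismatch, out-of-range outlier, unprocessed missing value) can be illustrated with a minimal rule-based sketch. This is not the classifier the description mentions (SVM or random forest); `LearnedProfile` and `classify_value` are hypothetical names introduced only for illustration, and a numeric column is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class LearnedProfile:
    """Hypothetical summary of the data the learning model was trained on."""
    dtypes: tuple          # accepted Python types, e.g. (int, float)
    min_value: float       # lower bound of the learned value range
    max_value: float       # upper bound of the learned value range


def classify_value(value, profile):
    """Return 'missing', 'type_mismatch', 'outlier', or 'ok' for one value."""
    # Unprocessed missing value: None or NaN.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, profile.dtypes):
        return "type_mismatch"
    # Value outside the range seen during training.
    if value < profile.min_value or value > profile.max_value:
        return "outlier"
    return "ok"


profile = LearnedProfile(dtypes=(int, float), min_value=0.0, max_value=100.0)
results = [classify_value(v, profile) for v in (None, "12", 150, 42.5)]
```

A trained classifier would replace the fixed rules, but the input (a value plus a profile of the training data) and the output (a normal/abnormal label) would have the same shape.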

The learning unit 5102 performs normal-process matching through a predictive learning model trained on normal and abnormal processes. More specifically, the learning unit 5102 learns the normal process and the error process of the selected analysis model and generates a predictive learning model that predicts a normal value or a normal-category value for the dataset. For example, the predictive learning model may be an xgboost model that learns and classifies normal and error processes.

In determining the normal category, the learning unit 5102 may use the type of the target variable, the distribution of the dataset, the ratio of missing values present, the replacement value entered for a missing value, and the analysis model performance and whether an error occurred when that replacement value was processed.

The learning unit 5102 determines whether a replacement value for a missing value, entered at the user's discretion, is a normal value or lies within the normal range. If the replacement value entered by the user does not fall within the normal range, the learning unit 5102 selects either a guideline for the normal range or a value within the normal range as the recommended replacement value and recommends it. When the user selects the recommended replacement value, the learning unit 5102 retrains the predictive learning model with that data to improve the accuracy with which the normal range of the data is determined.
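The check on a user-entered replacement value can be sketched as follows. The description relies on a trained predictive model to determine the normal range; as a stand-in, this sketch assumes the range is estimated with the interquartile-range (IQR) rule and that the recommended value is the median. The function names and the factor `k=1.5` are assumptions, not part of the source.

```python
import statistics


def normal_range(observed, k=1.5):
    """Estimate a normal range from observed values with the IQR rule."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def review_replacement(user_value, observed):
    """Accept the user's replacement value or recommend one inside the range."""
    low, high = normal_range(observed)
    if low <= user_value <= high:
        return {"accepted": True, "range": (low, high), "recommended": user_value}
    # Outside the normal range: return the guideline and a value inside it.
    return {
        "accepted": False,
        "range": (low, high),
        "recommended": statistics.median(observed),
    }


observed = [10, 12, 11, 13, 12, 11, 14, 12]
ok = review_replacement(12, observed)
rejected = review_replacement(999, observed)
```

If the user accepts the recommended value, that choice could be appended to the training data, which corresponds to the retraining step described above.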

The learning unit 5102 takes the data in which an error occurred as input and runs the predictive learning model to predict a normal value or a normal-category value for the error value.

The recommendation unit 5103 recommends a replacement value for the parameter at the point where the error occurred, or matches a normal execution process, according to an algorithm trained on normal execution processes. More specifically, the recommendation unit 5103 recommends the value predicted by the predictive learning model as the replacement for the error value. By running the predictive learning model generated by the learning unit 5102, the recommendation unit 5103 may automatically correct a missing value or an outlier to, or recommend as its replacement, a normal value, a normal range, or any one value within the normal range.

FIG. 12 shows an example in which an error value occurs while the big data analysis visualization apparatus according to an embodiment of the present invention runs a random forest analysis model.

Referring to the example of FIG. 12, the determination unit 5101 determines whether the input data is an outlier or a missing value. When the input data is an abnormal value or falls in an abnormal range, the learning unit 5102 uses the predictive learning model to predict a normal value or a normal-category value for the error value that occurred while the random forest analysis model was running. The recommendation unit 5103 recommends the value predicted by the predictive learning model as the replacement for the error value.

The big data analysis visualization apparatus 10 uses an analysis model trained on normal and error processes to detect an error, identify the point where the error occurred and its content, and predict a normal value or a normal category. For example, the determination unit 5101 may use a classification algorithm such as random forest or SVM to identify the error information. When an input value is an outlier or a missing value, the learning unit 5102 may predict its replacement value with an analysis model such as the xgboost algorithm.

Referring again to FIG. 9, the performance improvement unit 520 recommends optimal hyperparameter values for evaluating the analysis model and improving its performance. In big data analysis, the performance of an analysis model varies greatly with its hyperparameter settings, so hyperparameter optimization is one of the most important tasks.

Referring to FIG. 13, the performance improvement unit 520 recommends hyperparameters suited to the dataset and the analysis model. For example, the performance improvement unit 520 may recommend hyperparameters that can yield optimal performance. The performance improvement unit 520 may propose an improvement in analysis model performance by recommending hyperparameters with a hyperparameter cross-validation algorithm. The performance improvement unit 520 may also provide the performance evaluation value predicted for training with the recommended hyperparameter values.

Referring to FIG. 14, the performance improvement unit 520 includes an input unit 5201, an adjustment unit 5202, and a performance comparison unit 5203.

The input unit 5201 receives from the user the hyperparameters needed to build the analysis model, including the dataset training/validation ratio. Analysis models that require a training/validation ratio include, for example, random forest, regression analysis, decision tree, and logistic regression models. The hyperparameters that can be entered include the independent and dependent variable settings, the training/validation dataset ratio, and the number of trees. A random forest analysis model uses as hyperparameters the training data ratio, the validation data ratio for verifying model performance, the number of trees, the tree depth, the minimum number of data points in each leaf, and the minimum number of data points in a non-leaf node.

The adjustment unit 5202 recommends adjusted hyperparameter values for optimal performance of the selected analysis model. The adjustment unit 5202 analyzes the type of dataset and the selected analysis model to recommend the optimal adjustment value. Using a hyperparameter prediction algorithm, the adjustment unit 5202 predicts hyperparameter values suited to the dataset and the analysis model without training the analysis model under directly adjusted hyperparameters. For example, the adjustment unit 5202 adjusts the data ratio according to whether the model is underperforming or overfitting, and recommends adjusted hyperparameter values that can yield optimal performance. The hyperparameter prediction algorithm analyzes and learns from input datasets, the analysis models run on them, and the results obtained, and predicts the hyperparameters that can yield optimal performance.

Conventional hyperparameter search algorithms such as grid search and random search are computationally expensive and therefore consume a great deal of time and cost. That is, after entering the initial hyperparameters, the user checks the model performance metrics to obtain the performance, and then, to improve it, repeatedly adjusts the ratio between the validation dataset and the actual dataset and retrains to find an adjusted value. Because the hyperparameter values must be adjusted and the model retrained until the desired performance is reached, raising the performance of an analysis model costs time and money. The big data analysis visualization apparatus 10 that corrects error values, by contrast, learns from the types of existing datasets, the selected analysis models, and their performance results to predict and recommend hyperparameter values for improving performance, which reduces time and cost.

When the performance is lower than that of the previous model, the adjustment unit 5202 repeats the recommendation with a different adjustment value; when it is higher, the adjustment unit 5202 stops recommending. The adjustment unit 5202 also stops recommending once the performance reaches a preset threshold.
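The stopping rule above can be sketched as a short loop. `recommend_until_better`, `candidates`, and `evaluate` are hypothetical names; in the described apparatus the score would come from the hyperparameter prediction algorithm rather than from a lookup table as in this example.

```python
def recommend_until_better(candidates, evaluate, previous_score, threshold):
    """Try recommended hyperparameter sets in order.

    Stops as soon as a candidate beats the previous model or reaches the
    preset threshold; otherwise moves on to the next recommendation.
    """
    for params in candidates:
        score = evaluate(params)
        if score >= threshold or score > previous_score:
            return params, score
    # No candidate improved on the previous model.
    return None, previous_score


# Hypothetical predicted scores keyed by number of trees.
predicted = {50: 0.70, 100: 0.78, 200: 0.85}
candidates = [{"n_trees": 50}, {"n_trees": 100}, {"n_trees": 200}]
best, score = recommend_until_better(
    candidates,
    evaluate=lambda p: predicted[p["n_trees"]],
    previous_score=0.75,
    threshold=0.90,
)
```

In this run the first candidate scores below the previous model, so the loop continues; the second candidate scores higher, so the recommendation stops there.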

The performance comparison unit 5203 may provide the performance evaluation value predicted for training with the recommended hyperparameter values.

The performance comparison unit 5203 visualizes and provides both the performance obtained with the hyperparameters entered by the user and the performance obtained with the adjusted hyperparameter values for optimal performance. For example, the performance comparison unit 5203 may present the result of applying the user-entered hyperparameter values to the analysis model as visual material such as an out-of-bag (OOB) error graph, a correlation matrix graph, and validation metrics. To present the recommended values visually, the performance comparison unit 5203 displays the user-entered values together with the recommended adjustment values and then shows the before-and-after change in indicators such as the out-of-bag error graph, so that the changes in performance and result metrics are visible.

Referring again to FIG. 9, the block recommendation unit 530 recommends the next function block by analyzing the dataset and the current function block stage.

By recommending the function block for the next stage as the big data analysis proceeds, the block recommendation unit 530 may improve the efficiency of the analysis process and the accuracy of the analysis.

Referring to FIG. 15, the block recommendation unit 530 analyzes the input dataset and the function blocks executed so far, and recommends the function block for the next stage by referring to workflows or templates with high similarity. For example, based on an analysis of the independent variables, the block recommendation unit 530 may recommend a regression analysis function block for a univariate analysis and a cluster analysis function block for a multivariate analysis. Alternatively, by determining the type of the dependent variable, the block recommendation unit 530 may recommend chi-square test and logistic regression function blocks for a categorical variable, and Pearson correlation and linear regression function blocks for a continuous variable.
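The two example rules above can be written as a small lookup. The block identifiers and the function name `recommend_analysis_blocks` are hypothetical; the description presents the two rules as alternatives, and they are combined here only for brevity.

```python
def recommend_analysis_blocks(n_independent_vars, dependent_kind=None):
    """Map variable characteristics to candidate analysis function blocks."""
    blocks = []
    # Rule 1: univariate -> regression analysis, multivariate -> cluster analysis.
    blocks.append("regression_analysis" if n_independent_vars == 1
                  else "cluster_analysis")
    # Rule 2: dependent-variable type decides the statistical blocks.
    if dependent_kind == "categorical":
        blocks += ["chi_square_test", "logistic_regression"]
    elif dependent_kind == "continuous":
        blocks += ["pearson_correlation", "linear_regression"]
    return blocks


suggested = recommend_analysis_blocks(3, "categorical")
```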

When the data is time-series data of the date type, the block recommendation unit 530 may recommend a histogram visualization function block or a line graph visualization function block for checking the amount of change.

For its analysis, the block recommendation unit 530 may use the workflow template big data, which stores workflow information from created workflows and from provided templates.

Referring to FIG. 16, the block recommendation unit 530 includes a block analysis unit 5301, a dataset analysis unit 5302, a generation unit 5303, a clustering unit 5304, a similarity analysis unit 5305, and a block step recommendation unit 5306.

The block recommendation unit 530 may receive a request to recommend the function block for the next stage, or may itself recognize a situation in which a function block recommendation is needed.

The block analysis unit 5301 runs a block analysis algorithm that can classify the stage of the current function block in detail. For example, the block analysis unit 5301 extracts the metadata of the current function block to identify the current stage of progress. The function block metadata includes basic information about the function block, such as the function block ID, the parameter list, and the parameter input values, and may further include a list of the function blocks used so far, for checking whether a function block has already been used.

The dataset analysis unit 5302 runs a dataset analysis algorithm for distinguishing the input dataset. For example, the dataset analysis unit 5302 extracts the metadata of the input dataset. The dataset metadata includes the data type of the dataset, the data size, the number of features, the presence and ratio of missing values, the presence and ratio of outliers, and the presence and ratio of duplicate data.
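A minimal sketch of extracting such dataset metadata is shown below, assuming the dataset is a list of dictionaries with hashable values and that `None` marks a missing value. The function name and the output keys are hypothetical, and outlier detection is omitted to keep the example short.

```python
def dataset_metadata(rows):
    """Summarize a list-of-dicts dataset into a metadata dictionary."""
    columns = sorted({key for row in rows for key in row})
    n_cells = len(rows) * len(columns)
    n_missing = sum(1 for row in rows for c in columns if row.get(c) is None)
    # Rows are compared as tuples of column values to find duplicates.
    unique_rows = {tuple(row.get(c) for c in columns) for row in rows}
    return {
        "n_rows": len(rows),
        "n_features": len(columns),
        "dtypes": {
            c: sorted({type(row[c]).__name__
                       for row in rows if row.get(c) is not None})
            for c in columns
        },
        "missing_ratio": n_missing / n_cells if n_cells else 0.0,
        "duplicate_ratio": 1 - len(unique_rows) / len(rows) if rows else 0.0,
    }


rows = [
    {"age": 30, "city": "Seoul"},
    {"age": None, "city": "Busan"},
    {"age": 30, "city": "Seoul"},
    {"age": 41, "city": None},
]
meta = dataset_metadata(rows)
```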

The generation unit 5303 combines the result of the block analysis unit 5301 with the result of the dataset analysis unit 5302 to generate detailed stage analysis metadata that can distinguish the detailed process. For example, it generates the detailed stage analysis metadata from the extracted metadata of the current function block and the metadata of the input dataset.

The clustering unit 5304 clusters the detailed stage analysis metadata with the workflow template big data on the basis of its feature values. For example, the clustering unit 5304 may use a clustering technique such as K-means, K-modes, or DBSCAN. The workflow template big data includes workflows that have been created and workflows of templates provided as analysis scenarios.

The similarity analysis unit 5305 analyzes the similarity between the detailed stage analysis metadata and the clustered workflows on the basis of the feature values, and extracts the top-ranked workflows with the highest similarity. For example, the similarity analysis unit 5305 may analyze similarity with a technique such as Euclidean distance, Manhattan distance, or Spearman correlation score. The similarity analysis unit 5305 may recommend the workflows ranked highest in similarity.
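The ranking step can be sketched with the two distance measures named above. It is assumed here that the detailed stage analysis metadata and each stored workflow have already been encoded as numeric feature vectors of equal length; the encoding, the workflow names, and `top_similar` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def top_similar(query, workflows, n=2, distance=euclidean):
    """Return the names of the n workflows closest to the query vector."""
    ranked = sorted(workflows.items(), key=lambda item: distance(query, item[1]))
    return [name for name, _ in ranked[:n]]


workflows = {
    "wf_a": [1.0, 0.0, 0.25],
    "wf_b": [0.0, 1.0, 0.9],
    "wf_c": [0.9, 0.1, 0.0],
}
closest = top_similar([1.0, 0.0, 0.2], workflows, n=2)
```

A smaller distance means a higher similarity, so the list is sorted in ascending order of distance.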

The block step recommendation unit 5306 may search the workflows ranked highest in similarity for the step of the current function block, and recommend the function block that follows it in those workflows.
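One way to read this step is as a vote over the successors of the current block in the top-ranked workflows. The sketch below assumes each workflow is an ordered list of block names; `recommend_next_block` and the block names are hypothetical.

```python
from collections import Counter


def recommend_next_block(current_block, workflows):
    """Rank the blocks that follow current_block in the given workflows."""
    votes = Counter()
    for steps in workflows:
        # Look at every position except the last, which has no successor.
        for i, step in enumerate(steps[:-1]):
            if step == current_block:
                votes[steps[i + 1]] += 1
    return [block for block, _ in votes.most_common()]


top_workflows = [
    ["load", "missing_value", "random_forest", "boxplot"],
    ["load", "missing_value", "pca", "random_forest"],
    ["load", "outlier", "missing_value", "random_forest"],
]
next_blocks = recommend_next_block("missing_value", top_workflows)
```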

Referring again to FIG. 9, the model recommendation unit 540 recommends the analysis model best suited to the input dataset and performs the analysis. For example, the model recommendation unit 540 may compute the suitability between the dataset and each analysis model as a score and recommend the analysis models whose scores meet or exceed a criterion.

Referring to FIG. 17, the model recommendation unit 540 analyzes the input dataset to recommend a suitable analysis model, and analyzes the dataset and the recommended analysis model to recommend a workflow or template. More specifically, the model recommendation unit 540 recommends an analysis model suited to the input dataset through an analysis model recommendation algorithm, recommending the analysis models with high trial evaluation scores. The model recommendation unit 540 uses the dataset metadata and the analysis model metadata to analyze similarity with the workflow template big data. The workflow template big data includes workflows that have been created and workflows of templates provided as analysis scenarios. By recommending templates that contain highly similar workflows, the model recommendation unit 540 may guide the user in selecting an analysis model.

Referring to FIG. 18, the model recommendation unit 540 includes a trial evaluation unit 5401, an analysis model recommendation unit 5402, a similarity analysis unit 5403, and a recommendation unit 5404.

The trial evaluation unit 5401 runs a model suitability evaluation algorithm that applies several kinds of analysis models to the input dataset on a trial basis and computes a score for the dataset and each analysis model. For example, the trial evaluation unit 5401 applies models such as random forest, correlation analysis, multilayer perceptron, naive Bayes, and k-means on a trial basis and computes an AUC score, from which the fit score is derived.

Based on the computed scores, the analysis model recommendation unit 5402 recommends the top n analysis models or the analysis models whose scores are at or above a preset threshold.
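Both selection rules (top n, or at or above a threshold) can be sketched in one function. The scores below are made-up illustration values, not results from the source, and `recommend_models` is a hypothetical name.

```python
def recommend_models(scores, top_n=None, threshold=None):
    """Select analysis models by fit score: top n and/or above a threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(model, s) for model, s in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [model for model, _ in ranked]


# Hypothetical fit scores (for example AUC) from the trial evaluation.
fit_scores = {"random_forest": 0.91, "naive_bayes": 0.78, "mlp": 0.86}
by_rank = recommend_models(fit_scores, top_n=2)
by_threshold = recommend_models(fit_scores, threshold=0.9)
```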

The similarity analysis unit 5403 analyzes similarity with the workflow template big data on the basis of the selected recommended analysis model and the dataset metadata. Here, the similarity analysis unit 5403 may select, from among the recommended analysis models, either the model with the highest fit score or the analysis model chosen by the user, and compute its similarity with the workflow template big data.

The similarity analysis unit 5403 may compute the similarity with the workflow template big data on the basis of the selected recommended analysis model and the dataset metadata.

The recommendation unit 5404 may recommend the top n workflows found by the similarity analysis unit 5403 to have the highest similarity.

The recommendation unit 5404 may also recommend templates that contain the top n workflows found by the similarity analysis unit 5403 to have the highest similarity.

Referring again to FIG. 9, the workflow generation unit 550 generates a workflow assembled from the recommended blocks.

Referring to FIG. 19, the workflow generation unit 550 analyzes the input data and the selected function blocks and places the recommended blocks at suitable positions and in a suitable order within the workflow. The workflow generation unit 550 may generate, store, and provide the workflow in which the recommended blocks are placed.

Depending on how far the workflow has been built, the workflow generation unit 550 may recommend and place individual function blocks to complete an unfinished workflow.

The workflow generation unit 550 extracts the top-ranked workflows with high similarity on the basis of the input dataset and the selected blocks and uses them to generate the workflow. The workflow generation unit 550 may analyze the selected blocks to generate a workflow based on the work actually carried out.

Referring to FIG. 20, the workflow generation unit 550 includes a data extraction unit 5501, a preprocessing recommendation unit 5502, an analysis model recommendation unit 5503, a visualization recommendation unit 5504, and a completion unit 5505.

The data extraction unit 5501 analyzes the dataset and the blocks by using the dataset analysis algorithm and the block analysis algorithm. More specifically, the data extraction unit 5501 extracts the dataset metadata of the input dataset through the dataset analysis algorithm, and extracts the block metadata of the blocks already selected through the block analysis algorithm.

The preprocessing recommendation unit 5502 analyzes which preprocessing is needed on the basis of the dataset metadata. For example, the preprocessing recommendation unit 5502 may recommend the preprocessing blocks the dataset needs: an outlier handling block when outliers are found, a missing value handling block when missing values are found, and a derived variable block or a PCA block when the number of columns is unnecessarily large.
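These example rules can be sketched as a function over the dataset metadata. The metadata keys, the block names, and the `max_columns` cutoff for "unnecessarily many columns" are assumptions made for illustration; the source does not state a specific cutoff.

```python
def recommend_preprocessing(meta, max_columns=50):
    """Map dataset metadata to the preprocessing blocks it appears to need."""
    blocks = []
    if meta.get("outlier_ratio", 0) > 0:
        blocks.append("outlier_handling")
    if meta.get("missing_ratio", 0) > 0:
        blocks.append("missing_value_handling")
    # Assumed cutoff: more columns than max_columns counts as too many.
    if meta.get("n_features", 0) > max_columns:
        blocks += ["derived_variable", "pca"]
    return blocks


needed = recommend_preprocessing(
    {"outlier_ratio": 0.02, "missing_ratio": 0.0, "n_features": 120}
)
```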

The analysis model recommendation unit 5503 recommends an analysis model by using the block metadata and the dataset metadata.

The visualization recommendation unit 5504 recommends visualization blocks through a visualization recommendation algorithm. For example, excluding visualizations that are already in use, the visualization recommendation unit 5504 may recommend a box plot or scatter plot visualization block when the preprocessing block in use is an outlier handling block, and a pie chart visualization block, which shows proportions, when the dataset is categorical.
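The visualization rules, including the exclusion of duplicates, can be sketched as follows. The block names and `recommend_visualizations` are hypothetical, and the actual visualization recommendation algorithm is not detailed in the source.

```python
def recommend_visualizations(preprocessing_block, dataset_kind, already_used=()):
    """Suggest visualization blocks, skipping ones already in the workflow."""
    candidates = []
    if preprocessing_block == "outlier_handling":
        candidates += ["boxplot", "scatter_plot"]
    if dataset_kind == "categorical":
        candidates.append("pie_chart")
    # Exclude duplicate visualizations.
    return [c for c in candidates if c not in already_used]


charts = recommend_visualizations(
    "outlier_handling", "categorical", already_used=["boxplot"]
)
```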

The completion unit 5505 places the recommended blocks and completes the workflow. Through a block placement algorithm, the completion unit 5505 extracts workflows with high similarity, places each recommended block, and generates the workflow. The block placement algorithm extracts the top n workflows most similar to the workflow in progress, identifies where the function blocks sit in them, and recommends each function block together with its position.

FIGS. 21 to 31 are diagrams for explaining a big data analysis visualization method according to an embodiment of the present invention. Each process described below is performed by one of the functional units of the big data analysis visualization apparatus, but for a concise and clear description the subject of every step is referred to collectively as the big data analysis visualization apparatus.

Referring to FIG. 21, in step S2101 the big data analysis visualization apparatus 10 collects the data to be analyzed. The apparatus may collect this data from sources in various formats. For example, it may collect data from uploaded Excel, text, or CSV files. It may also connect directly to a relational database to collect the data to be analyzed. The data may also be created on the spot by typing or pasting it in directly. In addition, the apparatus may collect data directly through a provided OpenAPI.

In step S2102, the big data analysis visualization apparatus 10 preprocesses the collected data to suit the desired analysis model.

In step S2103, the big data analysis visualization apparatus 10 analyzes the data with the analysis model. The apparatus may analyze the dataset and recommend an analysis model. It may also analyze the dataset and the analysis model to be used and recommend a preprocessing method.

In step S2104, the big data analysis visualization apparatus 10 visualizes the analyzed data. The apparatus may recommend visualization blocks suited to the analysis result.

FIG. 22 is an exemplary diagram for explaining how the block-recommending big data analysis visualization apparatus according to an embodiment of the present invention determines the error point and recommends a normal value when an error occurs.

Referring to FIG. 22, when an error occurs because of an input parameter while an analysis model function block is running, the big data analysis visualization apparatus 10 finds the exact error point and corrects the value so that the analysis can proceed without error.

In step S2201, the big data analysis visualization apparatus 10 collects the metadata of the function block and information about the error that occurred.

단계 S2202에서 빅데이터 분석 시각화 장치(10)는 발생 오류의 정보를 분석하여 오류 발생 지점을 판별한다.In step S2202, the big data analysis visualization apparatus 10 analyzes the information on the error occurrence to determine the error occurrence point.

단계 S2203에서 빅데이터 분석 시각화 장치(10)는 블록의 메타데이터, 발생 오류, 파라미터의 종류 및 값 등을 분석하여 데이터 셋과 분석 모델의 매칭이 적합한지 판단한다. 빅데이터 분석 시각화 장치(10)는 데이터 셋과 분석 모델이 매칭되지 않으면 데이터 셋에 적합한 분석 모델을 추천할 수 있다. 자세히 설명하면, 빅데이터 분석 시각화 장치(10)는 정상 분석 모델 수행 과정의 정상 값을 추출하여 선택한 분석 모델과 매칭되지 않으면 정상 값을 추천한다.In step S2203, the big data analysis and visualization apparatus 10 analyzes the metadata of the block, the error occurrence, the type and value of the parameter, and the like, and determines whether the data set and the analysis model are suitable for matching. The big data analysis visualization apparatus 10 may recommend an analysis model suitable for the data set when the data set and the analysis model do not match. In detail, the big data analysis visualization apparatus 10 extracts a normal value of the normal analysis model performing process and recommends a normal value if it does not match the selected analysis model.

단계 S2204에서 빅데이터 분석 시각화 장치(10)는 데이터에서 이상 치 또는 결측 치 여부를 판단한다. 빅데이터 분석 시각화 장치(10)는 입력된 데이터의 이상 치 또는 결측 치로 인한 오류인 경우 정상 값 또는 정상 범위를 추천하고, 자동 적용할 수 있다.In step S2204, the big data analysis visualization apparatus 10 determines whether an outlier or a missing value in the data. The big data analysis and visualization apparatus 10 may recommend and automatically apply a normal value or a normal range in case of an error due to an outlier or a missing value of the input data.

빅데이터 분석 시각화 장치(10)는 오류가 난 지점을 자가 진단하고, 예측한 정상 값으로 대체하여 오류없이 분석을 계속 수행할 수 있다.The big data analysis visualization apparatus 10 may self-diagnose an error point and replace it with a predicted normal value to continue analysis without error.
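The outlier and missing-value handling of step S2204 can be illustrated with a small sketch. The description does not say how the normal range is derived, so the interquartile-range fences and the median fill used here are assumptions chosen for simplicity, and the function names are hypothetical.

```python
import statistics


def normal_range(values, k=1.5):
    """Return (low, high) IQR fences computed from the non-missing values."""
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def repair(values, k=1.5):
    """Replace missing values (None) and outliers with the median of in-range values."""
    low, high = normal_range(values, k)
    in_range = [v for v in values if v is not None and low <= v <= high]
    fill = statistics.median(in_range)
    return [v if v is not None and low <= v <= high else fill for v in values]
```

With very small samples the quartiles are dominated by the outlier itself, so a sketch like this is only meaningful once a column has a reasonable number of observations.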

FIG. 23 is an example in which the big data analysis visualization apparatus according to an embodiment of the present invention identifies and corrects an error.

Referring to FIG. 23, in step S2301 the big data analysis visualization apparatus 10 receives from the user the data to be analyzed.

In step S2302, the big data analysis visualization apparatus 10 creates a data analysis workflow using any one of a workflow newly built by dragging and dropping blocks, a provided workflow template, or a saved workflow.

In step S2303, when an error occurs, the big data analysis visualization apparatus 10 refers to the information in an error DB to determine the error type and identify the error point. The big data analysis visualization apparatus 10 uses an analysis model trained on both normal runs and error runs to identify where the error occurred and what it is. For example, the big data analysis visualization apparatus 10 may use a classification algorithm such as random forest or SVM to identify the error information.

In step S2304, the big data analysis visualization apparatus 10 uses a predictive learning model to predict a normal value or normal range for the erroneous value. If the input value is an outlier or a missing value, the big data analysis visualization apparatus 10 may recommend a replacement for the erroneous value through an analysis model such as the xgboost algorithm. The replacement may be a normal value or normal category with which the big data analysis runs normally without error.

In step S2305, the big data analysis visualization apparatus 10 recommends the predicted value, that is, the output of the predictive learning model, to the user. Alternatively, the big data analysis visualization apparatus 10 may automatically apply the normal value or normal category and continue the analysis.

In step S2306, the big data analysis visualization apparatus 10 applies the recommended value and performs the data analysis.

FIG. 24 is a diagram for explaining how the big data analysis visualization apparatus according to an embodiment of the present invention recommends a replacement value when an error occurs.

The error correction unit 510 corrects an error by recommending a replacement value when, for example, the input or preprocessed data differs from the data used to train the learning model or falls outside the range of learned values.

Referring to FIG. 24, in step S2401 the error correction unit 510 builds a learning model trained on the normal values of workflows that have been run. For example, the error correction unit 510 learns the normal runs and error runs of the selected analysis model and generates a predictive learning model that predicts a normal value or normal category value for the dataset.

In step S2402, when an error occurs in the input dataset, the error correction unit 510 collects information on the erroneous value and determines the error point.

In step S2403, the error correction unit 510 runs the predictive learning model built in step S2401 to predict a normal value or normal category value for the erroneous value.

In step S2404, the error correction unit 510 recommends the predicted value as the normal value or enters it automatically.

In step S2405, once the recommended value has been applied, the error correction unit 510 adds it to the predictive learning model as a normal value or normal range value.

In step S2406, the error correction unit 510 helps the user fix the error by correcting the erroneous value with reference to the recommended normal value or normal range value.
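The feedback loop of steps S2401 to S2405 can be sketched with a deliberately simple stand-in for the predictive learning model. The class below is hypothetical: it only remembers the range of values seen in error-free runs, whereas the description leaves the actual model open.

```python
class NormalRangeModel:
    """Toy stand-in for the predictive learning model of steps S2401 to S2405."""

    def __init__(self, normal_values):
        # S2401: learn from values seen in workflows that ran without error.
        self.values = list(normal_values)

    def bounds(self):
        return min(self.values), max(self.values)

    def is_error(self, value):
        # S2402: flag a value that is missing or outside the learned range.
        low, high = self.bounds()
        return value is None or not (low <= value <= high)

    def recommend(self, value):
        # S2403 to S2404: predict a replacement. Here: the mean for a missing
        # value, otherwise the nearest bound of the learned range.
        low, high = self.bounds()
        if value is None:
            return sum(self.values) / len(self.values)
        return min(max(value, low), high)

    def accept(self, value):
        # S2405: an applied recommendation is added back as a normal value.
        self.values.append(value)
```

A caller would check `is_error`, show or auto-apply `recommend`, and call `accept` once the value has been applied, so that later recommendations reflect it.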

FIG. 25 is a diagram for explaining how the big data analysis visualization apparatus according to an embodiment of the present invention recommends and adjusts hyperparameters that yield the best performance of an analysis model.

Referring to FIG. 25, the performance enhancing unit 520 recommends hyperparameters suited to the dataset and the analysis model. For example, the performance enhancing unit 520 may recommend and adjust hyperparameters that yield optimal performance.

In step S2501, the big data analysis visualization apparatus 10 receives the dataset to be analyzed.

In step S2502, the big data analysis visualization apparatus 10 selects an analysis model that requires hyperparameters such as the training/validation ratio. Examples of analysis models that require hyperparameters include random forest, regression analysis, decision tree, and logistic regression models, and the hyperparameters that can be entered include the independent and dependent variable settings, the training/validation dataset ratio, the number of trees, and the like.

In step S2503, the big data analysis visualization apparatus 10 runs the analysis model with the hyperparameters entered by the user.

In step S2504, the big data analysis visualization apparatus 10 presents the resulting performance and validation metrics as visualizations such as graphs or charts, and determines whether the performance of the analysis model needs to be improved. For example, the big data analysis visualization apparatus 10 may present, as visual material such as an out-of-bag (OOB) error graph, a correlation matrix graph, and validation metrics, the performance obtained when the user-entered hyperparameter values are applied to the analysis model.

If the performance of the analysis model needs to be improved, in step S2505 the big data analysis visualization apparatus 10 recommends hyperparameters for optimal performance of the selected analysis model. Using a hyperparameter prediction algorithm, the big data analysis visualization apparatus 10 predicts hyperparameter values suited to the dataset and the analysis model without training the analysis model through direct hyperparameter adjustment. The hyperparameter prediction algorithm analyzes and learns from the input dataset, the analysis model that was run, and its results in order to predict hyperparameters that yield optimal performance. The big data analysis visualization apparatus 10 recommends hyperparameter adjustment values that can yield optimal performance, adjusting the data ratio according to whether the model is underperforming or overfitting.

In step S2506, the big data analysis visualization apparatus 10 compares the performance results and validation metrics of the analysis model with the recommended adjustment values applied against those of the analysis model with the user-entered hyperparameters applied, and visualizes the change.

In step S2507, if the performance of the analysis model is judged optimal or the target performance has been reached, the big data analysis visualization apparatus 10 finalizes the analysis model and hyperparameters and stops the hyperparameter adjustment.
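The control flow of steps S2503 to S2507 can be sketched as a loop around two caller-supplied functions. Both `evaluate` and `recommend` are hypothetical placeholders for the analysis model run and the hyperparameter prediction algorithm, which the description does not specify; the stopping rule and `max_rounds` cap are likewise assumptions.

```python
def tune(evaluate, user_params, recommend, target, max_rounds=5):
    """Apply recommended hyperparameters until the target score is reached.

    evaluate(params) -> score and recommend(params, score) -> params are
    hypothetical callables supplied by the caller.
    """
    params = dict(user_params)
    score = evaluate(params)                    # S2503: run with user input
    history = [(params, score)]                 # kept for comparison (S2506)
    rounds = 0
    while score < target and rounds < max_rounds:   # S2504 / S2507
        candidate = recommend(params, score)         # S2505
        candidate_score = evaluate(candidate)
        history.append((candidate, candidate_score))
        if candidate_score > score:
            # Keep the recommendation only if it improves the score.
            params, score = candidate, candidate_score
        rounds += 1
    return params, score, history
```

The returned `history` holds every (parameters, score) pair, which is the material a before/after comparison view would plot.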

FIG. 26 is a diagram for explaining how the big data analysis visualization apparatus according to an embodiment of the present invention recommends the functional block for the next step.

Referring to FIG. 26, the block recommendation unit 530 recommends the next functional block by analyzing the dataset and the current block stage.

In step S2601, the big data analysis visualization apparatus 10 receives a request to recommend the next functional block while a data analysis workflow is being carried out.

In step S2602, the big data analysis visualization apparatus 10 runs a block analysis algorithm that classifies the stage of the selected functional block, and thereby analyzes the stage of the current functional block.

In step S2603, the big data analysis visualization apparatus 10 analyzes the input dataset using a dataset analysis algorithm.

In step S2604, the big data analysis visualization apparatus 10 combines the results of the block analysis algorithm and the dataset analysis algorithm to generate detailed stage analysis metadata from which the detailed process can be distinguished.

In step S2605, the big data analysis visualization apparatus 10 performs a similarity analysis between the detailed stage analysis metadata and existing workflow analysis data.

In step S2606, the big data analysis visualization apparatus 10 recommends the functional blocks used in the workflows ranked highest in the similarity analysis. In other words, the big data analysis visualization apparatus 10 extracts workflow analysis data that is highly similar to the detailed stage analysis metadata and recommends functional blocks from that workflow.

FIG. 27 is an exemplary screen in which the big data analysis visualization apparatus according to an embodiment of the present invention recommends the functional block for the next step.

Referring to FIG. 27, in step S2701 the big data analysis visualization apparatus 10 receives a request to recommend the next functional block while a data analysis workflow is being carried out.

In step S2702, the big data analysis visualization apparatus 10 runs a block analysis algorithm that classifies the stage of the selected functional block and extracts the metadata of the current functional block. For example, the block metadata includes basic block information such as the block ID, parameter IDs, parameter input values, and parameter list, as well as a usage history list of blocks already used.

In step S2703, the big data analysis visualization apparatus 10 extracts the metadata of the input dataset using a dataset analysis algorithm. For example, the dataset metadata includes the data type, the data size, the number of features, the presence and ratio of missing values, the presence and ratio of outliers, and the presence and ratio of duplicate data.
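A dataset analysis step of the kind described in S2703 might compute metadata along the following lines. This is a sketch under assumptions: rows are plain dicts, `None` marks a missing cell, the key names are hypothetical, and the outlier ratio is omitted because it depends on a choice of outlier rule that the description does not give.

```python
def dataset_metadata(rows):
    """Summarize a list of dict rows; None marks a missing cell."""
    n_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []
    cells = [row.get(c) for row in rows for c in columns]
    missing = sum(1 for v in cells if v is None)

    # Count rows that exactly repeat an earlier row.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)

    column_types = {
        c: sorted({type(r.get(c)).__name__ for r in rows if r.get(c) is not None})
        for c in columns
    }
    return {
        "n_rows": n_rows,
        "n_features": len(columns),
        "column_types": column_types,
        "missing_ratio": missing / len(cells) if cells else 0.0,
        "duplicate_ratio": duplicates / n_rows if n_rows else 0.0,
    }
```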

In step S2704, the big data analysis visualization apparatus 10 generates detailed stage analysis metadata from the block metadata and the dataset metadata.

In step S2705, the big data analysis visualization apparatus 10 runs a clustering model over the generated workflows or the workflows of provided templates, using the feature values of the detailed stage analysis metadata. The clustering model may be, for example, k-means, k-modes, or DBSCAN (density-based) clustering.

In step S2706, through similarity analysis, the big data analysis visualization apparatus 10 extracts from the clustered workflows the top-ranked workflows that are similar to the feature values of the detailed stage analysis metadata. For example, the big data analysis visualization apparatus 10 may analyze similarity in terms of the dataset pattern, the designated dependent variable, the functional blocks used, and the connections between functional blocks. The similarity analysis may use techniques such as Euclidean distance, Manhattan distance, or Spearman correlation.

In step S2707, the big data analysis visualization apparatus 10 recommends the functional block used at the corresponding step in the workflows ranked highest in the similarity analysis.
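The ranking in steps S2706 and S2707 can be sketched with the two distance measures named above. The assumption here is that each stored workflow has already been reduced to a numeric feature vector paired with the block it used next; that data layout and the function names are hypothetical, and the clustering of step S2705 is skipped.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def recommend_next_blocks(query, workflows, distance=euclidean, top_n=2):
    """Rank stored workflows by distance to the query vector.

    workflows: list of (feature_vector, next_block_name) pairs.
    Returns the next blocks of the top_n closest workflows, without repeats.
    """
    ranked = sorted(workflows, key=lambda w: distance(query, w[0]))
    blocks = []
    for _, block in ranked[:top_n]:
        if block not in blocks:
            blocks.append(block)
    return blocks
```

Passing `distance=manhattan` switches the measure without touching the ranking logic.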

FIG. 28 is a diagram for explaining how the big data analysis visualization apparatus according to an embodiment of the present invention recommends an analysis model.

Referring to FIG. 28, the model recommendation unit 540 recommends the analysis model best suited to the input dataset and performs the analysis.

In step S2801, the big data analysis visualization apparatus 10 receives the dataset to be analyzed, and the value to be predicted (Y value) may be specified when the target variable is designated.

In step S2802, the big data analysis visualization apparatus 10 runs a suitable-model evaluation algorithm that trial-applies multiple kinds of analysis models to the input dataset and calculates a fit score between the dataset and each analysis model.

In step S2803, the big data analysis visualization apparatus 10 recommends the analysis models whose fit scores, as calculated by the suitable-model evaluation algorithm, are among the top n, or applies the analysis model with the highest fit score to the workflow.

In step S2804, the big data analysis visualization apparatus 10 performs a similarity analysis using the dataset metadata and the block metadata of the recommended analysis model. More specifically, the big data analysis visualization apparatus 10 analyzes the block metadata of the recommended analysis model with the block analysis algorithm, analyzes the input dataset metadata with the dataset analysis algorithm, and combines the results to generate detailed stage analysis metadata from which the detailed process can be distinguished. The big data analysis visualization apparatus 10 may then perform a similarity analysis between the detailed stage analysis metadata and existing workflow analysis data.

In step S2805, the big data analysis visualization apparatus 10 recommends a template based on workflows with high similarity.

FIG. 29 is an exemplary screen in which the big data analysis visualization apparatus according to an embodiment of the present invention recommends an analysis model.

Referring to FIG. 29, in step S2901 the big data analysis visualization apparatus 10 receives a dataset.

In step S2902, the big data analysis visualization apparatus 10 runs a suitable-model recommendation algorithm that trial-applies the input dataset to analysis models and calculates fit scores. For example, the big data analysis visualization apparatus 10 trial-applies random forest, correlation analysis, multilayer perceptron, naive Bayes, and k-means models to the input dataset, calculates AUC scores, and derives the fit scores.

In step S2903, the big data analysis visualization apparatus 10 recommends the analysis models whose calculated fit scores are among the top n.
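The AUC-based ranking of steps S2902 and S2903 can be illustrated as follows. The sketch assumes a binary target and that each candidate model has already produced a score per held-out sample; how the apparatus obtains those scores, and how it scores models for which AUC is not defined (such as k-means), is not stated in the description. The function names are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recommend_models(labels, candidate_scores, top_n=2):
    """candidate_scores: {model_name: predicted scores on a held-out sample}."""
    ranked = sorted(
        candidate_scores,
        key=lambda name: auc(labels, candidate_scores[name]),
        reverse=True,
    )
    return ranked[:top_n]
```

The pairwise form is quadratic in the sample size, which is acceptable for a small trial sample but would be replaced by a rank-based computation for large data.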

In step S2904, the big data analysis visualization apparatus 10 analyzes, with the block analysis algorithm, the block of the analysis model selected from among the recommended analysis models or of the analysis model with the highest AUC score, and analyzes the dataset with the dataset analysis algorithm.

In step S2905, the big data analysis visualization apparatus 10 combines the block metadata and the dataset metadata obtained as the results of the block analysis algorithm and the dataset analysis algorithm, and performs a similarity analysis against the workflows of the provided templates.

In step S2906, the big data analysis visualization apparatus 10 recommends templates containing the top n workflows by similarity.

FIG. 30 is a diagram for explaining how the big data analysis visualization apparatus according to an embodiment of the present invention generates a workflow from recommended blocks.

Referring to FIG. 30, the workflow generation unit 550 may generate a new workflow by placing the recommended blocks at suitable positions within the workflow. The workflow generation unit 550 may save the generated workflow as the user's workflow or as a workflow template.

In step S3001, the big data analysis visualization apparatus 10 receives a request to recommend an analysis process while a workflow is being built.

In step S3002, the big data analysis visualization apparatus 10 analyzes the input dataset through the dataset analysis algorithm and recommends preprocessing blocks.

In step S3003, the big data analysis visualization apparatus 10 analyzes the blocks of the workflow being built through the block analysis algorithm.

In step S3004, the big data analysis visualization apparatus 10 recommends analysis model blocks using the results of the block analysis algorithm and the dataset analysis algorithm.

In step S3005, the big data analysis visualization apparatus 10 recommends, through a visualization recommendation algorithm, visualization blocks suited to effective visualization.

In step S3006, the big data analysis visualization apparatus 10 generates a workflow by connecting the recommended blocks in suitable positions and order through a block placement algorithm.

FIG. 31 is an exemplary screen in which the big data analysis visualization apparatus according to an embodiment of the present invention generates a workflow from recommended blocks.

In step S3101, the big data analysis visualization apparatus 10 receives a request to recommend an analysis process between blocks. The big data analysis visualization apparatus 10 checks the input dataset and the selected functional block in order to recommend the analysis process.

In step S3102, the big data analysis visualization apparatus 10 extracts dataset metadata from the input dataset through the dataset analysis algorithm.

In step S3103, the big data analysis visualization apparatus 10 analyzes which preprocessing is needed based on the dataset metadata and recommends preprocessing blocks. For example, the big data analysis visualization apparatus 10 may recommend the preprocessing blocks that the dataset needs: an outlier handling block when outliers are found, a missing value handling block when missing values are found, and a derived variable block, a PCA block, or the like when there are unnecessarily many columns.
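The examples given for step S3103 amount to a small rule table over the dataset metadata, which can be sketched as below. The metadata keys, the block names, and the `max_columns` threshold are all hypothetical; the description does not say what counts as "unnecessarily many" columns.

```python
def recommend_preprocessing(meta, max_columns=50):
    """Map dataset metadata to preprocessing block names (names are illustrative)."""
    blocks = []
    if meta.get("outlier_ratio", 0) > 0:
        blocks.append("outlier_handling")
    if meta.get("missing_ratio", 0) > 0:
        blocks.append("missing_value_handling")
    if meta.get("n_features", 0) > max_columns:
        # Too many columns: suggest feature derivation and dimensionality reduction.
        blocks.extend(["derived_variable", "pca"])
    return blocks
```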

In step S3104, the big data analysis visualization apparatus 10 extracts the block metadata of the selected functional block through the block analysis algorithm.

In step S3105, the big data analysis visualization apparatus 10 recommends analysis model blocks using the block metadata and the dataset metadata.

In step S3106, the big data analysis visualization apparatus 10 recommends visualization blocks through a visualization recommendation algorithm. For example, excluding duplicate visualizations, the big data analysis visualization apparatus 10 may recommend a box plot or scatter plot visualization block if the preprocessing block in use is an outlier handling block, and a pie chart visualization block for checking proportions if the dataset is categorical.
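The two example rules of step S3106, together with the exclusion of duplicate visualizations, can be written as a short function. The block names and the `dataset_kind` label are hypothetical, and a real visualization recommendation algorithm would presumably hold many more rules.

```python
def recommend_visualizations(preprocessing_blocks, dataset_kind, already_used=()):
    """Suggest visualization blocks from the blocks in use and the dataset kind."""
    suggestions = []
    if "outlier_handling" in preprocessing_blocks:
        suggestions += ["box_plot", "scatter_plot"]
    if dataset_kind == "categorical":
        suggestions.append("pie_chart")
    # Exclude visualizations that are already present in the workflow.
    return [s for s in suggestions if s not in already_used]
```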

In step S3107, the big data analysis visualization apparatus 10 generates a workflow by arranging the recommended blocks in suitable positions and order through a block context analysis algorithm. For example, the big data analysis visualization apparatus 10 places a visualization block after its associated block, and places a preprocessing block between the data input and the target data analysis block.
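The two placement rules of step S3107 can be sketched as an ordering function. The block representation (dicts with `name`, `stage`, and an `after` key on visualization blocks) and the stage labels are assumptions made for the example, not the apparatus's data model.

```python
STAGE_ORDER = {"input": 0, "preprocess": 1, "analysis": 2}


def arrange(blocks):
    """Order blocks: input, preprocessing, analysis; visualizations follow their block.

    blocks: list of dicts with 'name' and 'stage'; visualization blocks
    also carry 'after', the name of the block they are associated with.
    """
    core = sorted(
        (b for b in blocks if b["stage"] != "visualize"),
        key=lambda b: STAGE_ORDER[b["stage"]],
    )
    ordered = []
    for block in core:
        ordered.append(block["name"])
        # Place each visualization directly after its associated block.
        ordered += [
            v["name"]
            for v in blocks
            if v["stage"] == "visualize" and v.get("after") == block["name"]
        ]
    return ordered
```

Because Python's sort is stable, blocks within the same stage keep the order in which they were recommended.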

FIGS. 32 to 36 are exemplary screens of the big data analysis visualization apparatus according to an embodiment of the present invention.

FIG. 32 is an exemplary screen in which the big data analysis visualization apparatus 10 collects data.

Referring to FIG. 32, the big data analysis visualization apparatus 10 can collect the data to be analyzed by connecting Excel, CSV, RDS, and TXT files, as well as databases, by drag and drop. The big data analysis visualization apparatus 10 can also collect data directly through a provided OpenAPI.

FIG. 33 is an exemplary screen in which the big data analysis visualization apparatus 10 performs data preprocessing.

Referring to FIG. 33, the big data analysis visualization apparatus 10 can change column information and inspect the data to be preprocessed.

FIG. 34 is an exemplary screen in which the big data analysis visualization apparatus 10 analyzes big data using an analysis model.

Referring to FIG. 34, the big data analysis visualization apparatus 10 specifies hyperparameters and target variable values, runs the analysis with a random forest analysis model, and can also check the performance metrics of the analysis model.

FIG. 35 is an exemplary screen in which the big data analysis visualization apparatus 10 performs visualization.

Referring to FIG. 35, the big data analysis visualization apparatus 10 can analyze big data and visualize the results through web page operations, without program coding.

FIG. 36 is an exemplary screen of the functional blocks used by the big data analysis visualization apparatus 10.

Referring to FIG. 36, the big data analysis visualization apparatus 10 allows functional blocks to be selected and moved by drag and drop or by clicking. The big data analysis visualization apparatus 10 renders functional blocks in a different color for each stage, and a functional block includes internal points or external points as required. The internal and external points of a functional block also change color according to their state, which makes them intuitive. The big data analysis visualization apparatus 10 can connect functional blocks to one another by linking their external points with a block pipeline.

상술한 빅데이터 분석 시각화 방법은 컴퓨터가 읽을 수 있는 매체 상에 컴퓨터가 읽을 수 있는 코드로 구현될 수 있다. 상기 컴퓨터로 읽을 수 있는 기록 매체는, 예를 들어 이동형 기록 매체(CD, DVD, 블루레이 디스크, USB 저장 장치, 이동식 하드 디스크)이거나, 고정식 기록 매체(ROM, RAM, 컴퓨터 구비형 하드 디스크)일 수 있다. 상기 컴퓨터로 읽을 수 있는 기록 매체에 기록된 상기 컴퓨터 프로그램은 인터넷 등의 네트워크를 통하여 다른 컴퓨팅 장치에 전송되어 상기 다른 컴퓨팅 장치에 설치될 수 있고, 이로써 상기 다른 컴퓨팅 장치에서 사용될 수 있다.The above-described big data analysis visualization method may be implemented as a computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disk, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, computer-equipped hard disk). can The computer program recorded on the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.

Although all components constituting the embodiments of the present invention have been described above as being combined into one or as operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated.

Although operations are shown in a particular order in the drawings, this should not be understood to mean that the operations must be performed in the particular order shown or in sequential order, or that all illustrated operations must be performed, to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of the various components in the embodiments described above should not be understood as requiring such separation, and it should be understood that the described program components and systems may generally be integrated together into a single software product or packaged into multiple software products.

The present invention has so far been described with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10: big data analysis visualization apparatus
100: collection unit
200: preprocessing unit
300: analysis unit
400: visualization unit
500: performing unit
510: error correction unit
520: performance enhancement unit
530: block recommendation unit
540: model recommendation unit
550: workflow generation unit
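As a rough illustration of how the collection unit 100, preprocessing unit 200, analysis unit 300, and visualization unit 400 could each be matched to a function block and run in sequence by the performing unit 500, consider the sketch below. Everything in it is assumed for the example: the function names, the sample rows, the choice of count and mean as the "analysis model", and the text bar chart standing in for a graph. The description does not specify any of these.

```python
import statistics
from typing import Any, Callable, Sequence

# A function block is modeled here as a callable taking the previous block's output.
Block = Callable[[Any], Any]


def collect(_: Any) -> list:
    # Hypothetical raw rows; a real collection block would read a file or database.
    return [" 3", "5", "", "8 ", "n/a", "4"]


def preprocess(rows: list) -> list:
    # Keep only rows that parse as numbers so the analysis step can use them.
    values = []
    for row in rows:
        try:
            values.append(float(row.strip()))
        except ValueError:
            continue
    return values


def analyze(values: list) -> dict:
    # Stand-in "analysis model": simple descriptive statistics.
    return {"count": len(values), "mean": statistics.mean(values)}


def visualize(summary: dict) -> str:
    # Stand-in for a graph: one text bar per statistic.
    return "\n".join(
        f"{name:<5} | {'#' * round(value)} {value}"
        for name, value in summary.items()
    )


def run_workflow(blocks: Sequence[Block], data: Any = None) -> Any:
    # Run the linked blocks in order, passing each output to the next block.
    for block in blocks:
        data = block(data)
    return data


if __name__ == "__main__":
    print(run_workflow([collect, preprocess, analyze, visualize]))
```

Modeling each block as a plain callable keeps the workflow an ordered list, so reordering or swapping blocks does not require changing `run_workflow`.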

Claims

1. A big data analysis visualization apparatus comprising:
a collection unit that collects data to be analyzed;
a preprocessing unit that preprocesses the data to fit an analysis model;
an analysis unit that analyzes the data with the analysis model;
a visualization unit that visualizes the result of running the analysis model as a suitable graph; and
a performing unit that matches the data collection, preprocessing, analysis, and visualization processes with function blocks.

2. The apparatus of claim 1,
wherein the preprocessing unit
generates the preprocessed data as a file and provides the file.

3. The apparatus of claim 1,
wherein the performing unit
uses a workflow for selecting, arranging, and linking the function blocks.

4. The apparatus of claim 1,
wherein the apparatus provides a dataset analysis scenario as a template.

5. A method by which a big data analysis visualization apparatus analyzes big data, the method comprising:
collecting data to be analyzed;
preprocessing the data to fit an analysis model;
analyzing the data using the analysis model;
visualizing the analyzed data; and
performing the data collection, preprocessing, analysis, and visualization steps by matching them with function blocks.

6. The method of claim 5,
wherein the step of preprocessing the data to fit the analysis model
generates the preprocessed data as a file and provides the file.

7. The method of claim 5,
wherein the method uses a workflow for selecting, arranging, and linking the function blocks.

8. The method of claim 5,
wherein the method provides a dataset analysis scenario as a template.

9. A computer program recorded on a computer-readable recording medium for executing the big data analysis visualization method of any one of claims 5 to 8.