KR20220069694A

KR20220069694A - Data collecting and processing apparatus and method for data integration analysis

Info

Publication number: KR20220069694A
Application number: KR1020200157088A
Authority: KR
Inventors: 박상우; 전성재; 김영지; 이승재; 김재만
Original assignee: (주)디지탈쉽
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-05-27
Also published as: KR102504531B1

Abstract

According to one embodiment of the present invention, a data collecting and processing apparatus comprises: a data collection unit that collects data from a plurality of data centers every preset first period; a pre-processing unit that refines the collected data to match a predetermined data set format; a de-identification unit that detects data corresponding to sensitive information in the pre-processed data and de-identifies it; and a data transmission unit that transmits the de-identified data to a big data platform every predetermined second period.

Description

DATA COLLECTING AND PROCESSING APPARATUS AND METHOD FOR DATA INTEGRATION ANALYSIS

The present invention relates to a data collection processing apparatus and method for merging data collected through various data centers and delivering it to a big data platform, and more particularly, to a data collection processing apparatus and method that pre-processes data received from each data center before transmitting it to a big data set building platform.

The content described below merely provides background information related to an embodiment of the present invention and does not constitute prior art.

Big data refers to data that includes not only the structured data used in existing corporate environments and public institutions, but also unstructured or semi-structured data that previously went unused, such as e-commerce data, metadata, web logs, radio-frequency identification (RFID) data, sensor network data, social network data, social data, Internet text and documents, and Internet search indexing. Such data is called big data because its volume is generally too large to handle with ordinary software tools and computer systems.

As demand for data use surges with the development of IT convergence technology, major countries are promoting various big data policies to revitalize the data industry. Recently, governments as well as companies have been trying to actively use collected big data, and statistical analysis based on it, for decision-making and policy-making, and big data processing technology for building data-centric computing environments is being actively researched.

In Korea as well, various kinds of public data exist as a result of government policy, including data related to real estate, intellectual property, and medical care. However, data collected through various data centers is inconsistent because of differences in classification formats, categorization, and response methods. For example, in gender data, some sources record the values as "남자, 여자" (male, female) while others record them as "M, F".

Because data with the same meaning is collected and managed under different representations, the quantity of readily usable data is very limited. A method of collecting high-quality data that can be managed in a uniform way is therefore needed.

In addition, conventional data collection systems collect only the source data, so data that still contains outliers, missing values, and invalid values is collected along with it, resulting in excessive communication load.

The growth of the big data market is also driving the growth of the data broker industry, which collects personal information and shares or distributes it to third parties. As a result, the growth of the big data and data broker industries is directly or indirectly linked to personal information leaks, large and small.

Accordingly, in order to process and distribute, for business purposes, the big data itself as material for statistical analysis rather than only the statistical results, methods of de-identifying personal attributes through masking, substitution, semi-identification, categorization, and the like are being applied in some quarters. However, the definition of the sensitive information to be de-identified is unclear and the scope of personal information varies, so a concrete design of the criteria, procedures, and methods of de-identification is required.

The problem to be solved by the present invention, conceived in view of the above, is to provide a system that refines original data collected from various data centers and manages it as high-quality data.

In addition, a data processing system is needed that strengthens personal information protection for large volumes of refined data through a concrete design for handling sensitive information.

Furthermore, in order to collect original data from each data center and obtain a refined, high-quality data set that can be used for artificial intelligence, a system is needed that provides an agent with which a data provider can process data in the desired way before transmitting it to the big data platform.

A system is also needed that collects highly usable data by raising its quality through standardization of the collected data.

According to an embodiment of the present invention, a data collection processing apparatus may include a data collection unit that collects data from a plurality of data centers every preset first period, a pre-processing unit that refines the collected data to fit a predetermined data set format, a de-identification unit that detects data corresponding to sensitive information among the pre-processed data and de-identifies it, and a data transmission unit that transmits the de-identified data to a big data platform every predetermined second period.

According to an embodiment, the data collection unit of the data collection processing apparatus may provide an interface through which a user can change the data collection path and the data collection period.

According to an embodiment, when the preset first period is reached, the data collection unit of the data collection processing apparatus may check whether a data file exists in the data collection path and, if one exists, collect that data file.

According to an embodiment, the data collection unit of the data collection processing apparatus may assign an identifier to each of the plurality of data centers and manage the data of each data center through separators for each classification item and each change item of the collected data.

According to an embodiment, the pre-processing unit of the data collection processing apparatus may standardize the collected data according to pre-registered standard codes, and the standardization may be performed by using a standard code conversion function to change the values of the corresponding column into standard code values.

According to an embodiment, the pre-processing unit of the data collection processing apparatus may refine the standardized data through at least one of numeric data normalization, data conversion, data cleansing, string removal, column deletion, and data obfuscation.

According to an embodiment, the de-identification unit of the data collection processing apparatus may extract the columns defined as personal information under a predetermined policy and de-identify them by at least one of masking, rearrangement, and categorization.

According to an embodiment, the data transmission unit of the data collection processing apparatus may transmit, using SFTP, a data csv file, a json file containing schema meta information, and an end file indicating that data transmission is complete.

According to an embodiment, a data collection processing method performed in the data collection processing apparatus may include collecting data from a plurality of data centers every preset first period, refining the collected data to fit a predetermined data set format, detecting data corresponding to sensitive information among the pre-processed data and de-identifying it, and transmitting the de-identified data to a big data platform every predetermined second period.

According to an embodiment of the present invention, original data collected from various data centers can be refined by the data collection processing apparatus before being transmitted to the big data set building platform, and can thus be managed as high-quality data.

According to an embodiment of the present invention, pre-processing, standardization, and de-identification are performed on the original data by the data collection processing apparatus, which makes it easier to combine data collected from diverse sources.

In addition, according to an embodiment of the present invention, an agent may be provided that processes data in the way the user wants, so that original data is collected through the data collection processing apparatus and a high-quality data set usable for artificial intelligence is obtained automatically.

FIG. 1 is a diagram illustrating a big data management system including a data collection processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of a data collection processing apparatus according to an embodiment of the present invention.
FIGS. 3 and 4 are diagrams illustrating the schedule management function for data collection and transmission of the data collection processing apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the data standardization function of the data collection processing apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the pre-processing function of the data collection processing apparatus according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the de-identification function of the data collection processing apparatus according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a data collection processing method using the data collection processing apparatus according to an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed herein are given only for the purpose of explaining those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Throughout the specification, when a part is said to "include" a certain element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise.

Since the embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail herein. However, this is not intended to limit the embodiments to the specific forms disclosed, and all modifications and substitutes falling within the spirit and technical scope of the present invention are included.

Terms such as first and second may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of rights according to the concept of the present invention, a first element may be named a second element, and similarly a second element may be named a first element.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

Hereinafter, embodiments of the data collection processing apparatus and method are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a big data management system including a data collection processing apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the big data management system may include a plurality of data centers (111, 112, 113), a data collection processing apparatus (120), and a big data platform (130). The data collection processing apparatus (120) may include an agent that performs pre-processing and de-identification of the data collected from the data centers by edge computing, before the data is delivered to the big data platform (130).

According to an embodiment, the plurality of data centers (111, 112, 113) may include any kind of database that collects data such as a person's name, gender, age, biometric information, resident registration number, passport number, alien registration number, driver's license number, financial information (account number, card number), e-mail address, and telephone number. For example, any kind of data may be included, such as resident data held by a community service center, resident data for a particular apartment complex, actual real estate transaction prices, vehicle repair histories, smart city electricity consumption, vehicle owner information, and public data collected by various government agencies.

According to an embodiment, the data collection processing apparatus (120) may collect data from the plurality of data centers (111, 112, 113), standardize and de-identify it, and deliver it to the big data platform (130). The data collection processing apparatus (120) may process the data according to user settings and may provide a user interface and display through which the collection, processing, and transmission of the data can be monitored.

The data collection processing apparatus (120) is described in more detail with reference to FIG. 2.

FIG. 2 is a diagram illustrating the configuration of a data collection processing apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the data collection processing apparatus (120) according to an embodiment of the present invention may include a data collection unit (210), a pre-processing unit (220), a de-identification unit (230), and a data transmission unit (240).

The data collection unit (210) according to an embodiment of the present invention may collect data from a plurality of data centers at preset intervals. The data collected from each data center may be unrefined original data. In general, original data may represent values differently from column to column, may contain missing or erroneous values, and may store personal information in an identifiable form, which risks violating the Personal Information Protection Act. Pre-processing and de-identification of the original data are therefore required.

According to an embodiment, the data collection unit (210) may provide an interface through a user terminal so that the user can change the data collection path. For example, for hospital medical data, the data collection path may be set so that data is collected from the data center in each hospital. As another example, for vehicle repair data, the data collection path may be set so that data is collected from the data center in each vehicle repair shop.

According to an embodiment, the data collection unit (210) may provide an interface through a user terminal so that the user can change the data collection period. For example, the data collection period may be set to 30 minutes so that each data center is checked for new data once every 30 minutes. The data collection period may also be defined as a cron expression so that collection runs at regular intervals.

According to an embodiment, when the preset period is reached, the data collection unit (210) may check whether any data file that has not yet been collected exists in the data collection path, and collect data files only when a new data file exists. The check for new data files may be limited to data that follows the previously collected data, which minimizes the data processing load.
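The new-file check described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the publication: the function names `find_new_files` and `collect_once` are hypothetical, and tracking collected files by file name in an in-memory set is an assumption made for brevity.

```python
from pathlib import Path


def find_new_files(collection_path, already_collected):
    """Return data files in collection_path that have not been collected yet."""
    root = Path(collection_path)
    if not root.is_dir():
        # Nothing to collect if the configured path does not exist.
        return []
    return [
        p for p in sorted(root.iterdir())
        if p.is_file() and p.name not in already_collected
    ]


def collect_once(collection_path, already_collected):
    """Run one polling cycle and return the names of newly collected files.

    `already_collected` is a set of file names (an assumption of this sketch);
    it is updated in place so the next cycle skips these files.
    """
    new_files = find_new_files(collection_path, already_collected)
    for p in new_files:
        already_collected.add(p.name)
    return [p.name for p in new_files]
```

A scheduler would call `collect_once` each time the collection period elapses. A cycle that finds no new file returns an empty list, so no downstream processing is triggered.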

According to an embodiment, the data collection unit (210) may assign an identifier (ID) to each of the plurality of data centers. Assigning an identifier to each data center makes the data easier to manage when building data sets for big data processing. In addition, the collected data set of each data center can be managed by assigning separators for each classification item and each change item of the collected data.

The pre-processing unit (220) according to an embodiment of the present invention may refine the collected data to fit a predetermined data set format. For example, the pre-processing unit (220) may refine the collected data through at least one of data normalization, data conversion, data cleansing, string removal, column deletion, and data obfuscation. Data normalization may include standard scaling and min/max scaling; data conversion may include hash conversion, one-way encryption, and standard code conversion; and data cleansing may include deleting rows that contain invalid values, empty values, or outliers. The pre-processing method may be adjusted according to the user's settings received through the user terminal.
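A few of the refinement operations named above can be illustrated with a short sketch. It is not taken from the publication: the function names are hypothetical, SHA-256 is assumed as the hash for the hash conversion, and the population standard deviation is assumed for standard scaling.

```python
import hashlib
import statistics


def min_max_scale(values):
    """Min/max scaling: map numeric values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Standard scaling: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std dev (assumption)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def hash_value(value):
    """Hash conversion: replace a string with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def drop_incomplete_rows(rows, required):
    """Data cleansing: drop rows whose required columns are missing or empty."""
    return [
        r for r in rows
        if all(r.get(k) not in (None, "") for k in required)
    ]
```

For instance, `min_max_scale([10, 20, 30])` yields `[0.0, 0.5, 1.0]`, and `drop_incomplete_rows` removes any row whose required column is absent or an empty string.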

According to an embodiment, the pre-processing unit (220) may standardize the collected data into consistent data according to predetermined rules. For example, the pre-processing unit (220) may standardize the collected data according to pre-registered standard codes, using a standard code conversion function to change the values of the corresponding column into standard code values. Conventionally, the terminology and numerical criteria differ from one data-providing institution's data center to another, which has limited the ability to combine and comparatively analyze data. The standard code function may be provided so that consistent values are collected in accordance with the data standardization rules.

The de-identification unit (230) according to an embodiment of the present invention may de-identify the data corresponding to sensitive information among the pre-processed data. The definition of sensitive information and the scope of de-identification targets are provided through the personal information protection portal operated by the Personal Information Protection Commission, but the sensitive information may be updated at the user's request.

According to an embodiment, the de-identification unit (230) may extract the columns defined as personal information under a predetermined policy and de-identify them by at least one of masking, rearrangement, and categorization.
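The three techniques named above can be sketched as follows. The publication does not specify their exact behaviour, so everything here is an assumption for illustration: the policy is modelled as a hypothetical mapping from column name to technique, masking keeps only the first character, categorization buckets ages into ten-year ranges, and rearrangement shuffles a column's values across rows.

```python
import random


def mask(value, visible=1, mask_char="*"):
    """Masking: keep the first `visible` characters and mask the rest."""
    return value[:visible] + mask_char * max(len(value) - visible, 0)


def categorize_age(age, width=10):
    """Categorization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def rearrange(values, seed=None):
    """Rearrangement: shuffle a column so values no longer align with their rows."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def de_identify(rows, policy, seed=None):
    """Apply the technique named in `policy` to each listed column.

    `policy` maps column name -> 'mask' | 'categorize' | 'rearrange'
    (a hypothetical format). The input rows are not modified.
    """
    out = [dict(r) for r in rows]
    for column, technique in policy.items():
        if technique == "mask":
            for r in out:
                r[column] = mask(str(r[column]))
        elif technique == "categorize":
            for r in out:
                r[column] = categorize_age(int(r[column]))
        elif technique == "rearrange":
            shuffled = rearrange([r[column] for r in out], seed)
            for r, v in zip(out, shuffled):
                r[column] = v
        else:
            raise ValueError(f"unknown technique: {technique}")
    return out
```

With the policy `{"name": "mask", "age": "categorize"}`, a row `{"name": "Kim", "age": 34}` becomes `{"name": "K**", "age": "30-39"}`.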

The data transmission unit (240) according to an embodiment of the present invention may transmit the data that has been pre-processed and de-identified by the pre-processing unit (220) and the de-identification unit (230) to the big data platform every predetermined second period. The data collection processing apparatus according to an embodiment of the present invention is provided as an agent using edge computing, so that the big data platform does not collect and process the original data itself. Instead, pre-processed and de-identified data is delivered to the big data platform, which distributes the data processing that would otherwise take place within the platform.

According to an embodiment, the data transmission unit (240) may transmit, using SFTP, a data csv file, a json file containing schema meta information, and an end file indicating that data transmission is complete.
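The three-file bundle described above could be staged as in the sketch below before being handed to an SFTP client. The publication does not give file names or a schema layout, so the `.csv`/`.json`/`.end` naming, the fields of the schema document, and the function name `stage_dataset` are all assumptions. The SFTP upload itself is left out because it depends on the client library in use.

```python
import csv
import json
from pathlib import Path


def stage_dataset(rows, columns, out_dir, name):
    """Write <name>.csv, <name>.json and <name>.end into out_dir.

    Returns the three paths in the order they should be uploaded, so the
    end marker is always sent last.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # 1. Data file.
    csv_path = out / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    # 2. Schema meta information (layout is hypothetical).
    schema = {"dataset": name, "columns": columns, "row_count": len(rows)}
    json_path = out / f"{name}.json"
    json_path.write_text(
        json.dumps(schema, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # 3. Empty marker file signalling that the bundle is complete.
    end_path = out / f"{name}.end"
    end_path.touch()

    return [csv_path, json_path, end_path]
```

Uploading the marker last lets the receiving side treat the presence of the end file as the signal that the csv and json files have arrived in full.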

The data collection processing apparatus (120) according to an embodiment of the present invention may provide a monitoring screen so that the user can monitor data through the user terminal. For example, the data collection processing apparatus (120) may provide a monitoring screen showing the data before and after pre-processing and before and after de-identification. Through this monitoring screen, the user can visually check how the data was refined, and can manage storage capacity by deleting sample data stored in the data collection processing apparatus (120). In addition, the data collection, processing, and storage logs show whether each step succeeded and when, which makes it easy to trace the processing date of stored data. Management by data center, collection date, version change, and schema unit makes it easy to distinguish and manage the data of each data-providing institution, and retransmission can be arranged when missing data occurs.

According to an embodiment, the data collection processing apparatus (120) may provide an identifier for each data center that supplies data, together with categories, data sets, and metadata items for distinguishing data sets, and may provide a monitoring screen through which data can be managed by separators for each company, each classification item, and each change item.

FIGS. 3 and 4 are diagrams illustrating the schedule management function for data collection and transmission of the data collection processing apparatus according to an embodiment of the present invention.

Referring to FIG. 3, the data collection processing apparatus according to an embodiment may provide a data schedule management function through the display of the user terminal. The user can separately control the period at which data is collected from each data center and the period at which refined data is transmitted to the user terminal. In FIG. 3, the collection period is 15 minutes and the transmission period is set to immediate transmission, but both can be modified.

Referring to FIG. 4, the data collection processing apparatus according to an embodiment may provide a more advanced data schedule management function through the display of the user terminal. For example, the path of the data to be collected can be modified by changing the target data location, and the data collection date range and the collection period can each be set; the period can also be entered directly as a cron expression. Whether to pre-process, whether to de-identify, and the transmission period can of course also be set here.

FIG. 5 is a diagram illustrating the data standardization function of the data collection processing device according to an embodiment of the present invention.

Referring to FIG. 5, for the gender column a total of 27 values were collected, and after excluding the 14 duplicates, 13 distinct values remain. Looking at the individual values, entries such as 남성 (male), 여 (female), M, 여자 (woman), W, men, woman, 남 (male), and man can be seen. Gender data expressed in such varied forms can be standardized so that it is consistent with a predetermined standardization rule. For example, values such as 남성, M, men, man, and 남 can be standardized to M, and values such as 여, 여자, W, and woman can be standardized to W. Through the user terminal, the user can select the gender column among the personal information and have standardization performed on it.
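The gender standardization described above amounts to a lookup from raw values to standard codes. The following is a minimal sketch under that assumption; the table contains only the values named in the description, and the names `GENDER_STANDARD_CODES`, `standardize_gender`, and `standardize_column` are hypothetical.

```python
# Raw value -> standard code, limited to the values named in the description.
# Keys are stored lower-cased so that "M" and "m" map to the same code.
GENDER_STANDARD_CODES = {
    "남성": "M", "남": "M", "m": "M", "men": "M", "man": "M",
    "여": "W", "여자": "W", "w": "W", "woman": "W",
}


def standardize_gender(value: str):
    """Return the standard code for a raw value, or None if it is unmapped."""
    return GENDER_STANDARD_CODES.get(value.strip().lower())


def standardize_column(rows: list[dict], column: str) -> list[dict]:
    """Return copies of the rows with the chosen column standardized.

    Unmapped values are left unchanged so they can be reviewed later.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        code = standardize_gender(str(row[column]))
        if code is not None:
            new_row[column] = code
        result.append(new_row)
    return result
```

Leaving unmapped values untouched, rather than guessing a code, is a design choice of this sketch; the description does not say how such values are handled.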

FIG. 6 is a diagram illustrating the preprocessing function of the data collection processing device according to an embodiment of the present invention.

Referring to FIG. 6, the data collection processing device according to one embodiment may let the user select a preprocessing method through an interface provided to the user terminal. In this example the standard code conversion method has been selected, and a graph of the maximum and minimum values has been generated for each column.

FIG. 7 is a diagram illustrating the de-identification function of the data collection processing device according to an embodiment of the present invention.

Referring to FIG. 7, the data collection processing device according to one embodiment may let the user select a de-identification method through an interface provided to the user terminal. For each column, the device can indicate whether the column contains personal information that must be de-identified, and if so, de-identification can be performed by at least one of masking, rearrangement, and categorization.
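The three de-identification methods named above can be illustrated with small helper functions. This is a sketch under assumptions: the description does not specify how many characters masking keeps, how rearrangement is randomized, or which bins categorization uses, so the parameters `visible`, `seed`, and `width` and all function names are hypothetical.

```python
import random


def mask(value: str, visible: int = 1, fill: str = "*") -> str:
    """Masking: keep the first `visible` characters and hide the rest."""
    return value[:visible] + fill * max(len(value) - visible, 0)


def rearrange(values: list, seed=None) -> list:
    """Rearrangement: shuffle a column's values so that they no longer
    line up with the rows they came from. The input list is not modified."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def categorize_age(age: int, width: int = 10) -> str:
    """Categorization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

For example, `mask("Alice")` gives `A****` and `categorize_age(37)` gives `30-39`, while `rearrange` returns the same values in a different order.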

FIG. 8 is a flowchart illustrating a big data processing method using the data collection processing device according to an embodiment of the present invention.

Referring to FIG. 8, in step S810, the data collection processing device according to one embodiment may collect data from a plurality of data centers at every preset first period.

In step S820, the data collection processing device according to one embodiment may refine the collected data to fit a predetermined data set format.

In step S830, the data collection processing device according to one embodiment may de-identify data corresponding to sensitive information.

In step S840, the data collection processing device according to one embodiment may transmit the de-identified data to the big data platform at every predetermined second period.
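Steps S810 to S840 can be sketched as a small pipeline in which collection and transmission are driven by two separate timers. This is an illustrative sketch only: the `Pipeline` class, its method names, and the use of plain callables for the data sources, the refinement step, the de-identification step, and the sender are all assumptions, and the scheduling of the two periods is left to the caller.

```python
from typing import Callable, Iterable

Record = dict


class Pipeline:
    """Buffers processed records between collection and transmission."""

    def __init__(
        self,
        sources: Iterable[Callable[[], list]],
        refine: Callable[[Record], Record],
        deidentify: Callable[[Record], Record],
        send: Callable[[list], None],
    ) -> None:
        self.sources = list(sources)
        self.refine = refine
        self.deidentify = deidentify
        self.send = send
        self.pending: list = []

    def on_collect_tick(self) -> None:
        """Call once per first period."""
        for fetch in self.sources:              # S810: collect from each data center
            for record in fetch():
                refined = self.refine(record)   # S820: refine to the data set format
                safe = self.deidentify(refined) # S830: de-identify sensitive data
                self.pending.append(safe)

    def on_transmit_tick(self) -> int:
        """Call once per second period; returns the number of records sent."""
        batch, self.pending = self.pending, []
        if batch:
            self.send(batch)                    # S840: transmit to the platform
        return len(batch)
```

Keeping a buffer between the two ticks lets the collection period and the transmission period differ, as the description allows; calling `on_transmit_tick` right after every `on_collect_tick` would correspond to immediate transmission.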

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, device, circuit, or the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set out below.

120: data collection processing device
210: data collection unit
220: preprocessing unit
230: de-identification unit
240: data transmission unit

Claims

A data collection processing device comprising:
a data collection unit that collects data from a plurality of data centers at every preset first period;
a preprocessing unit that refines the collected data to fit a predetermined data set format;
a de-identification unit that detects and de-identifies data corresponding to sensitive information among the preprocessed data; and
a data transmission unit that transmits the de-identified data to a big data platform at every predetermined second period.

The data collection processing device of claim 1,
wherein the data collection unit provides an interface through which a user can change a data collection path and a data collection period.

The data collection processing device of claim 1,
wherein the data collection unit checks, when the preset first period is reached, whether a data file exists in the data collection path, and collects the data file if it exists.

The data collection processing device of claim 1,
wherein the data collection unit assigns an identifier to each of the plurality of data centers, and manages the data of each data center through classification item and change item identifiers of the collected data.

The data collection processing device of claim 1,
wherein the preprocessing unit standardizes the collected data according to standard codes registered in advance, the standardization being performed by using a standard code conversion function to change the values of the corresponding column to standard code values.

The data collection processing device of claim 1,
wherein the preprocessing unit refines the standardized data through at least one of numeric data normalization, data conversion, data cleansing, string removal, column deletion, and data obfuscation.

The data collection processing device of claim 1,
wherein the de-identification unit extracts columns defined as personal information according to a predetermined policy and performs de-identification by at least one of masking, rearrangement, and categorization.

The data collection processing device of claim 1,
wherein the data transmission unit transmits, using SFTP, a CSV data file, a JSON file containing schema meta information, and an end file indicating completion of the data transmission.

A big data processing method performed in a data server, the method comprising:
collecting data from a plurality of data centers at every preset first period;
refining the collected data to fit a predetermined data set format;
detecting and de-identifying data corresponding to sensitive information among the preprocessed data; and
transmitting the de-identified data to a big data platform at every predetermined second period.