KR20190053616A

KR20190053616A - Data merging device and method for big data analysis

Info

Publication number: KR20190053616A
Application number: KR1020170149691A
Authority: KR
Inventors: 최용준; 장명수; 공성원
Original assignee: (주)위세아이텍
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2019-05-20
Also published as: WO2019093675A1; KR102033151B1

Abstract

Provided is a data merging apparatus for big data analysis, which comprises: a database including a plurality of data sets; a data matching unit determining a first data set and a second data set to perform data matching among a plurality of data sets included in the database, selecting a plurality of column items among column items of the first data set and the second data set to calculate a degree of similarity between the selected column items, and matching data rows of the first data set and the second data set based on the degree of similarity; and a data merging unit merging the first data set and the second data set based on the matching result of the data rows.

Description

The present application relates to a data merging apparatus and method for big data analysis.

Data preprocessing accounts for 70% to 80% of the total effort in big data analysis. Big data analysis is growing explosively, but data preprocessing technology has advanced more slowly than big data analysis technology, so there is a growing need for automated data preprocessing technology.

Moving beyond analysis environments based on relational databases (RDBMS) and internal data, and in step with the opening of public information, the need to recreate information value through mash-ups with external data is being emphasized.

However, data cannot be merged when mechanical matching is difficult: internal and external data may be generated from different viewpoints and for different purposes, or, in the absence of data standards, semantically identical data may be expressed differently. In such cases the data must be matched by committing excessive human resources, and doing so on a routine basis is impossible.

Research on data merging technology that uses data matching algorithms is therefore needed to promote big data analysis. Current data matching algorithms, however, match only on a single data item, so other matchable items cannot be used and accuracy cannot be improved. Accuracy is particularly low for Korean-language data. In addition, data is matched even when it requires correction by a user.

The most widely used data matching algorithm in data integration, which is one of the data preprocessing steps, is the fuzzy data matching algorithm. This algorithm matches data using values computed from the edit distance (Levenshtein distance). Because existing algorithms match by comparing single keys, however, they are structurally ill-suited to tables that use multiple keys. Moreover, because the algorithm was developed for English text, applying it unchanged to Korean data limits accuracy.

The background technology of the present application is disclosed in Korean Patent Laid-Open Publication No. 2011-0099783 (publication date: 2001.11.09).

The present application is intended to solve the problems of the prior art described above. One object is to provide a data merging apparatus and method that, when matching is difficult using similarity analysis of a single data item, selects a plurality of column items from a plurality of data sets, performs a similarity calculation on them, and merges the data.

Another object of the present application is to provide a data merging apparatus and method that merges heterogeneous data more accurately by providing matching information to the user for items that cannot be matched when a plurality of data sets are matched.

A further object of the present application is to provide a data merging apparatus and method that merges a plurality of data more accurately by performing a first matching similarity calculation and a second matching similarity calculation, and then comparing the result of the first calculation with the result of the second to perform the final data matching.

However, the technical problems to be addressed by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a data merging apparatus for big data analysis may include: a database including a plurality of data sets; a data matching unit that determines, among the plurality of data sets included in the database, a first data set and a second data set on which to perform data matching, selects a plurality of column items from the column items of the first data set and the second data set, calculates a degree of similarity between the selected column items, and matches data rows of the first data set and the second data set based on the degree of similarity; and a data merging unit that merges the first data set and the second data set based on the matching result of the data rows.

According to an embodiment of the present application, the data matching unit may derive a first similarity ranking data set by performing a first matching similarity calculation based on a first column of the first data set and of the second data set, and derive a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set. The data matching unit may then compare the first similarity ranking data set with the second similarity ranking data set and match data rows of the first data set and the second data set based on the data with the highest similarity.

According to an embodiment of the present application, the first similarity ranking data set and the second similarity ranking data set calculated by the data matching unit may include ranking information for each data row of the second data set.

According to an embodiment of the present application, the data matching unit may evaluate the following conditions with respect to the first row data of the first data set. The first condition is that the row of the second data set ranked first in the first similarity ranking data set also has rank 1 in the second similarity ranking data set, and the row of the second data set ranked first in the second similarity ranking data set also has rank 1 in the first similarity ranking data set. The second condition is that the row ranked first in the first similarity ranking data set and the row ranked first in the second similarity ranking data set are the same row. When both conditions hold, the data matching unit may determine that the row of the second data set ranked first in the first similarity ranking data set matches the first row data.

According to an embodiment of the present application, when, with respect to the first row data of the first data set, there are a plurality of rows of the second data set that have rank 1 in both the first similarity ranking data set and the second similarity ranking data set, the data matching unit may determine that the row of the second data set corresponding to the first row data cannot be matched.

According to an embodiment of the present application, when, with respect to the first row data of the first data set, the first row data and the second row data of the second data set both have rank 1 in the first similarity ranking data set, the first row data of the second data set has rank 1 in the second similarity ranking data set, and the second row data of the second data set does not have rank 1 in the second similarity ranking data set, the data matching unit may determine that the first row data of the second data set matches the first row data of the first data set.
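The three cases above can be read as a single rule: a row of the second data set is accepted only if it is the sole row holding rank 1 in both rankings. The following Python sketch illustrates that reading under stated assumptions. The function name `decide_match` and the dictionary inputs (row identifier mapped to rank) are hypothetical and do not come from the specification, and the handling of the case where no row is ranked first in both rankings is an assumption, since the text above does not address it.

```python
from typing import Dict, Hashable, Optional


def decide_match(first_ranks: Dict[Hashable, int],
                 second_ranks: Dict[Hashable, int]) -> Optional[Hashable]:
    """Pick the second-data-set row matched to one first-data-set row.

    first_ranks / second_ranks map a row identifier of the second data set
    to its rank in the first / second similarity ranking (1 = most similar).
    Returns the matched row identifier, or None when matching is not possible.
    """
    # Rows that are ranked first in BOTH rankings.
    top_in_both = [row for row, rank in first_ranks.items()
                   if rank == 1 and second_ranks.get(row) == 1]
    if len(top_in_both) == 1:
        return top_in_both[0]   # unique agreement: accept the match
    # Several rows tie in both rankings: "cannot be matched" per the text.
    # No row at all: treated the same way here (assumption, not from the text).
    return None
```

For example, `decide_match({"row1": 1, "row2": 1}, {"row1": 1, "row2": 2})` returns `"row1"`, which corresponds to the third case described above. A `None` result is the point at which the candidates would be handed to the user for a manual decision.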

According to an embodiment of the present application, the data merging apparatus for big data analysis may further include a profiling unit that analyzes, through data analysis, the data types and string patterns of the plurality of column items of the first data set and the second data set, and a similar item analysis unit that analyzes similar items between the first data set and the second data set based on the analysis result of the profiling unit.

According to an embodiment of the present application, the data merging unit may, based on the analysis result of the similar item analysis unit, merge the first data set and the second data set while excluding column items that would be duplicated when the plurality of column items of the two data sets are combined.

According to an embodiment of the present application, a data merging method for big data analysis may include: determining, among a plurality of data sets included in a database, a first data set and a second data set on which to perform data matching; deriving a first similarity ranking data set by performing a first matching similarity calculation based on a first column of the first data set and of the second data set; deriving a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set; comparing the first similarity ranking data set with the second similarity ranking data set to match, to each data row of the first data set, the corresponding data row of the second data set; and merging the first data set and the second data set based on the matching result.

The means for solving the problem described above are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the means for solving the problem described above, a plurality of data can be merged by selecting a plurality of column items from a plurality of data sets and performing a similarity calculation on them.

In addition, a plurality of data can be merged more accurately by providing matching information to the user for items that cannot be matched when a plurality of data sets are matched.

In addition, a plurality of data can be merged more accurately by performing the first matching similarity calculation and the second matching similarity calculation, and then comparing the result of the first calculation with the result of the second to perform the final data matching.

FIG. 1 is a block diagram schematically illustrating the configuration of a data merging apparatus according to an embodiment of the present application.
FIGS. 2A to 2D are schematic diagrams for explaining a first matching similarity calculation according to an embodiment of the present application.
FIG. 3 is a schematic diagram for explaining a second matching similarity calculation according to an embodiment of the present application.
FIG. 4 is a diagram illustrating exemplary results of the first matching similarity calculation and the second matching similarity calculation according to an embodiment of the present application.
FIGS. 5A and 5B are schematic diagrams for explaining how data rows of a first data set and a second data set are matched by comparing a first similarity ranking data set with a second similarity ranking data set according to an embodiment of the present application.
FIGS. 6A and 6B are schematic diagrams for explaining an embodiment of data analysis and similar item analysis according to an embodiment of the present application.
FIG. 7 is a schematic diagram for explaining an embodiment of data merging for big data analysis according to an embodiment of the present application.
FIG. 8 is a flowchart illustrating a data merging method according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. The present application may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram schematically illustrating the configuration of a data merging apparatus according to an embodiment of the present application. FIGS. 2A to 2D are schematic diagrams for explaining the first matching similarity calculation, FIG. 3 is a schematic diagram for explaining the second matching similarity calculation, and FIG. 4 illustrates exemplary results of the first and second matching similarity calculations. FIGS. 5A and 5B are schematic diagrams for explaining how data rows of the first data set and the second data set are matched by comparing the first similarity ranking data set with the second similarity ranking data set.

Referring to FIG. 1, the data merging apparatus 100 may include a database 110, a data matching unit 120, a profiling unit 130, a similar item analysis unit 140, and a data merging unit 150. As another example, the data merging apparatus 100 may receive the data required for merging from an external server, but is not limited thereto.

The data merging apparatus 100 may determine a first data set and a second data set among a plurality of data sets, select a plurality of column items from the column items of the first data set and the second data set, and calculate the similarity between the selected items. The data merging apparatus 100 may match the data rows of the first data set and the second data set based on the similarity. When the data merging apparatus 100 determines, while merging a plurality of data, that a match cannot be made, it may present the matching decision to the user and merge the data based on the user's decision. The data merging apparatus 100 may merge the first data set and the second data set based on the matching result of the data rows.

According to an embodiment of the present application, the data merging apparatus 100 may merge data by matching a first data set and a second data set from among a plurality of data sets using an artificial-intelligence-based data matching algorithm.

Conventional data matching based on a similarity calculation between single column items has the problem that both the first data and the second data of the second data set can be matched, in duplicate, to the first data of the first data set. The data merging apparatus 100 instead performs data matching through a similarity calculation between a plurality of column items, so that the first row data of the first data set is matched with the first row data of the second data set.

Because the data merging apparatus 100 analyzes and matches the data similarity of a plurality of data sets, data can be merged without intervention by data engineering experts or business staff. The data merging apparatus 100 can also improve data matching accuracy by learning and analyzing the growing volume of data in a big data environment and the data characteristics of a specific business domain.

In addition, for data items judged impossible to match, the data merging apparatus 100 may let the user select directly from the matching candidates and correct the result, thereby performing more accurate data matching and merging.

The database 110 may include a plurality of data sets used for big data merging. The database 110 may include unstructured data.

The data matching unit 120 may determine, among the plurality of data sets included in the database 110, a first data set 10 and a second data set 20 on which to perform data matching. According to an embodiment of the present application, the first data set 10 and the second data set 20 may be data sets that include a plurality of column items having a preset degree of similarity, based on the similarity calculation result of the similar item analysis unit 140. According to another example, the first data set 10 and the second data set 20 may be data sets chosen by the user. The data matching unit 120 may determine, among the plurality of data sets, a first data set 10 and a second data set 20 that contain items requiring merging, and perform the similarity calculation on them.

The data matching unit 120 may select a plurality of column items from the column items of the first data set and the second data set and calculate the similarity between the selected column items. The data matching unit 120 may use an artificial-intelligence-based algorithm when performing the similarity calculation. According to an embodiment of the present application, the data matching unit 120 may perform the similarity calculation using the fuzzy data matching algorithm, but is not limited thereto. The fuzzy data matching algorithm matches data using values computed from the edit distance (Levenshtein distance).
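As an illustration of the edit distance mentioned above, the following Python sketch computes the Levenshtein distance between two strings and converts it into a similarity score. The normalization by the longer string's length is an assumption made for this example; the specification does not state how the distance is turned into a similarity value, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row short
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    character position must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this sketch, `levenshtein("훈련원", "훈련원공원")` is 2, because two characters must be appended. Note that the comparison here works on whole Hangul syllables; comparing at the level of individual jamo would give different distances, which is one reason accuracy on Korean data depends on how strings are represented.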

For example, referring to FIGS. 2A and 2B, reference (a) in FIGS. 2A and 2B may be the first data set 10 and reference (b) may be the second data set 20. The first data set may include four column items 11 to 14, and the second data set may include four column items 21 to 24. The column items included in the first data set and the second data set may be distinguished by representative keys. For example, the representative key of the first column item 11 of the first data set may be 'parking lot name' (주차장명), and the first column item 11 may be the set of entries listed under that representative key.

In the first data set 10, the representative key of the first column item 11 is 'parking lot name' (주차장명), that of the second column item 12 is 'parking lot operation start time', that of the third column item 13 is 'parking lot operation end time', and that of the fourth column item 14 is 'parking fee (per 10 minutes, in won)'. In the second data set 20, the representative key of the first column item 21 is 'parking lot' (주차장), that of the second column item 22 is 'parking lot operation start time', that of the third column item 23 is 'parking lot operation end time', and that of the fourth column item 24 is 'address'.

According to an embodiment of the present application, the data matching unit 120 may select, from the plurality of column items of the first data set 10 and the second data set 20, the first to third column items 11 to 13 of the first data set 10 and the first to third column items 21 to 23 of the second data set 20, and calculate the similarity between the selected column items. The number of column items selected from the first data set 10 must equal the number selected from the second data set 20. That is, the data matching unit 120 may select the same number of column items from each data set for data matching. As another example, the user may select the column items to be matched.

The data matching unit 120 may select a representative column item from the selected column items. For example, referring to FIG. 2C, the data matching unit 120 may select the first column item 11 of the first data set 10 and the first column item 21 of the second data set 20 as the representative column items and perform the similarity calculation on them. As another example, the representative column items may be selected by the user. When performing the similarity calculation on the representative column items, the data matching unit 120 may perform data matching using the fuzzy algorithm, but is not limited thereto.

The data matching unit 120 may derive a first similarity ranking data set 30 by performing the first matching similarity calculation based on the first columns 11 and 21 of the first data set 10 and the second data set 20. For example, the first columns of the first data set and the second data set may be 'parking lot name' 11 and 'parking lot' 21. This is only one embodiment; the first columns are not limited to 'parking lot name' 11 and 'parking lot' 21, and the first matching similarity calculation may be performed on first columns 11 and 21 chosen by the data matching unit 120 or by the user to derive the first similarity ranking data set 30.

The data matching unit 120 may perform a similarity calculation between each row data of the first data set 10 and all row data of the corresponding column item of the second data set 20. The first matching similarity calculation may be a calculation that compares the first row data (for example, 'Training Center') of the first column item 11 of the first data set 10 with the first column item 21 of the second data set 20. For example, referring to FIG. 2C, the first column item 11 of the first data set 10 has the representative key 'parking lot name', and the first column item 21 of the second data set 20 has the representative key 'parking lot'. The first data 'Training Center' (훈련원) in the first column item 11 may be compared with each entry of the first column item 21: 'Hanoe Building (Tower)' (한외빌딩(타워)), 'Seoul Station Parking Lot 2' (서울역 2주차장), 'Seoul Station Parking Lot 3' (서울역 3주차장), 'Cheonggyecheon Doosan We've the Zenith' (청계천 두산 위브 더 제니스), 'Jeil Bank (Head Office)' (제일은행(본점)), 'Training Center Park' (훈련원공원), 'Sinbanghwa Station (Line 9 connection)' (신방화역(9호선 연계)), 'Jeil Parking Lot' (제일주차장), and 'Golden Tower' (골든타워). That is, the data matching unit 120 may compare each row data of the first column item 11 of the first data set 10 with the first column item 21 of the second data set 20.

The first similarity ranking data set 30 may be a data set in which each row data of the first column item 11 of the first data set 10 is compared with each row data of the first column item 21 of the second data set 20 and the results are listed in descending order of similarity. The first similarity ranking data set 30 according to an embodiment of the present application can be described with reference to FIG. 2D. Column A of FIG. 2D holds each row data of the first column item 11 of the first data set 10. The columns from "first-order similarity Rank 1 of A" to "first-order similarity Rank 3 of A" hold the data of the first column item 21 of the second data set 20, listed in descending order of the similarity calculated between each row data of the first column item 11 and the first column item 21. The similarity results may be tied. For example, the data of the first column item 21 matched to the first data of the first column item 11 (for example, 'Seoul Station Parking Lot') may overlap (for example, 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3'). In that case, the data matching unit 120 may derive the first similarity ranking data set in the order of the column index that the data of the first column item 21 has in the second data set 20.

For example, for 'Training Center' in the first column item 11 of the first data set 10, the data matching unit 120 may order the row data of the first column item 21 of the second data set 20 as ('Training Center Park', 90, 5), ('Hanoe Building (Tower)', 0, 0), ('Seoul Station Parking Lot 2', 0, 1) to derive the first similarity ranking data set. Each entry of the first similarity ranking data set 30 may thus include the similarity result value and the column index in the second data set 20. That is, the similarity between 'Training Center' in the first column item 11 and 'Training Center Park' is 90, and the column index of 'Training Center Park' in the second data set 20 may be 5. When the similarity results between a row data of the first column item 11 and several entries of the first column item 21 of the second data set 20 are equal, the entries may be listed in the order of their column index in the second data set 20. For example, when the similarity of 'Seoul Station Parking Lot' to both 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3' is 93, the column index of 'Seoul Station Parking Lot 2' is 1, and that of 'Seoul Station Parking Lot 3' is 2, the data matching unit 120 may list the results in the order of the column indexes in the second data set 20.
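The first-pass ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `difflib.SequenceMatcher` from the Python standard library stands in for the unspecified fuzzy algorithm, the names `similarity` and `first_pass_ranking` are hypothetical, and the scores it produces are not expected to reproduce the values 90 and 93 quoted in the text.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stand-in for the fuzzy algorithm)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def first_pass_ranking(value: str, candidates: list[str]) -> list[tuple[str, float, int]]:
    """Rank candidates by similarity to value; tied scores keep index order.

    Each entry is (candidate, score, index), mirroring the
    (data, similarity, column index) triples in the description.
    """
    scored = [(cand, similarity(value, cand), idx) for idx, cand in enumerate(candidates)]
    # Sort by descending score, then by ascending index for tied scores.
    return sorted(scored, key=lambda entry: (-entry[1], entry[2]))


# Abbreviated, translated version of the first column item 21.
second_column = [
    "Hanoe Building (Tower)",
    "Seoul Station Parking Lot 2",
    "Seoul Station Parking Lot 3",
    "Training Center Park",
]
ranking = first_pass_ranking("Seoul Station Parking Lot", second_column)
```

With this stand-in scorer the two Seoul Station entries tie, so the index alone decides their order, which is the tie-break behavior the paragraph describes.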

The data matching unit 120 may perform a second matching similarity calculation based on the first similarity ranking data set 30 to derive a second similarity ranking data set 40. Here, the first similarity ranking data set 30 and the second similarity ranking data set 40 may include ranking information for each data row of the second data set 20. That is, each of them may list the data rows of the second data set 20 in order of similarity value.

The data matching unit 120 may compare a plurality of column items of the first data set 10 and the second data set 20 based on the first similarity ranking data set 30 to derive the second similarity ranking data set 40. Specifically, the data matching unit 120 may perform the first matching similarity calculation based on the first columns 11 and 21 to derive the first similarity ranking data set 30 and then, following the similarity ranking of the first similarity ranking data set 30, perform the second matching similarity calculation on each row data of the remaining selected column items of the first data set 10 and the second data set 20. In other words, proceeding in the order of the similarity ranking, the data matching unit 120 may calculate the similarity between the remaining column data of each row of the second data set 20 listed in the first similarity ranking data set 30 and the remaining column data of the corresponding row of the first data set 10.

For example, FIG. 3(a) may illustrate the process of deriving the second similarity ranking data set 40. In the second matching similarity calculation, the data matching unit 120 may calculate the similarity between column items based on the first columns 11 and 21 as in ① of FIG. 3(a), based on the second columns 12 and 22 as in ②, and based on the third columns 13 and 23 as in ③. In other words, in the first matching similarity calculation the data matching unit 120 calculates the similarity between the first row data 'Training Center' of the first column item 11 of the first data set 10 and the first column item 21 of the second data set 20, and the result may be derived as the first similarity ranking data set 30, as in ① of FIG. 3(a). The data matching unit 120 may then perform the second matching similarity calculation between the second column item 12 and third column item 13 located in the same row as 'Training Center' in the first data set 10 and the second column item 22 and third column item 23 located in the same rows as the first to third data, 'Training Center Park' to 'Seoul Station Parking Lot 2', of the first column item 21 of the second data set 20. The second matching similarity calculation includes, for each row data of the first data set 10, a similarity calculation with all row data of the second data set 20.

When performing the second matching similarity calculation, the data matching unit 120 may (①) calculate the similarity between 'Training Center', the row data of the 'parking lot name' column item of the first data set 10, and 'Training Center Park', the row data of the 'parking lot' column item of the second data set 20; (②) calculate the similarity between the 'parking lot operation start time' value in the 'Training Center' row of the first data set 10 and the 'parking lot operation start time' value in the 'Training Center Park' row of the second data set 20; and (③) calculate the similarity between the 'parking lot operation end time' value in the 'Training Center' row of the first data set 10 and the 'parking lot operation end time' value in the 'Training Center Park' row of the second data set 20. The data matching unit 120 may determine the second similarity ranking from the average of the results of ①, ②, and ③ and derive the second similarity ranking data set 40.

When deriving the second similarity ranking data set 40, the data matching unit 120 may include the result of the calculation between each data of the first column item 11 and each row data of the first similarity ranking data set 30. That is, referring to FIG. 3(b), the data matching unit 120 may determine that the column index of 'Training Center Park' in the second data set 20 is 5, that the first matching similarity between 'Training Center' and 'Training Center Park' is 90, and that the second matching similarity (the average of the similarity results ①, ②, and ③) is 96.67. According to an embodiment of the present application, the second matching similarity value, calculated from the first to third column items, is higher than the first matching similarity value, calculated from the first column item alone, and comparing a plurality of column items allows more accurate data matching. The data matching unit 120 may match the data rows of the first data set and the second data set based on the similarity.
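The averaging step can be checked with a short calculation. The sketch below assumes that the per-column scores for ② and ③ are both 100. The text does not state these values; they are chosen here only because they are consistent with the reported first-pass score of 90 and the reported average of 96.67. The helper name `second_pass_score` is hypothetical.

```python
from statistics import mean


def second_pass_score(column_scores: list[float]) -> float:
    """Average the per-column similarity scores for one candidate row."""
    return round(mean(column_scores), 2)


# Score (1) is the name similarity from the text; (2) and (3) are assumed values.
score = second_pass_score([90, 100, 100])
```

Under that assumption the candidate's score rises from 90 to 96.67 once the start-time and end-time columns are taken into account, which is the effect the paragraph attributes to comparing several column items.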

Referring to FIG. 4, FIG. 4(a) may show, among the results of the first matching similarity calculation based on the first columns 11 and 21 of the first data set 10 and the second data set 20, the data with the highest similarity. For example, the first-pass similarity between 'Training Center' in the first data set 10 and 'Training Center Park' in the second data set 20 may be calculated as 90%. A name in the first column 11 of the first data set 10 may correspond to more than one entry. For example, for 'Seoul Station Parking Lot', the matching results in the first column 21 of the second data set 20 differ from each other while their similarity values are both 93%. That is, 'Seoul Station Parking Lot' in the first column 11 of the first data set 10 may be matched to a plurality of data, 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3'.

FIG. 4(b) may show the result of performing the second matching similarity calculation with the second column item added. Based on the result of the first matching similarity calculation performed on the first columns 11 and 21 of the first data set 10 and the second data set 20, the similarity calculation may be performed again with the second columns 12 and 22 included. By performing the similarity calculation on a plurality of columns, the data matching unit 120 can derive a more accurate similarity result than a calculation performed on a single column.

In FIG. 4, the first matching similarity calculation and the second matching similarity calculation are performed based on the first column and the second column, but a plurality of columns (for example, three) may be selected when performing the second matching similarity calculation. The columns described above may be selected by the data matching unit 120, but the present application is not limited thereto, and the similarity calculation may be performed on column items of the first data set 10 and the second data set 20 chosen by the user.

In an embodiment of the present application, to match more accurately the duplicate data that arises after the first matching similarity calculation, the second matching similarity calculation may be performed based on the result of the first matching similarity calculation, and the final data matching may be performed by comparing the results of the two calculations. By comparing the first matching similarity result with the second matching similarity result before the final data matching, the data matching unit 120 can increase the accuracy of row data matching between the first data set 10 and the second data set 20.

The data matching unit 120 may compare the first similarity ranking data set 30 and the second similarity ranking data set 40 and match the data rows of the first data set 10 and the second data set 20 based on the data with the highest similarity.

In the description of FIG. 5A below, a is the data ranked first (Rank 1) in the first similarity ranking data set 30, and a' is the data ranked first (Rank 1) in the second similarity ranking data set 40. N is the rank, in the second similarity ranking data set 40, of the data ranked first in the first similarity ranking data set 30, and N' is the rank, in the first similarity ranking data set 30, of the data ranked first in the second similarity ranking data set 40.

Referring to FIG. 5A, according to an embodiment of the present application, the data matching unit 120 may compare N and N' in step S510. When N and N' match, the data matching unit 120 may compare a and a' in step S520. When N and N' match and a and a' match, the data matching unit 120 may determine that the row data of the second data set 20 ranked first in the first similarity ranking data set 30 matches the first row data of the first data set 10. Stated more fully, in association with the first row data (for example, A1) of the first data set 10, suppose that the rank (N) in the second similarity ranking data set 40 of the row data of the second data set 20 ranked first in the first similarity ranking data set 30, and the rank (N') in the first similarity ranking data set 30 of the row data of the second data set 20 ranked first in the second similarity ranking data set 40, are both 1. Suppose further that the row data (a) ranked first in the first similarity ranking data set 30 and the row data (a') ranked first in the second similarity ranking data set 40 are the same. The data matching unit 120 may then determine that the row data of the second data set 20 ranked first in the first similarity ranking data set 30 matches the first row data of the first data set 10.

For example, referring to FIG. 5B(a), when N and N' match and a and a' match, the data matching unit 120 may determine that 'Training Center' and 'Training Center Park' match. That is, in association with 'Training Center', 'Training Center Park' is ranked first in the first similarity ranking data set 30 and has rank 1 in the second similarity ranking data set 40, and 'Training Center Park' is ranked first in the second similarity ranking data set 40 and has rank 1 in the first similarity ranking data set 30, so N and N' are determined to match. In addition, the data ranked first in the first similarity ranking data set 30 ('Training Center Park') and the data ranked first in the second similarity ranking data set 40 ('Training Center Park') are the same. The data matching unit 120 may therefore determine that 'Training Center' (the row data of the first data set 10 containing 'Training Center') and 'Training Center Park' (the row data of the second data set 20 containing 'Training Center Park') match.

In addition, according to an embodiment of the present application, when N and N' match in step S510, the data matching unit 120 may compare a and a' in step S520. When the comparison in step S520 shows that a and a' do not match, the data matching unit 120 may determine that no row data of the second data set 20 can be matched to the first row data. In other words, in association with the first row data of the first data set 10, when there are a plurality of row data of the second data set 20 that are ranked first in both the first similarity ranking data set 30 and the second similarity ranking data set 40, the row data of the second data set 20 matching the first row data may be determined to be unmatchable.

For example, referring to FIG. 5B(b), when N and N' are both 1 but there are a plurality of data a and a' for which N and N' are 1, the data matching unit 120 may determine that matching is not possible. The data matching unit 120 compares 'Officia' with the data items in the second data set 20; the first similarity ranking data set 30 may be a set listed as ('Officia 1', 89, 13), ('Officia 2', 89, 14), and so on, and the second similarity ranking data set 40 may be a set listed as ('Officia 1', 13, 89, 94.5), ('Officia 2', 14, 89, 94.5), and so on. The data matching unit 120 may determine that the data ranked first in the first similarity ranking data set 30 are 'Officia 1' and 'Officia 2' (that is, a plurality of row data share the first rank in the first similarity ranking) and that the data ranked first in the second similarity ranking data set 40 are also 'Officia 1' and 'Officia 2' (that is, a plurality of row data share the first rank in the second similarity ranking). In FIG. 5B(b), RANK 1 and RANK 2 may denote the number of data rather than a similarity rank. Because a plurality of data can be matched to 'Officia', the data matching unit 120 may provide the determination result to the user in order to obtain a more accurate match, so that the row data of the second data set matching the first row data of the first data set 10 can be decided. That is, when, for a specific row data of the first data set 10, a plurality of row data of the second data set 20 are ranked first in both the first similarity ranking and the second similarity ranking, the data matching unit 120 cannot determine the row data of the second data set 20 that matches that row data. According to an embodiment of the present application, when the row data of the second data set matching the first row data of the first data set 10 cannot be determined, it may be determined through correction or selection by the user.

In addition, according to an embodiment of the present application, when N and N' do not match in step S510, the data matching unit 120 may compare N and N' in step S530. When N is greater than or equal to N', the data matching unit 120 may determine that the data ranked first in the second similarity ranking data set 40 matches the first row data of the first data set 10. In other words, in association with the first row data (for example, A1) of the first data set 10, when the first row data and the second row data of the second data set 20 are both ranked first in the first similarity ranking data set 30, the first row data of the second data set 20 is ranked first in the second similarity ranking data set 40, and the second row data of the second data set 20 is not ranked first in the second similarity ranking data set 40, the first row data of the second data set 20 may be determined to match the first row data of the first data set 10. The data matching unit 120 may calculate and compare the ranking information (N, N') of each row data (a, a').

For example, referring to FIG. 5B(c), when N and N' do not match, the data matching unit 120 may compare the ranks N and N' to determine the matching data. Suppose the first row data of the first data set is 'Seoul Station Parking Lot'. The data matching unit 120 may derive the first similarity ranking data set 30 for 'Seoul Station Parking Lot' by listing the results as ('Seoul Station Parking Lot 2', 93, 1), ('Seoul Station Parking Lot 3', 93, 2), and so on, and may derive the second similarity ranking data set 40 by listing the results as ('Seoul Station Parking Lot 3', 2, 93, 97.67), ('Seoul Station Parking Lot 2', 1, 93, 84.33), and so on. The data matching unit 120 may determine a to be 'Seoul Station Parking Lot 2', a' to be 'Seoul Station Parking Lot 3', N to be 2, and N' to be 1. That is, the data matching unit 120 may determine that N and N' do not match and that N is a larger rank number than N', and may determine that the data ranked first in the second similarity ranking data set 40 ('Seoul Station Parking Lot 3') matches the first row data of the first data set ('Seoul Station Parking Lot'). The data matching unit 120 may calculate and compare the ranking information (N, N') of each row data (a, a').

In association with the first row data (for example, A1) of the first data set 10, when the first row data and the second row data of the second data set 20 are both ranked first in the first similarity ranking data set 30 but their ranks differ in the second similarity ranking data set 40, the data matching unit 120 may determine that the row data of the second data set 20 with the higher second similarity rank matches the first row data of the first data set 10.
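The decision flow of steps S510 to S530 can be sketched as follows. This is one possible reading of the description, not the implementation of the embodiment. It assumes that tied candidates share the same rank, that a and a' are the first-listed rank-1 candidates, and that "a and a' do not match" covers the case where several candidates are ranked first in both rankings. The function names and the status strings are hypothetical, and the case N < N' is left open because the text does not cover it.

```python
from typing import Optional


def competition_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank 1 is the highest score; tied candidates share the same rank."""
    return {
        name: 1 + sum(other > score for other in scores.values())
        for name, score in scores.items()
    }


def decide_match(first: dict[str, float], second: dict[str, float]) -> tuple[str, Optional[str]]:
    """Apply S510-S530 to one row of the first data set.

    `first` and `second` map each candidate row of the second data set to
    its first-pass and second-pass similarity; both must hold the same keys.
    """
    r1, r2 = competition_ranks(first), competition_ranks(second)
    a = next(name for name in first if r1[name] == 1)         # top of ranking 30
    a_prime = next(name for name in second if r2[name] == 1)  # top of ranking 40
    n, n_prime = r2[a], r1[a_prime]                           # N and N'
    if n == n_prime:                                          # S510
        top_in_both = [c for c in first if r1[c] == 1 and r2[c] == 1]
        if a == a_prime and len(top_in_both) == 1:            # S520
            return ("matched", a)
        return ("unmatchable", None)  # left to the user's correction or selection
    if n > n_prime:                                           # S530 (N != N' here)
        return ("matched", a_prime)
    return ("unspecified", None)      # N < N' is not described in the text


# Scores from FIG. 5B as quoted in the text; the second-pass 0 in case (a) is assumed.
case_a = decide_match(
    {"Training Center Park": 90, "Hanoe Building (Tower)": 0},
    {"Training Center Park": 96.67, "Hanoe Building (Tower)": 0},
)
case_b = decide_match(
    {"Officia 1": 89, "Officia 2": 89},
    {"Officia 1": 94.5, "Officia 2": 94.5},
)
case_c = decide_match(
    {"Seoul Station Parking Lot 2": 93, "Seoul Station Parking Lot 3": 93},
    {"Seoul Station Parking Lot 3": 97.67, "Seoul Station Parking Lot 2": 84.33},
)
```

Under these assumptions the three calls reproduce the outcomes described for FIG. 5B(a) to (c): a match to 'Training Center Park', no automatic match for 'Officia', and a match to 'Seoul Station Parking Lot 3'.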

FIGS. 6A and 6B are schematic diagrams for explaining data analysis and similar item analysis according to an embodiment of the present application.

Referring to FIG. 6a, the profiling unit 130 may analyze, through data analysis, the data types and character string patterns of the plurality of column items of the first data set 10 and the second data set 20. The profiling unit 130 may analyze the data of the plurality of data sets included in the database 110 by data type and character string pattern for each column item. The profiling unit 130 may also perform an analysis that structures the unstructured data included in the database 110.
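As a rough illustration of per-column profiling, the Python sketch below infers a coarse data type and a character string pattern for each value in a column. The pattern alphabet (digits become `9`, letters become `A`), the type names, and the function names are all hypothetical; the specification does not state how the profiling unit 130 represents types or patterns.

```python
import re
from typing import Dict, List


def string_pattern(value: str) -> str:
    """Collapse a value into a pattern: digits -> '9', letters -> 'A'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # punctuation and spaces are kept as-is
    return "".join(out)


def infer_type(value: str) -> str:
    """Guess a coarse data type from the textual form of a value."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "float"
    if re.fullmatch(r"\d{1,2}:\d{2}", value):
        return "time"
    return "string"


def profile_column(values: List[str]) -> Dict[str, List[str]]:
    """Summarise the distinct types and patterns observed in one column."""
    return {
        "types": sorted({infer_type(v) for v in values}),
        "patterns": sorted({string_pattern(v) for v in values}),
    }


# Hypothetical values for a 'parking lot operation start time' column.
profile = profile_column(["09:00", "18:30"])
```

Two columns whose profiles share the same types and patterns are natural candidates for the similar item analysis described next.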

Referring to FIG. 6b, the similar item analysis unit 140 may analyze similar items between the first data set 10 and the second data set 20 based on the analysis result of the profiling unit 130. For example, the similar item analysis unit 140 may perform similarity analysis by analyzing the character string patterns of the representative keys of the plurality of column items (11 to 14) included in the first data set 10 and of the representative keys of the plurality of column items (21 to 24) included in the second data set 20. The similarity between the representative key of 'parking lot operation start time' in the first data set 10 and the representative key of 'parking lot operation start time' in the second data set 20 may be calculated as 99%. This means that a column corresponding to 'parking lot operation start time' is included in both the first data set 10 and the second data set 20. The similar item analysis unit 140 may calculate the similarity between 'parking lot name' in the first data set 10 and 'parking lot' in the second data set 20 as 80%. From the character string pattern analysis of 'parking lot name' and 'parking lot', the similar item analysis unit 140 may determine that, although similar character strings are used, the first data set 10 and the second data set 20 do not contain the same column item.

The data matching unit 120 and the similar item analysis unit 140 may perform the similarity analysis based on the Levenshtein distance.
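For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence, sketched below in Python. The conversion to a percentage (`similarity_percent`, normalising by the longer string's length) is an assumption: the specification does not say how a distance is turned into a similarity score, so this sketch will not necessarily reproduce the percentages quoted in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalisation: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

Keeping only two rows of the distance table at a time limits memory use to the length of the shorter string, which matters when many column values are compared.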

FIG. 7 is a schematic diagram for explaining an example of data merging for big data analysis according to an embodiment of the present application.

The data merging unit 150 may merge the first data set 10 and the second data set 20 based on the matching result of the data rows. In addition, based on the analysis result of the similar item analysis unit 140, the data merging unit 150 may merge the first data set 10 and the second data set 20 while excluding column items that would be duplicated when the plurality of column items of the first data set and the second data set are merged. According to the matching result of the data rows, the data merging unit 150 may merge each column item using the matched row data of the first data set 10 and the second data set 20.

By way of example, referring to FIG. 7, the data merging unit 150 may merge the first data set 10 and the second data set 20 based on the column items required for the big data analysis. The first data set 10 includes the column items parking lot name (11), parking lot operation start time (12), parking lot operation end time (13), and parking fee (14; per 10 minutes, in won), and the second data set 20 includes the column items parking lot (21), parking lot operation start time (22), parking lot operation end time (23), and address (24); the data merging unit 150 may select from these column items to perform the merge. By way of example, the data merging unit 150 may perform the merge based on column items selected by the user, but is not limited thereto.

Based on the analysis result of the similar item analysis unit 140, the data merging unit 150 may perform the data merge by keeping only one column for each of the two column items duplicated across the first data set 10 and the second data set 20, namely the parking lot operation start time (12 and 22) and the parking lot operation end time (13 and 23).
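A minimal sketch of this merge step follows, using plain Python dictionaries as rows. The function name `merge_rows`, the `matches` mapping (first-set row index to second-set row index), the `skip_columns` parameter, and the sample values (including the address string) are hypothetical. Keeping unmatched first-set rows unchanged is also an assumption.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_rows(
    first_rows: List[Row],
    second_rows: List[Row],
    matches: Dict[int, int],
    skip_columns: List[str],
) -> List[Row]:
    """Merge matched rows, dropping second-set columns listed in skip_columns.

    matches maps an index in first_rows to the index of its matched row
    in second_rows; first-set rows without a match are kept unchanged.
    """
    merged = []
    for i, row in enumerate(first_rows):
        out = dict(row)  # start from the first-set columns
        j = matches.get(i)
        if j is not None:
            for column, value in second_rows[j].items():
                if column in skip_columns:
                    continue  # duplicated column: keep the first-set copy
                out[column] = value
        merged.append(out)
    return merged


first_rows = [{
    "parking lot name": "Seoul Station Parking Lot",
    "operation start time": "09:00",
    "operation end time": "18:00",
    "parking fee": "300",
}]
second_rows = [{
    "parking lot": "Seoul Station Parking Lot 3",
    "operation start time": "09:00",
    "operation end time": "18:00",
    "address": "hypothetical address",
}]
result = merge_rows(
    first_rows,
    second_rows,
    matches={0: 0},
    skip_columns=["parking lot", "operation start time", "operation end time"],
)
```

Here the merged row carries the four first-set columns plus `address`, mirroring the column selection shown for FIG. 7.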

In this embodiment, the data merging unit 150 selects and merges the column items parking lot name (11), parking lot operation start time (12), parking lot operation end time (13), parking fee (14; per 10 minutes, in won), and address (24); however, the data merge may be performed by selecting any plurality of column items from among the column items included in the first data set 10 and the second data set 20.

FIG. 8 is a flowchart illustrating a data merging method according to an embodiment of the present application. The data merging method shown in FIG. 8 is performed by the data merging device 100 described with reference to FIGS. 1 to 7. Accordingly, even where omitted below, the description of the data merging device 100 given with reference to FIGS. 1 to 7 also applies to FIG. 8.

In step S801, the data merging device 100 may determine, from among a plurality of data sets, a first data set 10 and a second data set 20 on which to perform data matching.

In step S802, the data merging device 100 may derive a first similarity ranking data set 30 by performing a first matching similarity calculation based on the first column of the first data set 10 and the second data set 20.

In step S803, the data merging device 100 may derive a second similarity ranking data set 40 by performing a second matching similarity calculation based on the first similarity ranking data set 30.

In step S804, the data merging device 100 may compare the first similarity ranking data set 30 and the second similarity ranking data set 40 to find, for each data row of the first data set 10, the matching data row of the second data set 20.

In step S805, the data merging device 100 may merge the first data set 10 and the second data set 20 based on the matching result.
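Steps S802 and S803 can be sketched in Python as a two-stage ranking. Several points here are assumptions rather than statements of the specification: `difflib.SequenceMatcher` from the Python standard library is swapped in as a stand-in similarity measure (the text itself names the Levenshtein distance), the second calculation is taken to be the mean similarity over several column pairs, and the names `first_ranking`, `second_ranking`, `top_k`, and `column_pairs` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Row = Dict[str, str]


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 100] (not the Levenshtein-based measure)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def first_ranking(
    target: Row, candidates: List[Row], key1: str, key2: str, top_k: int = 3
) -> List[Tuple[int, float]]:
    """S802 sketch: rank second-set rows by similarity on a single column."""
    scored = [(i, sim(target[key1], row[key2])) for i, row in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def second_ranking(
    target: Row,
    candidates: List[Row],
    first: List[Tuple[int, float]],
    column_pairs: List[Tuple[str, str]],
) -> List[Tuple[int, float]]:
    """S803 sketch: re-rank the first-stage candidates by the mean
    similarity over several (first-set column, second-set column) pairs."""
    rescored = []
    for i, _ in first:
        scores = [sim(target[c1], candidates[i][c2]) for c1, c2 in column_pairs]
        rescored.append((i, sum(scores) / len(scores)))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Hypothetical rows for illustration only.
target = {"name": "Seoul Station Parking Lot", "start": "09:00"}
candidates = [
    {"lot": "Seoul Station Parking Lot 2", "start": "10:00"},
    {"lot": "Seoul Station Parking Lot 3", "start": "09:00"},
    {"lot": "Busan Station Parking Lot", "start": "09:00"},
]
first = first_ranking(target, candidates, "name", "lot", top_k=2)
second = second_ranking(target, candidates, first, [("name", "lot"), ("start", "start")])
```

In this example the two Seoul Station candidates are indistinguishable on the name column alone, and the additional start-time column is what separates them in the second ranking.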

The foregoing description of the present application is for illustration, and a person of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

10: first data set 20: second data set
30: first similarity ranking data set 40: second similarity ranking data set
100: data merging device
110: database
120: data matching unit
130: profiling unit
140: similar item analysis unit
150: data merging unit

Claims

A data merging device for big data analysis, comprising:
a database including a plurality of data sets;
a data matching unit that determines a first data set and a second data set on which to perform data matching from among the plurality of data sets included in the database, selects a plurality of column items from among the column items of the first data set and the second data set, calculates a similarity between the selected column items, and matches data rows of the first data set and the second data set based on the similarity; and
a data merging unit that merges the first data set and the second data set based on the matching result of the data rows.

The device according to claim 1,
wherein the data matching unit
derives a first similarity ranking data set by performing a first matching similarity calculation based on the first column of the first data set and the second column of the second data set, derives a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set,
and compares the first similarity ranking data set and the second similarity ranking data set to match the data rows of the first data set and the second data set based on the data having the highest similarity.

3. The device of claim 2,
wherein the first similarity ranking data set and the second similarity ranking data set include ranking information of each data row of the second data set.

The device of claim 3,
wherein the data matching unit,
in association with the first row data of the first data set, when the rank in the second similarity ranking data set of the row data of the second data set located at rank 1 of the first similarity ranking data set and the rank in the first similarity ranking data set of the row data of the second data set located at rank 1 of the second similarity ranking data set are both 1, and the row data of the second data set located at rank 1 of the second similarity ranking data set coincides with the row data of the second data set located at rank 1 of the first similarity ranking data set, determines that the row data is matched with the first row data.

The device of claim 3,
wherein the data matching unit,
in association with the first row data of the first data set, when the rank in the second similarity ranking data set of the row data of the second data set located at rank 1 of the first similarity ranking data set and the rank in the first similarity ranking data set of the row data of the second data set located at rank 1 of the second similarity ranking data set are both 1 and there are a plurality of such row data of the second data set, determines the row data of the second data set to be non-matching.

The device of claim 3,
wherein the data matching unit,
in association with the first row data of the first data set, when the ranks in the first similarity ranking data set of the first row data and the second row data of the second data set are both 1, the rank in the second similarity ranking data set of the first row data of the second data set is 1, and the rank in the second similarity ranking data set of the second row data of the second data set is not 1, determines that the first row data of the second data set matches the first row data of the first data set.

The device according to claim 1, further comprising:
a profiling unit that analyzes, through data analysis, data types and character string patterns of a plurality of column items of the first data set and the second data set; and
a similar item analysis unit that analyzes similar items between the first data set and the second data set based on the analysis result of the profiling unit.

The device according to claim 6,
wherein the data merging unit
merges the first data set and the second data set based on the analysis result of the similar item analysis unit, excluding column items that are duplicated when the plurality of column items of the first data set and the second data set are merged.

A data merging method for big data analysis, comprising:
determining a first data set and a second data set on which to perform data matching from among a plurality of data sets included in a database;
deriving a first similarity ranking data set by performing a first matching similarity calculation based on the first column of the first data set and the second data set;
deriving a second similarity ranking data set by performing a second matching similarity calculation based on the first similarity ranking data set;
comparing the first similarity ranking data set and the second similarity ranking data set to find, for each data row of the first data set, the matching data row of the second data set; and
merging the first data set and the second data set based on the matching result.