KR102033151B1

KR102033151B1 - Data merging device and method for big data analysis

Info

Publication number: KR102033151B1
Application number: KR1020170149691A
Authority: KR
Inventors: 최용준; 장명수; 공성원
Original assignee: (주)위세아이텍
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2019-10-16
Also published as: KR20190053616A; WO2019093675A1

Abstract

A data merging apparatus for big data analysis may include: a database containing a plurality of data sets; a data matching unit that determines, from among the data sets in the database, a first data set and a second data set on which to perform data matching, selects a plurality of column items from the column items of the first and second data sets, computes the similarity between the selected column items, and matches data rows of the first and second data sets based on that similarity; and a data merging unit that merges the first data set and the second data set based on the result of the row matching.

Description

Data merging device and method for big data analysis {DATA MERGING DEVICE AND METHOD FOR BIG DATA ANALYSIS}

The present application relates to a data merging apparatus and method for big data analysis.

In big data analysis, 70% to 80% of the total effort goes into data preprocessing. Big data analysis is growing explosively, but data preprocessing technology has advanced more slowly than analysis technology, so the need for automated data preprocessing technology is growing.

Moving beyond analysis environments built on relational databases (RDBMS) and internal data, and in step with the opening of public information, there is growing emphasis on the need to re-create information value through mash-ups with external data.

However, data cannot be merged when mechanical matching is difficult, for example when internal and external data are generated from different perspectives and for different purposes, or when the absence of a data standard means that semantically identical data is expressed differently. In such cases the data must be matched by committing excessive human resources, and doing so on a continuous basis is not feasible.

Therefore, research on data merging technology that uses data matching algorithms is needed to promote big data analysis. Current data matching algorithms, however, match only one data item against another, so other matchable items cannot be used and accuracy cannot be improved. Accuracy is particularly poor for Korean-language data. In addition, existing algorithms match data even when the data requires correction by a user.

The data matching algorithm most widely used in data integration, which is one stage of data preprocessing, is the Fuzzy Data Matching algorithm. It matches data using a result value computed from the edit distance (Levenshtein distance). Because the existing algorithm matches by comparing single keys, it is structurally ill-suited to tables that use multiple keys. And because it was developed for English, applying it unchanged to Korean data limits accuracy.
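The edit distance mentioned above can be sketched as follows. This is a minimal illustration of the textbook Levenshtein distance, not the patent's implementation; the `similarity` normalization (one minus distance over the longer length) is an assumption chosen for illustration, since the text does not say how the distance is turned into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Python iterates strings by code point, so each precomposed Hangul syllable counts as one character here. Whether syllable-level or jamo-level comparison suits Korean data better is left open by the text.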

The background art of the present application is disclosed in Korean Patent Laid-Open Publication No. 2011-0099783 (publication date: 2001.11.09).

The present application is intended to solve the problems of the prior art described above. It aims to provide a data merging apparatus and method that, when matching is difficult using similarity analysis of a single data item, selects a plurality of column items from a plurality of data sets, performs a similarity operation on them, and merges the data.

The present application also aims to provide a data merging apparatus and method that merges heterogeneous data more accurately by providing matching information to the user for items that cannot be matched when matching a plurality of data sets.

The present application further aims to provide a data merging apparatus and method that merges a plurality of data more accurately by performing a first matching similarity operation and a second matching similarity operation and then comparing the results of the two operations to perform the final data matching.

However, the technical problems to be addressed by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a data merging apparatus for big data analysis may include: a database containing a plurality of data sets; a data matching unit that determines, from among the data sets in the database, a first data set and a second data set on which to perform data matching, selects a plurality of column items from the column items of the first and second data sets, computes the similarity between the selected column items, and matches data rows of the first and second data sets based on that similarity; and a data merging unit that merges the first data set and the second data set based on the result of the row matching.

According to an embodiment of the present application, the data matching unit may perform a first matching similarity operation based on a first column of the first data set and of the second data set to derive a first similarity rank data set, and perform a second matching similarity operation based on the first similarity rank data set to derive a second similarity rank data set. It may then compare the first and second similarity rank data sets and match data rows of the first data set and the second data set based on the data with the highest similarity.

According to an embodiment of the present application, the first similarity rank data set and the second similarity rank data set computed by the data matching unit may include rank information for each data row of the second data set.

According to an embodiment of the present application, with respect to the first row data of the first data set, the data matching unit may determine that the row data of the second data set ranked first in the first similarity rank data set matches the first row data when (i) that row data also has rank 1 in the second similarity rank data set, (ii) the row data of the second data set ranked first in the second similarity rank data set also has rank 1 in the first similarity rank data set, and (iii) the two first-ranked rows are the same row.

According to an embodiment of the present application, with respect to the first row data of the first data set, when a plurality of rows of the second data set have rank 1 in both the first similarity rank data set and the second similarity rank data set, the data matching unit may determine that the row data of the second data set corresponding to the first row data cannot be matched.

According to an embodiment of the present application, with respect to the first row data of the first data set, when the first row data and the second row data of the second data set both have rank 1 in the first similarity rank data set, the first row data of the second data set has rank 1 in the second similarity rank data set, and the second row data of the second data set does not have rank 1 in the second similarity rank data set, the data matching unit may determine that the first row data of the second data set matches the first row data of the first data set.
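The three decision rules above can be read as one check: for a given row of the first data set, a row of the second data set is accepted only if it is the single row holding rank 1 in both rank data sets. The sketch below illustrates that reading. The function name `decide_match` and the dictionary representation of ranks are hypothetical, and returning `None` when no row holds rank 1 in both sets is an assumption, since the text does not cover that case.

```python
from typing import Dict, Hashable, Optional


def decide_match(rank1: Dict[Hashable, int],
                 rank2: Dict[Hashable, int]) -> Optional[Hashable]:
    """Pick the second-data-set row matching one first-data-set row.

    rank1 / rank2 map a row id of the second data set to its rank in the
    first / second similarity rank data set (1 = most similar).
    Returns the matched row id, or None when the match is undecidable.
    """
    top_in_both = [row for row, rank in rank1.items()
                   if rank == 1 and rank2.get(row) == 1]
    if len(top_in_both) == 1:
        return top_in_both[0]   # unique rank-1 row in both sets: matched
    return None                 # several (or, by assumption, zero) candidates


# Tie at rank 1 in the first set, resolved by the second set.
print(decide_match({"b1": 1, "b2": 1}, {"b1": 1, "b2": 2}))  # b1
```

A `None` result corresponds to the "cannot be matched" outcome, which the apparatus is described as handing to the user for review.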

According to an embodiment of the present application, the data merging apparatus for big data analysis may further include a profiling unit that analyzes, through data analysis, the data types and string patterns of a plurality of column items of the first data set and the second data set, and a similar item analysis unit that analyzes similar items between the first data set and the second data set based on the analysis result of the profiling unit.
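As a rough illustration of what analyzing data types and string patterns might involve, the sketch below profiles one column of string values. Everything in it is an assumption made for illustration: the names `string_pattern` and `profile_column`, the three type labels, and the pattern encoding (digits become `9`, letters become `A`, other characters are kept) are not taken from the patent.

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def string_pattern(value: str) -> str:
    """Abstract a value into a shape: digits -> '9', letters -> 'A'."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():       # also true for Hangul syllables
            shape.append("A")
        else:
            shape.append(ch)     # keep separators such as ':' or '-'
    return "".join(shape)


def profile_column(values: List[str]) -> Dict[str, Optional[str]]:
    """Infer a coarse data type and the most common string pattern."""
    non_empty = [v for v in values if v]
    if non_empty and all(re.fullmatch(r"-?\d+", v) for v in non_empty):
        data_type = "integer"
    elif non_empty and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in non_empty):
        data_type = "decimal"
    else:
        data_type = "string"
    top = Counter(string_pattern(v) for v in non_empty).most_common(1)
    return {"type": data_type, "pattern": top[0][0] if top else None}


print(profile_column(["09:00", "18:30"]))  # {'type': 'string', 'pattern': '99:99'}
```

Two columns whose profiles agree (same type, same dominant pattern) would be natural candidates for the similar item analysis, although the text does not specify the comparison rule.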

According to an embodiment of the present application, the data merging unit may, based on the analysis result of the similar item analysis unit, merge the first data set and the second data set while excluding column items that would be duplicated when the column items of the two data sets are combined.
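A merge that drops duplicated column items could look like the sketch below. It assumes rows are plain dictionaries, that row matching has already produced index pairs, and that the similar item analysis has produced the set of second-data-set columns judged to duplicate first-data-set columns. The function name `merge_rows` and all parameter names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Row = Dict[str, Any]


def merge_rows(first_rows: List[Row],
               second_rows: List[Row],
               matches: Iterable[Tuple[int, int]],
               duplicate_columns: Set[str]) -> List[Row]:
    """Merge matched rows, skipping second-data-set columns marked duplicate."""
    merged = []
    for i, j in matches:
        row = dict(first_rows[i])            # start from the first data set's row
        for column, value in second_rows[j].items():
            if column not in duplicate_columns:
                row[column] = value          # add only the non-duplicated columns
        merged.append(row)
    return merged


first = [{"parking lot name": "A", "fee": 300}]
second = [{"parking lot": "A lot", "address": "Seoul"}]
print(merge_rows(first, second, [(0, 0)], {"parking lot"}))
# [{'parking lot name': 'A', 'fee': 300, 'address': 'Seoul'}]
```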

According to an embodiment of the present application, a data merging method for big data analysis may include: determining, from among a plurality of data sets in a database, a first data set and a second data set on which to perform data matching; performing a first matching similarity operation based on a first column of the first data set and of the second data set to derive a first similarity rank data set; performing a second matching similarity operation based on the first similarity rank data set to derive a second similarity rank data set; comparing the first similarity rank data set and the second similarity rank data set to match, for each data row of the first data set, the corresponding data row of the second data set; and merging the first data set and the second data set based on the matching result.

The means for solving the problem described above are merely illustrative and should not be construed as limiting the present application. In addition to the illustrative embodiments described above, further embodiments may exist in the drawings and in the detailed description of the invention.

According to the means for solving the problem described above, a plurality of data can be merged by selecting a plurality of column items from a plurality of data sets and performing a similarity operation on them.

In addition, according to the means for solving the problem described above, a plurality of data can be merged more accurately by providing matching information to the user for items that cannot be matched when matching a plurality of data sets.

In addition, according to the means for solving the problem described above, a plurality of data can be merged more accurately by performing the first matching similarity operation and the second matching similarity operation and then comparing the results of the two operations to perform the final data matching.

FIG. 1 is a block diagram schematically showing the configuration of a data merging apparatus according to an embodiment of the present application.
FIGS. 2A to 2D are schematic diagrams for explaining a first matching similarity operation according to an embodiment of the present application.
FIG. 3 is a schematic diagram for explaining a second matching similarity operation according to an embodiment of the present application.
FIG. 4 is a diagram showing example results of the first matching similarity operation and the second matching similarity operation according to an embodiment of the present application.
FIGS. 5A and 5B are schematic diagrams for explaining how data rows of the first data set and the second data set are matched by comparing the first similarity rank data set and the second similarity rank data set according to an embodiment of the present application.
FIGS. 6A and 6B are schematic diagrams for explaining an example of data analysis and similar item analysis according to an embodiment of the present application.
FIG. 7 is a schematic diagram for explaining an example of data merging for big data analysis according to an embodiment of the present application.
FIG. 8 is a flowchart showing a data merging method according to an embodiment of the present application.

Embodiments of the present application are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present application pertains can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described here. Parts of the drawings that are unrelated to the description are omitted for clarity, and like parts are given like reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram schematically showing the configuration of a data merging apparatus according to an embodiment of the present application; FIGS. 2A to 2D are schematic diagrams for explaining the first matching similarity operation; FIG. 3 is a schematic diagram for explaining the second matching similarity operation; FIG. 4 is a diagram showing example results of the first and second matching similarity operations; and FIGS. 5A and 5B are schematic diagrams for explaining how data rows of the first data set and the second data set are matched by comparing the first similarity rank data set and the second similarity rank data set.

Referring to FIG. 1, the data merging apparatus (100) may include a database (110), a data matching unit (120), a profiling unit (130), a similar item analysis unit (140), and a data merging unit (150). As another example, the data merging apparatus (100) may receive the data required for merging from an external server, but is not limited thereto.

The data merging apparatus (100) may determine a first data set and a second data set from among a plurality of data sets, select a plurality of column items from the column items of the first and second data sets, and compute the similarity between the selected items. It may match data rows of the first data set and the second data set based on the similarity. When a match is judged impossible during merging, the data merging apparatus (100) may present the matching determination to the user and merge the data based on the user's matching decision. The data merging apparatus (100) may merge the first data set and the second data set based on the result of the row matching.

According to an embodiment of the present application, the data merging apparatus (100) may merge data by matching the first data set and the second data set, selected from a plurality of data sets, using an artificial-intelligence-based data matching algorithm.

Conventional matching based on a similarity operation between single column items has the problem that both the first data and the second data of the second data set may be matched, redundantly, to the same first data. The data merging apparatus (100) instead performs data matching through a similarity operation across a plurality of column items, so that the first row data of the first data set is matched to the first row data of the second data set. Because the data merging apparatus (100) matches by analyzing the data similarity of a plurality of data sets, data can be merged without the involvement of data engineering experts or business staff. The data merging apparatus (100) may also improve matching accuracy by learning from and analyzing the growing volume of data in a big data environment and the data characteristics of a specific business domain.

In addition, for data items judged impossible to match, the data merging apparatus (100) may let the user select directly among the matching candidates and correct the result accordingly, thereby performing more accurate data matching and merging.

The database (110) may include a plurality of data sets used for big data merging. The database (110) may include unstructured data.

The data matching unit (120) may determine, from among the plurality of data sets in the database (110), the first data set (10) and the second data set (20) on which to perform data matching. According to an embodiment of the present application, the first data set (10) and the second data set (20) may be data sets that include a plurality of column items meeting a preset similarity level, based on the similarity operation result of the similar item analysis unit (140). According to another example, the first data set (10) and the second data set (20) may be data sets chosen by the user. The data matching unit (120) may determine the first data set (10) and the second data set (20), which contain the items that need to be merged, from among the plurality of data sets and perform the similarity operation on them.

The data matching unit (120) may select a plurality of column items from the column items of the first data set and the second data set and compute the similarity between the selected column items. It may use an artificial-intelligence-based algorithm for the similarity operation. According to an embodiment of the present application, the data matching unit (120) may perform the similarity operation using the Fuzzy Data Matching algorithm, but is not limited thereto. The Fuzzy Data Matching algorithm matches data using a result value computed from the edit distance (Levenshtein distance).

As an example, referring to FIGS. 2A and 2B, reference (a) in FIGS. 2A and 2B may be the first data set (10) and reference (b) may be the second data set (20). The first data set may include four column items (11 to 14), and the second data set may include four column items (21 to 24). The column items in each data set may be distinguished by a representative key. For example, the representative key of the first column item (11) of the first data set may be 'parking lot name', and the first column item (11) may be the set of data listed under that representative key. That is, in the first data set (10), the representative key of the first column item (11) is 'parking lot name', that of the second column item (12) is 'parking lot operation start time', that of the third column item (13) is 'parking lot operation end time', and that of the fourth column item (14) may be 'parking fee (per 10 minutes, in won)'. In the second data set (20), the representative key of the first column item (21) is 'parking lot', that of the second column item (22) is 'parking lot operation start time', that of the third column item (23) is 'parking lot operation end time', and that of the fourth column item (24) may be 'address'.

According to an embodiment of the present application, the data matching unit (120) may select the first to third column items (11 to 13) of the first data set (10) and the first to third column items (21 to 23) of the second data set (20) and compute the similarity between the selected column items. The number of column items selected from the first data set (10) must equal the number selected from the second data set (20). That is, the data matching unit (120) may choose the column items to be matched so that their counts agree. As another example, the user may select the column items to be matched.

The data matching unit (120) may select a representative column item from among the selected column items. As an example, referring to FIG. 2C, the data matching unit (120) may select the first column item (11) of the first data set (10) and the first column item (21) of the second data set (20) as the representative column items and perform the similarity operation on them. As another example, the representative column item may be selected by the user. When performing the similarity operation on the representative column items, the data matching unit (120) may perform data matching using a fuzzy algorithm, but is not limited thereto.

The data matching unit (120) may perform the first matching similarity operation based on the first columns (11 and 21) of the first data set (10) and the second data set (20) to derive the first similarity rank data set (30). For example, the first columns of the first data set and the second data set may be 'parking lot name' (11) and 'parking lot' (21). This is only one embodiment: the first columns are not limited to 'parking lot name' (11) and 'parking lot' (21), and the first columns (11 and 21) on which the first matching similarity operation is based may be chosen by the data matching unit (120) or by the user.

The data matching unit 120 may calculate the similarity between each row of the first data set 10 and every row of the corresponding column item of the second data set 20. The first matching similarity calculation may compare the first row data of the first column item 11 of the first data set 10 (for example, 'Training Center' (훈련원)) with the first column item 21 of the second data set 20. For example, referring to FIG. 2C, the first column item 11 of the first data set 10 has the representative key 'parking lot name', and the first column item 21 of the second data set 20 has the representative key 'parking lot'. The first entry of column item 11, 'Training Center', is compared with each entry of column item 21: 'Hanoe Building (Tower)', 'Seoul Station Parking Lot 2', 'Seoul Station Parking Lot 3', 'Cheonggyecheon Doosan We've the Zenith', 'Jeil Bank (Head Office)', 'Training Center Park' (훈련원공원), 'Sinbanghwa Station (Line 9 connection)', 'Jeil Parking Lot', and 'Golden Tower'. That is, the data matching unit 120 may compare each row of the first column item 11 of the first data set 10 with the first column item 21 of the second data set 20.

The first similarity ranking data set 30 may be a data set in which, for each row of the first column item 11 of the first data set 10, the rows of the first column item 21 of the second data set 20 are listed in descending order of similarity. The first similarity ranking data set 30 according to an embodiment of the present application may be described with reference to FIG. 2D. Column A of FIG. 2D holds each row of the first column item 11 of the first data set 10. The columns 'primary similarity Rank 1 of A' through 'primary similarity Rank 3 of A' hold the entries of the first column item 21 of the second data set 20, listed from the highest similarity result to the lowest. When similarity results are tied, for example when an entry of column item 11 (such as 'Seoul Station Parking Lot') matches several entries of column item 21 with the same score (such as 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3'), the data matching unit 120 may order the tied entries by their column index in the second data set 20 when deriving the first similarity ranking data set.

For example, the data matching unit 120 may order the similarity results for 'Training Center' in the first column item 11 of the first data set 10 against the rows of the first column item 21 of the second data set 20 as ('Training Center Park', 90, 5), ('Hanoe Building (Tower)', 0, 0), ('Seoul Station Parking Lot 2', 0, 1), and derive the first similarity ranking data set accordingly. Each entry of the first similarity ranking data set 30 includes the similarity result and the column index of the entry in the second data set 20. That is, the similarity between 'Training Center' in the first column item 11 and 'Training Center Park' is 90, and the column index of 'Training Center Park' in the second data set 20 is 5. When the similarity results between a row of the first column item 11 and several rows of the first column item 21 are equal, the entries may be listed in order of their column index in the second data set 20. For example, if 'Seoul Station Parking Lot' scores 93 against both 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3', and the column index of 'Seoul Station Parking Lot 2' is 1 while that of 'Seoul Station Parking Lot 3' is 2, the data matching unit 120 may list the results in the order of their column indexes in the second data set 20.
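The ranking rule described above (highest similarity first, ties broken by the column index in the second data set) can be sketched as a single sort. This is an illustrative sketch only: `first_similarity_ranking` and `lookup_scorer` are hypothetical names, and the scorer is a lookup table holding the scores quoted in the embodiment rather than a real fuzzy algorithm.

```python
def first_similarity_ranking(value, candidates, scorer, top_k=3):
    """Rank candidate entries of the second data set for one source value.

    Each result is a tuple (entry, similarity, column_index), following
    the tuple layout used in the description. Sorting by (-similarity,
    column_index) lists higher scores first and resolves ties by index.
    """
    scored = [(name, scorer(value, name), idx)
              for idx, name in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[1], item[2]))
    return scored[:top_k]


# Stand-in scorer: returns the scores quoted in the embodiment, 0 otherwise.
QUOTED_SCORES = {
    ("Seoul Station Parking Lot", "Seoul Station Parking Lot 2"): 93,
    ("Seoul Station Parking Lot", "Seoul Station Parking Lot 3"): 93,
}


def lookup_scorer(a, b):
    return QUOTED_SCORES.get((a, b), 0)


if __name__ == "__main__":
    second_column = [
        "Hanoe Building (Tower)",       # column index 0
        "Seoul Station Parking Lot 2",  # column index 1
        "Seoul Station Parking Lot 3",  # column index 2
    ]
    for row in first_similarity_ranking(
            "Seoul Station Parking Lot", second_column, lookup_scorer):
        print(row)
```

Keeping the column index inside each tuple means the tie-break needs no extra state, and the same tuples can later be used to fetch the remaining columns of the candidate row.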

The data matching unit 120 may derive a second similarity ranking data set 40 by performing a second matching similarity calculation based on the first similarity ranking data set 30. Here, the first similarity ranking data set 30 and the second similarity ranking data set 40 may include ranking information for each data row of the second data set 20. Both ranking data sets may list the data rows of the second data set 20 in order of their similarity values.

The data matching unit 120 may derive the second similarity ranking data set by comparing a plurality of column items of the first data set 10 and the second data set 20 based on the first similarity ranking data set 30. The data matching unit 120 derives the first similarity ranking data set 30 by performing the first matching similarity calculation based on the first columns 11 and 21, and then, following the similarity ranking of the first similarity ranking data set 30, may perform the second matching similarity calculation on the row data of the remaining column items selected from the first data set 10 and the second data set 20. Proceeding in the rank order of the first similarity ranking data set 30, the data matching unit 120 may perform the second matching similarity calculation by computing the similarity between the remaining column data of each row of the second data set 20 listed in the first similarity ranking data set 30 and the remaining column data of the corresponding row of the first data set 10.

For example, referring to FIG. 3, FIG. 3(a) may illustrate the process of deriving the second similarity ranking data set 40. In the second matching similarity calculation, the data matching unit 120 may calculate the similarity between column items based on the first columns 11 and 21 as shown at ① in FIG. 3(a), based on the second columns 12 and 22 as shown at ②, and based on the third columns 13 and 23 as shown at ③. In other words, in the first matching similarity calculation the data matching unit 120 calculates the similarity between the first row data 'Training Center' of the first column item 11 of the first data set 10 and the first column item 21 of the second data set 20, and the result may be derived as the first similarity ranking data set 30, as shown at ① in FIG. 3(a). The data matching unit 120 may then perform the second matching similarity calculation between the second column item 12 and third column item 13 located in the same row as 'Training Center' in the first data set 10, and the second column item 22 and third column item 23 located in the same rows as the first to third entries ('Training Center Park' through 'Seoul Station Parking Lot 2') of the first column item 21 of the second data set 20. The second matching similarity calculation includes, for each row of the first data set 10, a similarity calculation against all rows of the second data set 20.

When performing the second matching similarity calculation, the data matching unit 120 may (①) calculate the similarity between 'Training Center', a row of the 'parking lot name' column item of the first data set 10, and 'Training Center Park', a row of the 'parking lot' column item of the second data set 20; (②) calculate the similarity between the 'parking lot operation start time' value in the 'Training Center' row of the first data set 10 and the 'parking lot operation start time' value in the 'Training Center Park' row of the second data set 20; and (③) calculate the similarity between the 'parking lot operation end time' value in the 'Training Center' row of the first data set 10 and the 'parking lot operation end time' value in the 'Training Center Park' row of the second data set 20. The average of the results ①, ②, and ③ determines the second similarity ranking, from which the second similarity ranking data set 40 may be derived.

When deriving the second similarity ranking data set 40, the data matching unit 120 may include the calculation results between each entry of the first column item 11 and each row of the first similarity ranking data set 30. That is, referring to FIG. 3(b), the data matching unit 120 may find that the column index of 'Training Center Park' in the second data set 20 is 5, that the first matching similarity between 'Training Center' and 'Training Center Park' is 90, and that the second matching similarity (the average of the similarity results ①, ②, and ③) is 96.67. According to an embodiment of the present application, the second matching similarity value, computed over the first to third column items, is higher than the first matching similarity value, computed over the first column item alone, and comparing a plurality of column items allows more accurate data matching. The data matching unit 120 may match the data rows of the first data set and the second data set based on the similarity.
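The averaging step can be sketched as follows. The description gives only the first-column score (90) and the averaged result (96.67); the per-column scores of 100 for the start and end times are an inference that is consistent with those two numbers, since (90 + 100 + 100) / 3 ≈ 96.67. The function name, the dictionary keys, the time values, and the scorer are all hypothetical.

```python
from statistics import mean


def second_matching_score(row_a, row_b, column_pairs, scorer):
    """Average the per-column similarities of two rows.

    column_pairs lists (column in first data set, column in second data set).
    """
    scores = [scorer(row_a[col_a], row_b[col_b])
              for col_a, col_b in column_pairs]
    return round(mean(scores), 2)


def demo_scorer(a, b):
    # Stand-in scorer: identical values score 100; the name pair uses the
    # score quoted in the description; anything else scores 0.
    if a == b:
        return 100
    if (a, b) == ("Training Center", "Training Center Park"):
        return 90
    return 0


# Hypothetical rows: the operating times are illustrative values only.
row_first = {"parking lot name": "Training Center",
             "start": "09:00", "end": "18:00"}
row_second = {"parking lot": "Training Center Park",
              "start": "09:00", "end": "18:00"}
pairs = [("parking lot name", "parking lot"),
         ("start", "start"),
         ("end", "end")]

if __name__ == "__main__":
    print(second_matching_score(row_first, row_second, pairs, demo_scorer))
```

Because every selected column contributes equally to the mean, a candidate that agrees on the additional columns moves ahead of one that merely has a similar name.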

Referring to FIG. 4, FIG. 4(a) may show, among the results of the first matching similarity calculation based on the first columns 11 and 21 of the first data set 10 and the second data set 20, the results with the highest similarity. For example, the first similarity calculation between 'Training Center' in the first data set 10 and 'Training Center Park' in the second data set 20 may yield a similarity of 90%. When the first column 11 of the first data set 10 contains a name that overlaps several entries, for example 'Seoul Station Parking Lot', different entries of the first column 21 of the second data set 20 may be returned as matches, each with the same similarity value of 93%. That is, 'Seoul Station Parking Lot' in the first column 11 of the first data set 10 may be matched to a plurality of entries, 'Seoul Station Parking Lot 2' and 'Seoul Station Parking Lot 3'.

FIG. 4(b) may show the result of performing the second matching similarity calculation with the second column item added. Building on the result of the first matching similarity calculation based on the first columns 11 and 21 of the first data set 10 and the second data set 20, the similarity calculation may be performed again with the second columns 12 and 22 included. By calculating similarity over a plurality of columns, the data matching unit 120 may obtain a more accurate similarity result than a calculation over a single column.

Although FIG. 4 shows the first and second matching similarity calculations performed on the first column and the second column, a plurality of columns (for example, three) may be selected when performing the second matching similarity calculation. The columns described above may be selected by the data matching unit 120, but are not limited thereto; the column items of the first data set 10 and the second data set 20 used in the similarity calculation may also be determined by the user.

In an embodiment of the present application, in order to match more accurately the duplicate data that arises after the first matching similarity calculation, the second matching similarity calculation is performed based on the result of the first, and the results of the two calculations are compared to perform the final data matching. By comparing the first and second matching similarity results before the final data matching, the data matching unit 120 improves the accuracy of row matching between the first data set 10 and the second data set 20.

The data matching unit 120 may compare the first similarity ranking data set 30 with the second similarity ranking data set 40 and match the data rows of the first data set 10 and the second data set 20 based on the data with the highest similarity.

In the following description of FIG. 5A, a is the data at rank 1 in the first similarity ranking data set 30, and a' is the data at rank 1 in the second similarity ranking data set 40. N is the rank, within the second similarity ranking data set 40, of the data that is at rank 1 in the first similarity ranking data set 30. N' is the rank, within the first similarity ranking data set 30, of the data that is at rank 1 in the second similarity ranking data set 40.

Referring to FIG. 5A, according to an embodiment of the present application, the data matching unit 120 may compare N with N' in step S510. If N and N' are equal, the data matching unit 120 may compare a with a' in step S520. If N equals N' and a equals a', the data matching unit 120 may determine that the row of the second data set 20 at rank 1 in the first similarity ranking data set 30 matches the first row data of the first data set 10. More specifically, with respect to the first row data of the first data set 10 (for example, A1), suppose that the rank (N) in the second similarity ranking data set 40 of the second-data-set row at rank 1 in the first similarity ranking data set 30, and the rank (N') in the first similarity ranking data set 30 of the second-data-set row at rank 1 in the second similarity ranking data set 40, are both 1, and that the row (a) at rank 1 in the first similarity ranking data set 30 and the row (a') at rank 1 in the second similarity ranking data set 40 are the same. In that case, the data matching unit 120 may determine that the row of the second data set 20 at rank 1 in the first similarity ranking data set 30 matches the first row data of the first data set 10.

For example, referring to FIG. 5B(a), when N equals N' and a equals a', the data matching unit 120 may determine that 'Training Center' and 'Training Center Park' match. With respect to 'Training Center', 'Training Center Park' is at rank 1 in the first similarity ranking data set 30 and also at rank 1 in the second similarity ranking data set 40, and 'Training Center Park', being at rank 1 in the second similarity ranking data set 40, is at rank 1 in the first similarity ranking data set 30, so N and N' are equal. In addition, the rank-1 entry of the first similarity ranking data set 30 ('Training Center Park') and the rank-1 entry of the second similarity ranking data set 40 ('Training Center Park') are the same. The data matching unit 120 may therefore determine that 'Training Center' (the row of the first data set 10 containing 'Training Center') matches 'Training Center Park' (the row of the second data set 20 containing 'Training Center Park').

In addition, according to an embodiment of the present application, if N and N' are equal in step S510, the data matching unit 120 may compare a with a' in step S520. If a and a' are not the same in step S520, the data matching unit 120 may determine that no row of the second data set 20 can be matched to the first row data. In other words, with respect to the first row data of the first data set 10, when there are a plurality of rows of the second data set 20 for which both the rank in the second similarity ranking data set 40 of the row at rank 1 in the first similarity ranking data set 30, and the rank in the first similarity ranking data set 30 of the row at rank 1 in the second similarity ranking data set 40, are 1, the data matching unit 120 may determine that the first row data cannot be matched.

For example, referring to FIG. 5B(b), the data matching unit 120 may determine that matching is not possible when N and N' are both 1 but there are a plurality of candidates a and a' for which N and N' are 1. Comparing 'Officia' with the entries of the second data set 20, the first similarity ranking data set 30 may be listed as ('Officia 1', 89, 13), ('Officia 2', 89, 14), and so on, and the second similarity ranking data set 40 may be listed as ('Officia 1', 13, 89, 94.5), ('Officia 2', 14, 89, 94.5), and so on. The data matching unit 120 may determine that the rank-1 data in the first similarity ranking data set 30 are 'Officia 1' and 'Officia 2' (that is, a plurality of rows share first place in the first similarity ranking), and that the rank-1 data in the second similarity ranking data set 40 are likewise 'Officia 1' and 'Officia 2' (a plurality of rows share first place in the second similarity ranking). In FIG. 5B(b), RANK 1 and RANK 2 may denote the number of data entries rather than similarity ranks. Because a plurality of entries can be matched to 'Officia', the data matching unit 120 may present the result to the user in order to obtain a more accurate match, and the row of the second data set that matches the first row data of the first data set 10 may then be matched. That is, when a plurality of rows of the second data set 20 are at rank 1 in both the first and second similarity rankings for a specific row of the first data set 10, the data matching unit 120 cannot determine which row of the second data set 20 matches that row. According to an embodiment of the present application, when the row of the second data set matching the first row data of the first data set 10 cannot be determined, it may be determined through correction or selection by the user.

In addition, according to an embodiment of the present application, if N and N' are not equal in step S510, the data matching unit 120 may compare N with N' in step S530. If N is greater than or equal to N', the data matching unit 120 may determine that the rank-1 data of the second similarity ranking data set 40 matches the first row data of the first data set 10. In other words, with respect to the first row data of the first data set 10 (for example, A1), when a first row and a second row of the second data set 20 are both at rank 1 in the first similarity ranking data set 30, and the first row is at rank 1 in the second similarity ranking data set 40 while the second row is not, the data matching unit 120 may determine that the first row of the second data set 20 matches the first row data. The data matching unit 120 may calculate and compare the ranking information (N, N') of the row data (a, a').

For example, referring to FIG. 5B(c), when N and N' are not equal, the data matching unit 120 may compare the ranks N and N' to decide which data is matched. Suppose the first row data of the first data set is 'Seoul Station Parking Lot'. The data matching unit 120 may list the results for 'Seoul Station Parking Lot' in the first similarity ranking data set 30 as ('Seoul Station Parking Lot 2', 93, 1), ('Seoul Station Parking Lot 3', 93, 2), and so on, and may derive the second similarity ranking data set 40 by listing the results as ('Seoul Station Parking Lot 3', 2, 93, 97.67), ('Seoul Station Parking Lot 2', 1, 93, 84.33), and so on. The data matching unit 120 may determine that a is 'Seoul Station Parking Lot 2', a' is 'Seoul Station Parking Lot 3', N is 2, and N' is 1. That is, the data matching unit 120 determines that N and N' are not equal and that N is numerically larger than N', and may therefore determine that the rank-1 data of the second similarity ranking data set 40 ('Seoul Station Parking Lot 3') matches the first row data of the first data set ('Seoul Station Parking Lot'). The data matching unit 120 may calculate and compare the ranking information (N, N') of the row data (a, a').

In association with the first row data (for example, A1) of the first data set 10, when the first row data and the second row data of the second data set 20 both have rank 1 in the first similarity ranking data set 30 but their ranks differ in the second similarity ranking data set 40, the data matching unit 120 may determine that the row data of the second data set 20 with the higher similarity rank matches the first row data of the first data set 10.
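The tie-breaking rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `resolve_tie`, the row labels `B1` and `B2`, and the representation of each ranking as a mapping from row label to rank are all assumptions made for the example.

```python
from typing import Dict, Optional


def resolve_tie(first_rank: Dict[str, int],
                second_rank: Dict[str, int]) -> Optional[str]:
    """Pick the second-data-set row that matches one first-data-set row.

    first_rank / second_rank map a row label of the second data set to its
    rank in the first / second similarity ranking (hypothetical layout).
    """
    # Rows of the second data set that share rank 1 in the first ranking.
    tied = [row for row, rank in first_rank.items() if rank == 1]
    # Of those, keep the rows that are also rank 1 in the second ranking.
    best = [row for row in tied if second_rank.get(row) == 1]
    # A single survivor is the match; none or several means no decision.
    return best[0] if len(best) == 1 else None


# B1 and B2 tie in the first ranking; only B1 stays at rank 1 in the second.
print(resolve_tie({"B1": 1, "B2": 1}, {"B1": 1, "B2": 2}))  # B1
```

Returning `None` when several rows remain at rank 1 in both rankings mirrors the case in which the row is treated as not matchable.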

FIGS. 6A and 6B are schematic diagrams illustrating an example of data analysis and similar item analysis according to an embodiment of the present application.

Referring to FIG. 6A, the profiling unit 130 may analyze, through data analysis, the data types and string patterns of the plurality of column items of the first data set 10 and the second data set 20. The profiling unit 130 may analyze the data of the plurality of data sets included in the database 110 by data type and string pattern for each column item. The profiling unit 130 may also perform an analysis that structures unstructured data included in the database 110.
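As a rough illustration of per-column profiling, the sketch below infers a data type and a dominant string pattern for one column. The pattern alphabet (`9` for a digit, `A` for a letter), the two type labels, and the function names are assumptions for this example; the description does not specify how the profiling unit 130 encodes patterns.

```python
import re
from collections import Counter
from typing import Dict, List


def string_pattern(value: str) -> str:
    """Collapse a value into a pattern: digit -> '9', letter -> 'A'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # punctuation and spaces are kept as-is
    return "".join(out)


def profile_column(values: List[str]) -> Dict[str, str]:
    """Profile a non-empty column of string values."""
    patterns = Counter(string_pattern(v) for v in values)
    # Treat the column as numeric only if every value parses as a number.
    is_numeric = all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in values)
    return {
        "data_type": "number" if is_numeric else "string",
        "top_pattern": patterns.most_common(1)[0][0],
    }


print(profile_column(["09:00", "10:30"]))
# {'data_type': 'string', 'top_pattern': '99:99'}
```

Two columns whose profiles agree (for example, both dominated by `99:99`) would be natural candidates for the similar item analysis that follows.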

Referring to FIG. 6B, the similar item analysis unit 140 may analyze similar items between the first data set 10 and the second data set 20 based on the analysis result of the profiling unit 130. For example, the similar item analysis unit 140 may perform similarity analysis by analyzing the string patterns of the representative keys of the plurality of column items (11 to 14) included in the first data set 10 and of the representative keys of the plurality of column items (21 to 24) included in the second data set 20. The similarity between the representative key of 'parking lot operation start time' in the first data set 10 and the representative key of 'parking lot operation start time' in the second data set 20 may be calculated as 99%. This means that a column corresponding to 'parking lot operation start time' is included in both the first data set 10 and the second data set 20. The similar item analysis unit 140 may calculate the similarity between 'parking lot name' (주차장명) of the first data set 10 and 'parking lot' (주차장) of the second data set 20 as 80%. From the string pattern analysis of 'parking lot name' and 'parking lot', the similar item analysis unit 140 may determine that, although similar strings are used, the first data set 10 and the second data set 20 do not contain the same column item.

The data matching unit 120 and the similar item analysis unit 140 may perform the similarity analysis based on the Levenshtein distance.
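For reference, a Levenshtein-based similarity can be sketched as below. The distance function is the standard dynamic-programming formulation; the conversion to a percentage (one minus the distance divided by the longer string's length) is an assumption, since the description does not state how the distance is normalized, and it may not reproduce the percentages quoted in the examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in percent (assumed normalization, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return (1 - levenshtein(a, b) / longest) * 100


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("abcd", "abcf"))        # 75.0
```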

FIG. 7 is a schematic diagram illustrating an example of data merging for big data analysis according to an embodiment of the present application.

The data merging unit 150 may merge the first data set 10 and the second data set 20 based on the matching result of the data rows. In addition, based on the analysis result of the similar item analysis unit 140, the data merging unit 150 may merge the first data set 10 and the second data set 20 while excluding column items that are duplicated when the plurality of column items of the two data sets are merged. According to the matching result of the data rows, the data merging unit 150 may merge the column items using the matched row data of the first data set 10 and the second data set 20.

By way of example, referring to FIG. 7, the data merging unit 150 may merge the first data set 10 and the second data set 20 based on the column items needed for big data analysis. The data merging unit 150 may perform the merge by selecting, from the first data set 10, the column items parking lot name (11), parking lot operation start time (12), parking lot operation end time (13), and parking fee (14; per 10 minutes, in won), and, from the second data set 20, the column items parking lot (21), parking lot operation start time (22), parking lot operation end time (23), and address (24). By way of example, the data merging unit 150 may perform the merge based on column items selected by the user, but is not limited thereto.

Based on the analysis result of the similar item analysis unit 140, the data merging unit 150 may perform the data merge by selecting only one column item from each pair of column items duplicated across the first data set 10 and the second data set 20, namely parking lot operation start time (12 and 22) and parking lot operation end time (13 and 23).
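A merge that keeps a single copy of each duplicated column can be sketched as follows. The row representation (a list of dictionaries), the `matches` mapping from first-set row index to second-set row index, and all column names and values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def merge_rows(first_rows: List[Row], second_rows: List[Row],
               matches: Dict[int, int],
               duplicate_columns: Set[str]) -> List[Row]:
    """Merge matched rows, skipping second-set columns judged duplicates."""
    merged = []
    for i, row in enumerate(first_rows):
        out = dict(row)  # start from the first data set's columns
        j = matches.get(i)
        if j is not None:
            for col, val in second_rows[j].items():
                # Duplicated columns are taken from the first data set only.
                if col not in duplicate_columns:
                    out[col] = val
        merged.append(out)
    return merged


first = [{"name": "Lot A", "open": "09:00", "fee": 300}]
second = [{"lot": "Lot A-1", "open": "09:00", "address": "addr-1"}]
print(merge_rows(first, second, {0: 0}, {"open"}))
# [{'name': 'Lot A', 'open': '09:00', 'fee': 300, 'lot': 'Lot A-1', 'address': 'addr-1'}]
```

Rows of the first data set without a match are carried over unchanged. In this sketch, a second-set column that shares a name with a first-set column but is not listed as a duplicate would overwrite the first-set value.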

In this embodiment, the data merging unit 150 selects and merges the column items parking lot name (11), parking lot operation start time (12), parking lot operation end time (13), parking fee (14; per 10 minutes, in won), and address (24); however, data merging may be performed by selecting any plurality of column items from among the column items included in the first data set 10 and the second data set 20.

FIG. 8 is a flowchart illustrating a data merging method according to an embodiment of the present application. The data merging method shown in FIG. 8 is performed by the data merging device 100 described with reference to FIGS. 1 to 7. Accordingly, even where omitted below, the description of the data merging device 100 given with reference to FIGS. 1 to 7 also applies to FIG. 8.

In step S801, the data merging device 100 may determine, from among the plurality of data sets, the first data set 10 and the second data set 20 on which data matching is to be performed.

In step S802, the data merging device 100 may derive the first similarity ranking data set 30 by performing a first matching similarity operation based on the first column of the first data set 10 and the second data set 20.

In step S803, the data merging device 100 may derive the second similarity ranking data set 40 by performing a second matching similarity operation based on the first similarity ranking data set 30.

In step S804, the data merging device 100 may compare the first similarity ranking data set 30 and the second similarity ranking data set 40 to match each data row of the first data set 10 with a corresponding data row of the second data set 20.

In step S805, the data merging device 100 may merge the first data set 10 and the second data set 20 based on the matching result.
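Steps S801 to S805 can be read as a two-pass match followed by a merge. The sketch below is one possible reading under stated assumptions, not the claimed method: the two scoring functions are supplied by the caller as hypothetical stand-ins for the first and second matching similarity operations, and the second pass is applied only to the candidates that lead the first pass.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Score = Callable[[Row, Row], float]


def match_and_merge(first: List[Row], second: List[Row],
                    first_score: Score, second_score: Score) -> List[Row]:
    """S801: `first` and `second` are the two data sets chosen for matching."""
    merged: List[Row] = []
    for row in first:
        match = None
        if second:
            # S802: score every second-set row and keep the leaders.
            scores = [first_score(row, cand) for cand in second]
            top = max(scores)
            leaders = [c for c, s in zip(second, scores) if s == top]
            # S803: re-score the leaders with the second operation.
            rescored = [second_score(row, c) for c in leaders]
            best = max(rescored)
            winners = [c for c, s in zip(leaders, rescored) if s == best]
            # S804: accept the match only if a single candidate remains.
            if len(winners) == 1:
                match = winners[0]
        # S805: merge; values from the first data set win on shared columns.
        merged.append({**match, **row} if match is not None else dict(row))
    return merged
```

A caller might pass a name-based similarity as `first_score` and a score over additional columns as `second_score`; when both passes leave several candidates tied, the row is carried over unmatched.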

The foregoing description of the present application is provided for illustration, and those of ordinary skill in the art to which the present application pertains will understand that it can be readily modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present application is indicated by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

10: first data set 20: second data set
30: first similarity ranking data set 40: second similarity ranking data set
100: data merging device
110: database
120: data matching unit
130: profiling unit
140: similar item analysis unit
150: data merging unit

Claims

A data merging device for big data analysis, comprising:
a database comprising a plurality of data sets;
a data matching unit that determines a first data set and a second data set on which to perform data matching from among the plurality of data sets included in the database, selects a representative column item from among a plurality of column items included in the first data set and the second data set, performs a similarity operation that analyzes string patterns between each row datum of the selected representative column item of the first data set and all data of the representative column item of the second data set, and matches data rows of the first data set and the second data set based on the similarity; and
a data merging unit that, based on a similarity analysis result obtained by analyzing string patterns of representative keys of the plurality of column items included in the first data set and representative keys of the plurality of column items included in the second data set, excludes duplicate column items when merging the plurality of column items of the first data set and the second data set, and merges the column items of the first data set and the second data set using matched row data of the first data set and the second data set according to a matching result that includes similarity ranks of the data rows of the first data set and the second data set.

The data merging device of claim 1,
wherein the data matching unit
derives a first similarity ranking data set by performing a first matching similarity operation based on a first column of the first data set and the second data set, derives a second similarity ranking data set by performing a second matching similarity operation based on the first similarity ranking data set,
and compares the first similarity ranking data set and the second similarity ranking data set to match data rows of the first data set and the second data set based on the data having the highest similarity.

The data merging device of claim 2,
wherein the first similarity ranking data set and the second similarity ranking data set include rank information for each data row of the second data set.

The data merging device of claim 3,
wherein the data matching unit,
in association with first row data of the first data set, when the rank in the second similarity ranking data set of the row data of the second data set positioned at rank 1 of the first similarity ranking data set and the rank in the first similarity ranking data set of the row data of the second data set positioned at rank 1 of the second similarity ranking data set are both 1, and the row data of the second data set positioned at rank 1 of the first similarity ranking data set coincides with the row data of the second data set positioned at rank 1 of the second similarity ranking data set, determines that the row data of the second data set positioned at rank 1 of the first similarity ranking data set matches the first row data.

The data merging device of claim 3,
wherein the data matching unit,
in association with first row data of the first data set, when the rank in the second similarity ranking data set of the row data of the second data set positioned at rank 1 of the first similarity ranking data set and the rank in the first similarity ranking data set of the row data of the second data set positioned at rank 1 of the second similarity ranking data set are both 1 and there are a plurality of such row data of the second data set, determines that the row data of the second data set to be matched with the first row data cannot be matched.

The data merging device of claim 3,
wherein the data matching unit,
in association with first row data of the first data set, when first row data and second row data of the second data set both have rank 1 in the first similarity ranking data set, the first row data of the second data set has rank 1 in the second similarity ranking data set, and the second row data of the second data set does not have rank 1 in the second similarity ranking data set, determines that the first row data of the second data set matches the first row data of the first data set.

The data merging device of claim 1, further comprising:
a profiling unit that analyzes, through data analysis, data types and string patterns of the plurality of column items of the first data set and the second data set; and
a similar item analysis unit that analyzes similar items between the first data set and the second data set based on an analysis result of the profiling unit.

delete

A data merging method for big data analysis performed by a data merging device, the method comprising:
determining a first data set and a second data set on which to perform data matching from among a plurality of data sets included in a database;
deriving a first similarity ranking data set by selecting a representative column item from among a plurality of column items included in the first data set and the second data set and performing a first matching similarity operation that analyzes string patterns between each row datum of the selected representative column item of the first data set and all data of the representative column item of the second data set;
deriving a second similarity ranking data set by performing, based on the first similarity ranking data set, a second matching similarity operation that compares string patterns of all data of the selected representative column item of the first data set and of the representative column item of the second data set;
comparing the first similarity ranking data set and the second similarity ranking data set to match each data row of the first data set with a corresponding data row of the second data set; and
based on a similarity analysis result obtained by analyzing string patterns of representative keys of the plurality of column items included in the first data set and representative keys of the plurality of column items included in the second data set, excluding duplicate column items when merging the plurality of column items of the first data set and the second data set, and merging the column items of the first data set and the second data set using matched row data of the first data set and the second data set according to a matching result that includes similarity ranks of the data rows of the first data set and the second data set.