KR102110350B1

KR102110350B1 - Domain classifying device and method for non-standardized databases

Info

Publication number: KR102110350B1
Application number: KR1020190087748A
Authority: KR
Inventors: 황덕열; 공성원; 김세경
Original assignee: (주)위세아이텍
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2020-06-08

Abstract

The present invention relates to a domain discrimination device for a non-standardized database. The domain discrimination device for non-standardized database comprises: a database containing a plurality of non-standardized data; a derived variable generation unit for generating a derived variable using a plurality of non-standardized data included in the database; a model learning unit that generates a domain discrimination model using a derived variable data set; a domain discrimination unit for determining a domain for a first data set using the domain discrimination model; and a data result unit that stores data in the database by linking the data included in the first data set with the domain discrimination result determined by the domain discrimination unit.

Description

Domain determination device and method for non-standardized database {DOMAIN CLASSIFYING DEVICE AND METHOD FOR NON-STANDARDIZED DATABASES}

The present application relates to a domain discrimination device and method for non-standardized databases.

In modern society the value of big data has risen, and it is being used to create new value not only in IT but also in public institutions, law, medicine, finance, and other fields. In step with this trend, the government has begun investing heavily in big-data-related industries. In particular, since 2017 public data has been opened up to promote the Fourth Industrial Revolution.

Creating new value from big data presupposes highly reliable data. Analysis based on low-reliability data not only runs into problems during the analysis itself but can also lead to errors in decisions made from its results. For this reason, the private sector worldwide spends more than USD 600 billion a year on securing data reliability and quality, and research on indicators for evaluating the level of quality management is under way. In Korea, investment to raise the quality of public data is being made chiefly through public institutions.

Data quality issues can affect both value creation through data mining and the artificial intelligence industry as a whole. Data quality is defined as "the level at which the currency, accuracy, and interconnectivity of data are secured and the data can be used to deliver useful value to users." The purpose of data quality diagnosis is to maintain and improve data quality systematically and continuously.

Domain classification has been carried out by hand, with a user checking the data item by item, so human errors occurred frequently and the work took a long time. A domain discrimination device (No. 10-2018-0097892) was proposed to solve this problem, but that device could be applied only to standardized databases and used an algorithm prone to overfitting.

The background art of the present application is disclosed in Korean Registered Patent Publication No. 10-1930034.

The present application is intended to solve the problems of the prior art described above. One object is to provide a domain discrimination device and method that generate derived variables for each column from the tables of the database to be examined and automatically determine the domain of the data using a machine learning algorithm.

A further object is to provide a stable domain discrimination device and method by using a random forest algorithm in place of a decision tree algorithm, which suffers from overfitting.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means of achieving the above objects, a domain discrimination device for a non-standardized database according to an embodiment of the present application may include: a database containing a plurality of non-standardized data; a derived variable generation unit that generates derived variables using the non-standardized data in the database; a model learning unit that generates a domain discrimination model using a derived variable data set; a domain discrimination unit that determines the domain of a first data set using the domain discrimination model; and a data result unit that links the data in the first data set with the domain discrimination result determined by the domain discrimination unit and stores them in the database.

According to an embodiment of the present application, the derived variable generation unit may generate derived variables by extracting characteristic values or representative values of the non-standardized data in the database.

According to an embodiment of the present application, the derived variable generation unit may generate, from the non-standardized data in the database, derived variables corresponding to at least one of: data type, maximum data length, minimum data length, data length variation, decimal-place length, date format flag, contact format flag, blank ratio, line-break flag, English-only flag, numeric-only flag, hundreds-or-below ratio, grouping ratio, and primary key (PK) flag.

According to an embodiment of the present application, the model learning unit may generate the domain discrimination model by applying to an artificial intelligence algorithm a derived variable data set in which the derived variables produced by the derived variable generation unit are labeled with domains.

According to an embodiment of the present application, the domain discrimination unit may use the domain discrimination model to determine the domain of the first data set as at least one of number, amount, code, quantity, date, content, rate, name, flag, and contact.

According to an embodiment of the present application, the data result unit may analyze how strongly each generated derived variable influenced the domain discrimination result.

According to an embodiment of the present application, the domain discrimination device for a non-standardized database may further include a data providing unit that provides a user terminal with selection items related to domain discrimination input information, and a user input receiving unit that receives the domain discrimination input information from the user terminal.

According to an embodiment of the present application, a domain discrimination method for a non-standardized database may include: generating derived variables using a plurality of non-standardized data in a database; generating a domain discrimination model using a derived variable data set; determining the domain of a first data set using the domain discrimination model; and linking the data in the first data set with the determined domain discrimination result and storing them in the database.

The means described above are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, further embodiments may exist in the drawings and the detailed description of the invention.

According to the means described above, the domain discrimination device and method for non-standardized databases automate with tooling the work that data architecture engineers previously performed by hand. This can minimize the personnel required and can make continuous data quality management feasible without the intervention of data engineering experts.

According to the means described above, automatically determining the domain of data in a big data environment can prevent data quality management from going wrong because of engineer mistakes and similar errors.

According to the means described above, derived variables are generated from the data in the database and an automated model is trained on them regardless of whether the database is standardized, which broadens the range of databases to which the device can be applied.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a schematic configuration diagram of a domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 2 is an example of a data set extracted as derived variables by the domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 3 is a table showing some of the query statements used to extract derived variables in the domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 4 is a diagram illustrating the principle of the random forest model, one of the supervised learning algorithms of the domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 5 is a diagram showing an example of a confusion matrix result that can be viewed in the data result unit of the domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 6 is a diagram schematically showing a domain discrimination result of the domain discrimination device for a non-standardized database according to an embodiment of the present application.
FIG. 7 is an operation flowchart of a domain discrimination method for a non-standardized database according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the application pertains can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described here. Parts unrelated to the description are omitted from the drawings for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be positioned "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

For convenience, the domain discrimination device for a non-standardized database according to an embodiment of the present application is referred to below as the domain discrimination device 100. In addition, the two Korean spellings 칼럼 and 컬럼 used interchangeably in the original text both denote a column, meaning the information located in each column of a table.

The domain described below generally refers to the characteristic of a column, the smallest unit of data assigned when a database is designed. Applying data business rules that conform to the domain helps maintain data integrity. In other words, a domain is the intrinsic property applied to each column at the table design stage, and can be regarded as the definition of the characteristics of a column, the smallest unit of data that the database manages.

According to an embodiment of the present application, the domain discrimination device 100 may be a device that, in order to classify the domains of columns in a database that has not been standardized, generates derived variables from the data of each column and determines the domain using a domain discrimination model and the data set built from those derived variables.

In addition, the domain discrimination device 100 may perform prior learning through artificial intelligence based on the derived variables and domains of columns in a database whose domains have already been determined, and may then determine domains for the derived variable data set of the database to be examined on the basis of the learning result. In this way the domain discrimination device 100 may determine the domain of each column in the database to be examined.
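The prior-learning step can be pictured with a small sketch. The specification names a random forest; to stay short and dependency-free, the code below is not a full random forest but a simplified stand-in that keeps its three core ideas: bootstrap sampling, a randomly chosen feature per learner, and majority voting. Each learner is a single-split decision stump rather than a full tree. The feature layout (`[max_length, digits_only]`), the labels, and all function names are assumptions made for illustration only.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, rng):
    """Fit a one-split classifier on one randomly chosen feature."""
    feature = rng.randrange(len(X[0]))
    overall = majority(y, None)
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        l_lab, r_lab = majority(left, overall), majority(right, overall)
        errors = (sum(lab != l_lab for lab in left)
                  + sum(lab != r_lab for lab in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, l_lab, r_lab)
    return best[1:]  # (feature, threshold, left_label, right_label)

def fit_ensemble(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample of the labeled rows."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return stumps

def predict(stumps, row):
    """Majority vote over all stumps."""
    votes = [l if row[f] <= t else r for f, t, l, r in stumps]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labeled derived-variable rows: [max_length, digits_only]
X = [[200, 0], [350, 0], [500, 0], [1, 1], [1, 1], [2, 1]]
y = ["content", "content", "content", "flag", "flag", "flag"]
model = fit_ensemble(X, y)
print(predict(model, [400, 0]), predict(model, [1, 1]))
```

Averaging many weak learners trained on different resamples is what makes the ensemble less sensitive to any single training set than one deep decision tree, which is the overfitting concern the specification raises.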

In addition, the domain discrimination device 100 may classify the domain of each column in the database into one of ten domains: number, amount, code, quantity, date, content, rate, name, flag, and contact. By storing and learning the determined domain types, it can use them to determine the domains of data sets whose domains have not yet been determined. Table 1 summarizes the domain classes determined by the domain discrimination device 100, examples of each, and the checks performed.

| Domain class | Examples | Check performed |
|---|---|---|
| Number | Resident registration number, business registration number, postal code, customer number, account number | Diagnosis of the pattern and check bit of number data |
| Amount | Amount, tax, price, unit price, cost, fee, balance, total | Diagnosis of the allowable range of amount data |
| Name | Name, address, ID, place, customer name, English customer name, URL, email, IP | Diagnosis of the pattern and length of name data |
| Quantity | Case count, sheet count, round, item count, distance, scale, length, weight, speed, frequency, floor size (pyeong), area, temperature | Diagnosis of the allowable range of quantity data |
| Flag | Yes/no, presence/absence, category, status | Diagnosis of the standard defined values of classification data |
| Date | Year-month, year, year-month-day, hour, minute, second, day, half-year, quarter | Diagnosis of the allowable range and valid values of date data |
| Rate | Interest rate, yield, ratio, exchange rate, percentage | Diagnosis of the allowable range of rate (%) data |
| Content | Content, remarks, description, information, summary | Diagnosis of the applied-language pattern of content data |
| Contact | Address, telephone number, email | Diagnosis of the pattern of contact data |
| Code | Classification, type | Diagnosis of duplication and repetition in code data |
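As a rough illustration of how the ten domain classes of Table 1 might be represented in software, the sketch below encodes them as an enumeration and pairs each with a short paraphrase of its check. The names `Domain`, `DOMAIN_CHECKS`, and `check_for` are hypothetical and do not come from the specification.

```python
from enum import Enum

class Domain(Enum):
    """The ten domain classes listed in Table 1 (identifiers are illustrative)."""
    NUMBER = "number"
    AMOUNT = "amount"
    NAME = "name"
    QUANTITY = "quantity"
    FLAG = "flag"
    DATE = "date"
    RATE = "rate"
    CONTENT = "content"
    CONTACT = "contact"
    CODE = "code"

# Short paraphrases of the "check performed" column of Table 1.
DOMAIN_CHECKS = {
    Domain.NUMBER: "pattern and check bit",
    Domain.AMOUNT: "allowable range",
    Domain.NAME: "pattern and length",
    Domain.QUANTITY: "allowable range",
    Domain.FLAG: "standard defined values",
    Domain.DATE: "allowable range and valid values",
    Domain.RATE: "allowable range",
    Domain.CONTENT: "applied-language pattern",
    Domain.CONTACT: "pattern",
    Domain.CODE: "duplication and repetition",
}

def check_for(domain: Domain) -> str:
    """Return the quality check associated with a determined domain."""
    return DOMAIN_CHECKS[domain]

print(check_for(Domain.DATE))
```

Keeping the domain list in one place reflects the point that the set of domains can be extended or reduced, as long as the training labels are updated to match.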

According to an embodiment of the present application, the domain discrimination device 100 uses the ten domain types listed in Table 1, but the set of domains may be expanded, reduced, or modified according to user requirements or the nature of the business. Doing so requires correcting the labels of the derived variable data set used in prior learning. The domain discrimination device 100 may also automatically determine or learn new domain types in a big data environment on the basis of artificial intelligence, which enables quality management of the domains. More specifically, prior learning of a machine learning model is performed on the derived variables and domains of columns in a database whose domains have already been determined, and the domains of the data in the database to be examined are then determined on the basis of the prior learning result.

In general, data quality management may refer to the procedure of measuring the quality of the data held in an information system currently in operation or under management, evaluating its current level, and analyzing the causes of quality degradation. It may mean diagnosing data quality by analyzing data values against the tables, columns, codes, relationships, business rules, and so on of the operational database, and may refer to the procedure of applying quality criteria related to data values to derive error details and analyze the causes of the errors.

FIG. 1 is a schematic configuration diagram of a domain discrimination device for a non-standardized database according to an embodiment of the present application.

Referring to FIG. 1, the domain discrimination device 100 may include a database 110, a derived variable generation unit 120, a model learning unit 130, a domain discrimination unit 140, a data result unit 150, a data providing unit 160, and a user input receiving unit 170.

The database 110 may contain a plurality of non-standardized data. In other words, the database 110 may include the database holding the data that is the target of domain discrimination (the database to be examined). Here, non-standardized data is data on which standardization has not been carried out.

The database 110 may also include a database that stores the derived variables generated by the derived variable generation unit 120 and the domain results determined by the domain discrimination unit 140. The database 110 may link the data for domain discrimination, the derived variables, and the domain discrimination results and store them as a single database. The database 110 may further store, for each column of the database to be examined, the derived variables generated and the domain of that column, together with the domain discrimination result for the target database (for example, the first data set).

By way of example, the database 110 may include a plurality of data sets containing numeric data and/or character data. The database may contain unstructured data, meaning information that has no predefined data model or is not organized in a predefined manner. Unlike numeric data, which has a fixed format, unstructured data such as pictures, video, and documents varies in form and structure. A data set may include, for example, a collection of data derived from the medical field, the financial field, and the like. A data set may store data in tabular form, as rows and columns. The column items in a data set may be divided into representative keys and general columns (data). A representative key is an item that represents the column items; referring to FIG. 6 by way of example, items such as POLY_NO, CUST_ID, and CUST_ROLE under the column name may be called representative keys, although the representative keys are not limited to these. A data set may include a plurality of column items, which may be divided into a plurality of derived variables, a domain, and so on. Each column item is stored in the first row, and the data belonging to each column item is stored in the rows of that column.

According to an embodiment of the present application, the derived variable generation unit 120 may generate derived variables using the non-standardized data in the database 110. In other words, it generates derived variables from the data of the database to be examined. Based on the columns in the database 110, the derived variable generation unit 120 may generate and store the derived variables needed for domain discrimination.

The derived variable generation unit 120 may also generate derived variables by extracting characteristic values or representative values of the non-standardized data in the database 110. For example, it may run query statements against the database to be examined to extract the characteristic or representative values of the data in a column. The derived variable generation unit 120 may store the extracted values in the derived variable table in the database 110. For a column whose domain has already been determined, it may store the domain as well.

FIG. 2 is an example of a data set extracted as derived variables by the domain discrimination device for a non-standardized database according to an embodiment of the present application.

Referring to FIG. 2, the derived variable data set 1 may include a plurality of derived variable column items 12 and a domain column item 13. The derived variable generation unit 120 may store the derived variables under the corresponding derived variable column items 12. When domain data exists, the derived variable generation unit 120 may store it in the domain column item 13.
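One way to picture a row of the derived variable data set is as a record holding the derived variable values for one source column, plus an optional domain label. The sketch below is only an assumed representation; the class name, field names, and sample values are hypothetical and are not taken from FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class DerivedVariableRow:
    """One row of the derived variable data set: one row per source column."""
    column_name: str
    derived: Dict[str, float] = field(default_factory=dict)  # derived variable -> value
    domain: Optional[str] = None  # filled only when the domain is already known

def split_by_label(
    rows: List[DerivedVariableRow],
) -> Tuple[List[DerivedVariableRow], List[DerivedVariableRow]]:
    """Separate rows usable for training (labeled) from rows awaiting discrimination."""
    labeled = [r for r in rows if r.domain is not None]
    unlabeled = [r for r in rows if r.domain is None]
    return labeled, unlabeled

rows = [
    DerivedVariableRow("CUST_ID", {"max_length": 10, "is_pk": 1}, domain="number"),
    DerivedVariableRow("MEMO", {"max_length": 500, "is_pk": 0}),
]
labeled, unlabeled = split_by_label(rows)
print(len(labeled), len(unlabeled))
```

Making the domain optional mirrors the text: labeled rows feed model training, while unlabeled rows are the ones whose domain still has to be determined.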

FIG. 3 is a table showing some of the query statements used to extract derived variables in the domain discrimination device for a non-standardized database according to an embodiment of the present application.

According to an embodiment of the present application, the derived variable generation unit 120 may generate, from the non-standardized data in the database 110, derived variables corresponding to at least one of: data type, maximum data length, minimum data length, data length variation, decimal-place length, date format flag, contact format flag, blank ratio, line-break flag, English-only flag, numeric-only flag, hundreds-or-below ratio, grouping ratio, and primary key (PK) flag. These derived variables were selected on the criterion of whether a user could determine the domain of a data column from them.

As one example, the derived variable generation unit 120 may use SQL queries against the database to be examined to generate at least one of these derived variables and store it in the database 110.

Referring to FIG. 3 as an example, the derived variable generation unit 120 may compose SQL statements to generate the derived variables. FIG. 3 shows example SQL statements for only three derived variables, namely maximum length, whether the data length varies, and whether the data is in contact format, but the statements are not limited to these. The queries for these three variables may also be written differently, provided they produce the same results.
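The three derived variables named for FIG. 3 can be illustrated with a minimal sketch. The queries below are not the ones shown in FIG. 3; they are one possible formulation, run through Python's standard `sqlite3` module. The table name `customer`, the column name `contact`, and the sample rows are hypothetical, and the contact-format check assumes that the presence of `@` or `-` in any value is enough to set the flag.

```python
import sqlite3

# Hypothetical in-memory table standing in for one column of the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (contact TEXT)")
conn.executemany(
    "INSERT INTO customer (contact) VALUES (?)",
    [("02-123-4567",), ("kim@example.com",), ("unknown",)],
)

# One query producing three derived variables for the column (assumed formulation):
#   max_length     - longest actual value in the column
#   length_varies  - 1 if the actual lengths differ, else 0
#   contact_format - 1 if any value contains '@' or '-', else 0
DERIVED_VARIABLE_SQL = """
SELECT
    MAX(LENGTH(contact)) AS max_length,
    CASE WHEN MIN(LENGTH(contact)) = MAX(LENGTH(contact))
         THEN 0 ELSE 1 END AS length_varies,
    CASE WHEN SUM(CASE WHEN contact LIKE '%@%' OR contact LIKE '%-%'
                       THEN 1 ELSE 0 END) > 0
         THEN 1 ELSE 0 END AS contact_format
FROM customer
"""

max_length, length_varies, contact_format = conn.execute(DERIVED_VARIABLE_SQL).fetchone()
derived = {
    "max_length": max_length,
    "length_varies": length_varies,
    "contact_format": contact_format,
}
print(derived)
```

Computing the variables inside the database keeps the raw column data where it is; only one small row of derived values per column has to be stored or passed on to the model.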

Table 2 lists the types of derived variables generated by the derived variable generation unit 120 and their descriptions.

Variable | Description
Data type | Variable that distinguishes the kind of data value, such as INT, CHAR, or VARCHAR
Maximum data length | Length of the longest data value in the column
Minimum data length | Length of the shortest data value in the column
Data length change | Whether the length of the data in the column varies
Length below decimal point | Number of digits below the decimal point in the column's data
Date format | Whether the data is in a date format, regardless of its data type
Contact format | Whether the column contains data matching patterns used in contact details and addresses, such as @ or -
Blank ratio | Proportion of the entire data occupied by blanks
Line break included | Whether a line break occurs in the data in the column
Written in English | Whether the data is written only in English
Written in numbers | Whether the data is written only in digits
Ratio at or below the hundreds unit | Proportion of data in the column whose digits at the hundreds unit and below are written as 000
Grouping ratio | Proportion of data in the column that can be grouped
PK status | Whether the column is set as the primary key

However, the derived variables described in Table 2 are only one embodiment, and the derived variables are not limited to them. That is, the derived variables may include other representative values or characteristic values, not specified herein, that can be extracted from the data in a column. According to an embodiment of the present application, the data type among the derived variables determined by the derived variable generation unit 120 may mean the type of data stored in a column of the target database. For example, the data may be classified as numeric, character, or string, and stored as 'NUMBER', 'CHAR', or 'VARCHAR'.

In addition, according to an embodiment of the present application, the minimum data length, maximum data length, and data length change among the derived variables determined by the derived variable generation unit 120 reflect the actual length of the data stored in a column of the target database. That is, rather than the length declared for the data type in the schema, the unit identifies and stores the lengths actually used by the data and whether the length varies within the column.
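As a rough illustration of measuring the actual data rather than the schema, the sketch below computes a handful of the Table 2 variables from the values of one column. The function name `derive_column_features` and the exact definitions are assumptions made for this example: the blank ratio is taken as space characters divided by total characters, and "written in English" is taken to mean ASCII letters and spaces only. The patent does not specify these details.

```python
import re


def derive_column_features(values):
    """Compute a few Table 2-style derived variables for one column (assumed definitions)."""
    texts = [str(v) for v in values]
    if not texts:
        raise ValueError("column has no values")
    lengths = [len(t) for t in texts]
    total_chars = sum(lengths)
    # Digits after the decimal point, for values that look like plain decimals.
    decimals = [len(t.split(".", 1)[1]) for t in texts if re.fullmatch(r"-?\d+\.\d+", t)]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "length_varies": min(lengths) != max(lengths),
        "blank_ratio": sum(t.count(" ") for t in texts) / total_chars if total_chars else 0.0,
        "has_newline": any("\n" in t for t in texts),
        "english_only": all(re.fullmatch(r"[A-Za-z ]+", t) for t in texts),
        "digits_only": all(re.fullmatch(r"[0-9]+", t) for t in texts),
        "decimal_length": max(decimals, default=0),
    }


features = derive_column_features(["12.50", "7.125", "300"])
print(features)
```

Each column of the target database would yield one such record, and the collection of records forms the derived variable data set that the model is trained on.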

According to an embodiment of the present application, the model learning unit 130 may generate a domain discrimination model using the derived variable data set 1. In other words, the model learning unit 130 may generate the domain discrimination model by applying an artificial intelligence algorithm to the derived variable data set 1, in which the derived variables generated by the derived variable generation unit 120 are labeled with domains. Here, the artificial intelligence algorithm may be a random forest algorithm, but it is not limited thereto and may be replaced by at least one other classification algorithm. As an example, the model learning unit 130 may generate the domain discrimination model by training a random forest model on the domain-labeled derived variables stored in the database 110.

FIG. 4 is a diagram illustrating the principle of the random forest model, one of the supervised learning algorithms of the domain discrimination device for a non-standardized database according to an embodiment of the present application.

In machine learning, a random forest is an ensemble learning method used for classification, regression analysis, and similar tasks. It operates by outputting the class (for classification) or the mean prediction (for regression) obtained from a number of decision trees constructed during training.

Referring to FIG. 4 as an example, a random forest builds n decision trees, where n is set by the user, and classifies by having the trees vote, inside the algorithm, on the values each tree has determined. For this reason, the model learning unit 130 can provide the user with the probability of being correct for each classification target (for example, each domain).
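The voting step described above can be sketched in a few lines. This is only an illustration of majority voting over per-tree predictions, not a full random forest; the function name `vote` and the sample votes are hypothetical. Some library implementations average per-tree class probabilities instead of counting hard votes, so the probabilities here should be read as vote shares.

```python
from collections import Counter


def vote(tree_predictions):
    """Return the majority domain and each domain's share of the tree votes."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    probabilities = {label: n / total for label, n in counts.items()}
    # On a tie, max() keeps the label that was encountered first.
    winner = max(probabilities, key=probabilities.get)
    return winner, probabilities


# Five hypothetical trees classifying the same column.
winner, probabilities = vote(["date", "date", "code", "date", "number"])
print(winner, probabilities)
```

The share of votes for the winning domain is what lets the device report a probability alongside each recommended domain.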

According to an embodiment of the present application, the model learning unit 130 may train the artificial intelligence algorithm using only data sets that are labeled with domains. For an unlabeled data set, the domain may be determined automatically by the domain discrimination unit 140. Alternatively, for an unlabeled data set, the domain may be determined based on user input information provided by the user input receiving unit 170.

In addition, the model learning unit 130 may improve the domain discrimination model by retraining it with the results of discriminating a new data set (for example, the first data set).

By way of example, the domains learned and predicted by the model learning unit 130 and the domain discrimination unit 140 may include number, amount, code, count, date, content, rate, name, flag, and contact, but are not limited thereto. The set of domains may be reduced or extended according to the user's convenience and the type of business.

According to an embodiment of the present application, the domain discrimination unit 140 may determine the domain of a first data set using the domain discrimination model. Here, the first data set may be a data set that was not used when the domain discrimination model was generated. The first data set may also be a derived variable data set that has not been labeled with domains. For example, the domain discrimination unit 140 may use the domain discrimination model generated by the model learning unit 130 to determine the domain of an unlabeled derived variable data set.

In addition, according to an embodiment of the present application, the domain discrimination unit 140 may use the domain discrimination model to classify the domain of the first data set as at least one of number, amount, code, count, date, content, rate, name, flag, and contact. However, the domains are not limited to those listed, and various other domains may exist.

The number domain may refer to sequentially increasing numeric values that carry no meaning of their own, such as customer numbers and product numbers. More specifically, the number domain consists of combinations of letters or digits and is mostly managed according to an internal or external numbering system. It may include data that commonly follows domestic or international standard numbering systems, such as resident registration numbers, business registration numbers, postal codes, corporate registration numbers, IP addresses, and international standard numbers (ISBN/ISSN), as well as data managed under a company's internal numbering scheme, such as user numbers, account numbers, permit numbers, approval numbers, registration numbers, and product numbers. An ID domain is sometimes treated separately, but it may be merged into the number domain.

The amount domain may refer to money-related numbers, such as revenue, sales, and cost, among data of numeric type. More specifically, the amount domain holds values that express a sum of money and may consist of numeric values suited to each country's currency unit. Managing the amount domain consistently keeps the data stored in it in a valid form and preserves the validity of its value range. Examples of the amount domain include amount, tax, price, unit price, cost, fee, balance, and total.

The code domain may consist of predefined items, among data of numeric or character type, and may include codes and their values. For example, when female is represented as 'F' and male as 'M', 'F' and 'M' are the codes and female and male are the values. More specifically, the code domain may refer to data replaced by short code values in order to restrict the data that can be used or to represent data of the same meaning in the same way, and it is generally managed as codes and code values. As an example, a code may be a gender code, customer grade code, department code, product code, or region code, and a code value, such as 'M' or 'F' for the gender code, may be a value that represents or restricts the data for that information item. In the code domain, standardized codes may be defined in advance and managed.

The count domain may refer to numbers other than amounts of money, such as the number of customers, products, or audience members, among data of numeric type. More specifically, the count domain may refer to items managed as numbers, such as number of cases, scale, and frequency. Managing the count domain consistently makes it possible to maintain a valid range of maximum and minimum values for the data stored in it. Examples of the count domain include number of cases, number of sheets, round, quantity, distance, scale, length, weight, speed, frequency, floor area in pyeong, area, and temperature.

The date domain may refer to dates such as a year, a year and month, a full year-month-day, or a day. More specifically, the date domain may refer to items managed as dates and may include data representing a date and time, such as a receipt date, registration date and time, settlement year and month, or transmission time. The date domain may be stored either in a date data type provided by the DBMS or in a character type. When a DBMS date type is used, the DBMS itself checks for invalid date values, so errors in date values are rare. When the column is defined as a character type, however, invalid date values can be entered.
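The risk with character-typed date columns can be shown with a short check. This is a minimal sketch that assumes dates are stored as `YYYYMMDD` strings; the function name `invalid_dates`, the format string, and the sample values are illustrative only.

```python
from datetime import datetime


def invalid_dates(values, fmt="%Y%m%d"):
    """Return the values that cannot be parsed as a calendar date in the given format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


# A character column accepts all three strings; only the first is a valid YYYYMMDD date.
rejected = invalid_dates(["20190719", "20190231", "2019-07-19"])
print(rejected)
```

A DBMS date type would have refused the second and third values at insert time, which is why knowing that a character column belongs to the date domain is useful for quality checks.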

The content domain contains character strings of at least a predetermined length and may refer to, for example, the body of a post or of a self-introduction letter. More specifically, values in the content domain may be data that describes an object or action or records information that may serve as a reference, such as a definition, description, remark, content, or summary. The content domain may be characterized by unstructured text.

The rate domain may refer to numbers that express a ratio, such as achievement rate, accuracy, and cost rate, among data of numeric type. More specifically, the rate domain may be defined and used in many forms, such as progress rate, growth rate, rate of return, rate of change, interest rate, additional rate, and tariff rate. Of these, growth rate, rate of return, and rate of change may be calculated by a formula, whereas interest rate, additional rate, and tariff rate may be managed as reference information applied when calculating other numerical data. Examples of the rate domain include interest rate, yield, ratio, exchange rate, and percentage.

The name domain contains character strings of at most a predetermined length and may refer to customer names, product names, and the like. More specifically, the name domain holds names given to objects, persons, organizations, and so on to distinguish them from others, and may include a name, place, customer name, customer name in English, URL, or IP.

The flag domain indicates a yes-or-no condition and may consist of two opposing values, such as 0 and 1, 'Y' and 'N', '참' and '거짓' (Korean for true and false), or 'True' and 'False'. More specifically, the flag domain is one of the domains for which data standardization and management are straightforward. It may consist of two to three simple classification values, such as whether or not, presence or absence, 'Y' or 'N', or 1 or 0, and may be defined as a separate kind of value rather than a code. Data stored in the flag domain is always managed in the same form, which maintains consistency between information systems.

The contact domain may refer to data containing information such as an address, email address, or contact number.

According to an embodiment of the present application, the data result unit 150 may store the data included in the first data set in the database in association with the domain discrimination result determined by the domain discrimination unit. The data result unit 150 may also check, modify, and store the results of the model learning unit 130 and the domain discrimination unit 140. In addition, the data result unit 150 may enter a domain for the derived variable data set 1. The entered domain may be used as the label value of the training data for the model learning unit 130.

In addition, when the model learning unit 130 has split the data into training data and test data in generating the domain discrimination model, the data result unit 150 may test the model using the test data. By testing the domain discrimination model on the test data, the data result unit 150 can check model performance measures such as accuracy, precision, recall, and F1 score. The data result unit 150 may present the performance indicators of the domain discrimination model together with a confusion matrix of the number of correct and incorrect answers for each domain.

FIG. 5 is a diagram showing an example of a confusion matrix that can be viewed in the data result unit of the domain discrimination device for a non-standardized database according to an embodiment of the present application.

Referring to FIG. 5 as an example, the Y-axis of the figure lists the actual domains and the X-axis lists the domains predicted by the domain discrimination model. The combination of the X-axis and the Y-axis shows, for each domain, the proportion of correct answers and the pattern of incorrect answers.
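A confusion matrix with actual domains as rows and predicted domains as columns, and the per-domain precision, recall, and F1 score derived from it, can be sketched as follows. The function names and the five sample labels are hypothetical and do not reproduce the data behind FIG. 5.

```python
def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual domains, columns are predicted."""
    labels = sorted(set(actual) | set(predicted))
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return labels, matrix


def per_domain_scores(labels, matrix):
    """Return (precision, recall, F1) for each domain from the confusion matrix."""
    scores = {}
    for label in labels:
        true_positive = matrix[label][label]
        actual_total = sum(matrix[label].values())              # row sum
        predicted_total = sum(matrix[a][label] for a in labels)  # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


actual = ["date", "date", "code", "code", "flag"]
predicted = ["date", "code", "code", "code", "flag"]
labels, matrix = confusion_matrix(actual, predicted)
scores = per_domain_scores(labels, matrix)
accuracy = sum(matrix[label][label] for label in labels) / len(actual)
print(matrix, scores, accuracy)
```

Off-diagonal cells such as `matrix["date"]["code"]` show which domains the model confuses with each other, which is the "pattern of incorrect answers" the figure is meant to expose.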

The data result unit 150 may analyze how strongly each generated derived variable influenced the domain discrimination result. In other words, the data result unit 150 may report variable importance as an output of the model learning unit 130. The importance of a variable may be expressed as a percentage indicating how much that derived variable affects domain classification.

Table 3 shows the importance of each derived variable as a percentage.

Variable | Importance
Data length change | 19.60%
Date format | 15.40%
Maximum data length | 15.30%
Grouping ratio | 12.70%
Minimum data length | 8.90%
Data type | 7.70%
Blank ratio | 5.60%
Contact format | 4.40%
Written in numbers | 4.30%
Ratio at or below the hundreds unit | 3.20%
PK status | 2.10%
Line break included | 0.40%
Length below decimal point | 0.20%
Written in English | 0.20%
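The percentages in Table 3 can be read cumulatively to see how few variables carry most of the weight. The short sketch below only re-sorts and sums the published figures; the dictionary keys are English renderings of the variable names and the variable `cumulative` is introduced for this example.

```python
# Importance percentages as listed in Table 3.
importance = {
    "data length change": 19.6,
    "date format": 15.4,
    "maximum data length": 15.3,
    "grouping ratio": 12.7,
    "minimum data length": 8.9,
    "data type": 7.7,
    "blank ratio": 5.6,
    "contact format": 4.4,
    "written in numbers": 4.3,
    "ratio at or below the hundreds unit": 3.2,
    "PK status": 2.1,
    "line break included": 0.4,
    "length below decimal point": 0.2,
    "written in English": 0.2,
}

# Rank variables from most to least important and accumulate their share.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
cumulative = []
running = 0.0
for name, percent in ranked:
    running += percent
    cumulative.append((name, round(running, 1)))

for name, total in cumulative:
    print(f"{name}: {total}%")
```

Read this way, the four highest-ranked variables (data length change, date format, maximum data length, and grouping ratio) together account for 63.0% of the total importance.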

In addition, the data result unit 150 may check and modify the discrimination result of the domain discrimination unit 140. If a domain discrimination result is wrong, the data result unit 150 may correct it. The data result unit 150 may also provide the discrimination result of the domain discrimination unit 140 to a user terminal (not shown) through the data providing unit 160. The user input receiving unit 170 may receive a user input for modifying the discrimination result and pass it to the data result unit 150. Based on the user input information, the data result unit 150 may modify the discrimination result of the domain discrimination unit 140 and store it in the database 110. When storing a domain discrimination result in the database 110, the data result unit 150 may store it in association with the derived variable data. The data result unit 150 may also retrieve and modify domain discrimination results already stored in the database 110.

According to an embodiment of the present application, the data providing unit 160 may provide selection items related to domain discrimination input information to a user terminal (not shown). The user may review the items related to domain discrimination displayed on the user terminal (not shown) and select changes. As an example, the selection items related to domain discrimination may include items for checking and modifying a domain discrimination result, provided at the request of the data result unit 150.

According to an embodiment of the present application, the user input receiving unit 170 may receive domain discrimination input information from a user terminal (not shown). In other words, the user input receiving unit 170 may receive, from the user terminal (not shown), user input information related to the domain discrimination selected by the user. The user input receiving unit 170 may pass this information to each unit that requested selection items related to the data prediction input information. As an example, the domain discrimination input information may be user input for an item offered in a menu for checking and modifying domain discrimination results.

According to an embodiment of the present application, the domain discrimination device 100 may provide a user terminal (not shown) with the selection menus needed for domain discrimination. For example, the user terminal (not shown) may download and install an application program provided by the domain discrimination device 100, and the domain discrimination menu may be provided through the installed application. The selection menus needed for domain discrimination may include a final domain selection menu, but are not limited thereto.

The domain discrimination device 100 may include any kind of server, terminal, or device that exchanges data, content, and various communication signals with the user terminal (not shown) over a network and has data storage and processing functions.

The user terminal (not shown) is a device that works with the domain discrimination device 100 over a network. It may be, for example, any kind of wireless communication device, such as a smartphone, smart pad, tablet PC, wearable device, or a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal, or a fixed terminal such as a desktop computer or smart TV.

Examples of networks for sharing information between the domain discrimination device 100 and the user terminal (not shown) include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the wired and wireless Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a Wi-Fi network, an NFC (Near Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

FIG. 6 is a diagram schematically showing a domain discrimination result of the domain discrimination device for a non-standardized database according to an embodiment of the present application.

Referring to FIG. 6 as an example, the domain discrimination device 100 may select the data set on which domain discrimination is to be performed. For example, the domain discrimination device 100 may perform domain discrimination on the data contained in the table named CNTT_DATA under the schema named FINTECH in the METADB of the DBMS.

The domain discrimination device 100 may present the domain discrimination result for each column name as a recommended domain 14, a recommended domain probability 15, and a final domain 16. The recommended domain 14 may be the result of the domain discrimination device 100 determining the domain of the data using the artificial intelligence algorithm. The recommended domain probability 15 may be the accuracy of the domain discrimination made by the domain discrimination device 100. The final domain 16 may be a domain discrimination result modified by the user rather than the domain determined by the domain discrimination device 100. In other words, the final domain 16 may be information that can be changed based on user input information.
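One way to picture the three result fields is as a small record per column. This is a hypothetical data structure, not one described in the patent: the class name `DomainResult`, its field names, and the column name `CNTT_DT` are invented for illustration, and the rule that the final domain falls back to the recommended domain when the user has made no change is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainResult:
    """Per-column discrimination result (hypothetical structure)."""
    column_name: str
    recommended_domain: str           # recommended domain 14
    recommended_probability: float    # recommended domain probability 15
    user_domain: Optional[str] = None  # set only when the user overrides the model

    @property
    def final_domain(self) -> str:
        # Final domain 16: the user's choice if present, otherwise the recommendation (assumed).
        if self.user_domain is not None:
            return self.user_domain
        return self.recommended_domain


result = DomainResult("CNTT_DT", "date", 0.92)
before = result.final_domain
result.user_domain = "code"  # user correction received as input
after = result.final_domain
print(before, after)
```

Keeping the model's recommendation and the user's correction in separate fields preserves both, so corrected records can later serve as labeled examples for retraining.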

As described above, the domain discrimination device 100 can perform domain discrimination on the target database 110 through a supervised learning algorithm trained on derived variables generated from the data in the database. As the number of discriminations grows, continued learning increases the accuracy of domain discrimination, which in turn can increase the reliability of data quality management.

In addition, by performing automatic, artificial-intelligence-based domain discrimination within a system to which big data is applied, the domain discrimination apparatus 100 may provide an apparatus and method for improving the reliability of big data quality. To this end, the domain discrimination apparatus 100 may discriminate domains using various classification criteria for automatically identifying the domains defined in the data quality guidelines.

In addition, the domain discrimination apparatus 100 may perform pre-training through artificial intelligence based on a standard data dictionary, and may perform domain discrimination on the target database based on the pre-training result. By storing the domain discrimination results and learning from them, the domain discrimination apparatus 100 can increase the accuracy of domain discrimination. The domain discrimination apparatus 100 can also automate work that engineers previously performed by hand for data quality management; this automation minimizes the manpower required and allows continuous data quality management without human intervention.
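One way to picture how stored discrimination results could feed back into learning is to append them to the labeled examples used for training. This is only a sketch under stated assumptions: the function `build_retraining_set`, the record keys `values`, `recommended_domain`, and `final_domain`, and the rule of preferring a user-corrected final domain as the label are hypothetical and are not specified in the patent.

```python
def build_retraining_set(base_examples, stored_results):
    """Combine the original labeled examples with stored discrimination results."""
    examples = list(base_examples)
    for record in stored_results:
        # Prefer the user-corrected label when one exists (assumption).
        label = record.get("final_domain") or record["recommended_domain"]
        examples.append((record["values"], label))
    return examples


base = [(["2019-07-19"], "date")]
stored = [
    {"values": ["010-1234-5678"], "recommended_domain": "number",
     "final_domain": "contact"},
    {"values": ["1500", "320"], "recommended_domain": "amount",
     "final_domain": None},
]
retraining_set = build_retraining_set(base, stored)
```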

Hereinafter, the operation flow of the present application is briefly described based on the details given above.

FIG. 7 is an operation flowchart of a domain discrimination method for a non-standardized database according to an embodiment of the present application.

The domain discrimination method for a non-standardized database illustrated in FIG. 7 may be performed by the domain discrimination apparatus 100 for a non-standardized database described above. Therefore, even where details are omitted below, the description given for the domain discrimination apparatus 100 applies equally to the domain discrimination method for a non-standardized database.

In step S701, the domain discrimination apparatus 100 may generate derived variables using the plurality of non-standardized data included in the database 110.

In step S702, the domain discrimination apparatus 100 may generate a domain discrimination model using the derived variable data set.

In step S703, the domain discrimination apparatus 100 may determine the domain for a first data set using the domain discrimination model.

In step S704, the domain discrimination apparatus 100 may link the data included in the first data set with the determined domain discrimination result and store them in the database 110.

In the above description, steps S701 to S704 may be further divided into additional steps or combined into fewer steps, depending on the embodiment of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
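Steps S701 to S704 can be sketched end to end as follows. This is an illustrative sketch only: the patent does not name a specific algorithm, so a nearest-centroid classifier stands in for the "domain discrimination model", and the three derived variables, the function names, the sample values, and the column name `REG_DT` are all hypothetical.

```python
import math
from collections import defaultdict


def derive_variables(values):
    """S701: reduce a column to [max length, min length, digit ratio]."""
    lengths = [len(v) for v in values] or [0]
    digits = sum(c.isdigit() for v in values for c in v)
    total_chars = sum(lengths) or 1
    return [max(lengths), min(lengths), digits / total_chars]


def train_model(labeled_columns):
    """S702: average the derived variables per domain (nearest-centroid stand-in)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for values, domain in labeled_columns:
        for i, x in enumerate(derive_variables(values)):
            sums[domain][i] += x
        counts[domain] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}


def discriminate(model, values):
    """S703: pick the domain whose centroid is closest to the column."""
    features = derive_variables(values)
    return min(model, key=lambda d: math.dist(model[d], features))


def store_result(database, column_name, values, domain):
    """S704: store the data together with its discrimination result."""
    database[column_name] = {"values": values, "domain": domain}


# Labeled training columns (made-up sample data).
training = [
    (["2019-07-19", "2020-06-08"], "date"),
    (["Kim", "Hwang", "Kong"], "name"),
]
model = train_model(training)

# A plain dict stands in for the database 110 in this sketch.
database = {}
first_data_set = {"REG_DT": ["2021-01-05", "2021-02-11"]}
for name, values in first_data_set.items():
    store_result(database, name, values, discriminate(model, values))
```

Because each step is a separate function, the steps can be split, merged, or reordered as the paragraph above allows, for example by caching the output of `derive_variables` and reusing it across training and discrimination.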

The domain discrimination method for a non-standardized database according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, the above-described domain discrimination method for a non-standardized database may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The foregoing description of the present application is for illustration, and a person of ordinary skill in the technical field to which the present application belongs will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present application. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: domain discrimination device
110: database
120: derived variable generation unit
130: model learning unit
140: domain discrimination unit
150: data result unit
160: data providing unit
170: user input receiving unit

Claims

A domain discrimination device for a non-standardized database, comprising:
a database including a plurality of non-standardized data;
a derived variable generation unit that generates derived variables using the plurality of non-standardized data included in the database;
a model learning unit that generates a domain discrimination model using a derived variable data set;
a domain discrimination unit that determines a domain for a first data set using the domain discrimination model; and
a data result unit that links the data included in the first data set with the domain discrimination result determined by the domain discrimination unit and stores them in the database,
wherein the model learning unit re-trains the domain discrimination model using a new data set stored by linking the data included in the first data set with the domain discrimination result determined by the domain discrimination unit, and
wherein the data result unit checks and corrects the discrimination results of the model learning unit and the domain discrimination unit and stores them in the database.

The domain discrimination device according to claim 1,
wherein the derived variable generation unit generates a derived variable by extracting a characteristic value or a representative value of the plurality of non-standardized data included in the database.

The domain discrimination device according to claim 2,
wherein the derived variable generation unit generates, for the non-standardized data included in the database, a derived variable corresponding to at least one of: data type, maximum data length, minimum data length, data length change, number of digits after the decimal point, date format, contact format, blank ratio, whether a line break (Enter) is included, whether English characters are included, whether numerals are included, ratio of values of one hundred or less, grouping ratio, or whether the data is a PK.

The domain discrimination device according to claim 3,
wherein the model learning unit generates the domain discrimination model by applying, to an artificial intelligence algorithm, a derived variable data set in which a domain is labeled on the derived variables generated by the derived variable generation unit.

The domain discrimination device according to claim 1,
wherein the domain discrimination unit determines, using the domain discrimination model, the domain for the first data set as at least one of number, amount, code, quantity, date, content, rate, name, flag, and contact information.

The domain discrimination device according to claim 1,
wherein the data result unit analyzes the influence of the generated derived variables on the domain discrimination result.

The domain discrimination device according to claim 1, further comprising:
a data providing unit that provides selection items related to domain discrimination input information to a user terminal; and
a user input receiving unit that receives the domain discrimination input information from the user terminal.

A domain discrimination method for a non-standardized database, comprising:
generating derived variables using a plurality of non-standardized data included in a database including the plurality of non-standardized data;
generating a domain discrimination model using a derived variable data set;
determining a domain for a first data set using the domain discrimination model; and
storing the data included in the first data set and the determined domain discrimination result in the database in association with each other,
wherein generating the domain discrimination model includes re-training the domain discrimination model using a new data set stored by associating the data included in the first data set with the determined domain discrimination result, and
wherein storing in the database includes checking and correcting the generated domain discrimination model and the determined domain discrimination result and storing them in the database.