KR102623561B1

KR102623561B1 - Method for generating metadata for automatically determining type of data and apparatus for determining type of data using a machine learning/deep learning model for the same

Info

Publication number: KR102623561B1
Application number: KR1020230148252A
Authority: KR
Inventors: 마보현
Original assignee: 주식회사 스타캣
Priority date: 2021-11-12
Filing date: 2023-10-31
Publication date: 2024-01-09
Also published as: KR102622433B1; KR20230069428A; KR20230154157A; KR20230155390A; KR102622434B1

Abstract

According to an embodiment of the present invention, a method by which a metadata generation device using a machine learning/deep learning model automatically determines the type of data (the types including numeric, character, categorical, and date) and generates metadata includes: (a) making a primary determination of whether the field values of received data are of the date type; (b) if the primary determination in step (a) is that the data is not of the date type, applying data type determination conditions included in a metadata generation rule to the field values of the received data to make a secondary determination of whether the data is of either the categorical type or the date type; (c) if the secondary determination in step (b) is that the data is of neither the categorical type nor the date type, applying the field name of the received data to a field mapping table for data type determination, which is included in the metadata generation rule, to make a final determination of whether the data is of the numeric, character, categorical, or date type, and generating metadata; and (d) training a machine learning/deep learning model on the generated metadata to update the metadata generation rule.

Description

Method for automatically determining the type of data and generating metadata, and data type determination device using a machine learning/deep learning model for the same {METHOD FOR GENERATING METADATA FOR AUTOMATICALLY DETERMINING TYPE OF DATA AND APPARATUS FOR DETERMINING TYPE OF DATA USING A MACHINE LEARNING/DEEP LEARNING MODEL FOR THE SAME}

The present invention relates to a method for automatically determining the type of data and generating metadata, and to a data type determination device using a machine learning/deep learning model for that purpose. More specifically, it relates to a method and device that make it easy to determine the data types used in generating metadata for application to machine learning/deep learning models, which are artificial intelligence models.

To apply data to the broad range of artificial intelligence (AI) models, including machine learning and deep learning models, it is necessary to first grasp and identify the metadata, that is, the structured data that describes the data, rather than applying raw, unorganized data as is.

For structured data in particular, model performance can improve markedly when the data is applied to an AI model with its metadata clearly laid out: the data type of each field, the statistical characteristics of its values, distribution figures such as maximum, minimum, mean, and standard deviation, the range corresponding to outliers, the version of the data, and the names of the data and its fields. This can be regarded as a kind of data preprocessing. Conventionally, it was carried out mainly by highly paid specialists who explored the data or determined the data type of each field one by one using a high-level programming language such as Python.

Among the preprocessing steps for metadata generation, determining the data type matters most. AI models can fundamentally accept only numeric input, so non-numeric data such as text or categorical data must be converted to numeric data. Moreover, even for text data, both the statistical information used and the preprocessing method itself differ depending on whether the data is ordinary text, categorical data representing specific categories, or date data.

However, with the conventional approach, the type determined for the same data varies with the skill of each specialist and the perspective from which each views the data, and this can affect even the performance of the AI model. In addition, because specialists do the work by hand, it inevitably takes considerable time and incurs high labor costs. Startups and small and medium-sized businesses with limited funds either cannot perform data type determination at all or perform it slowly with in-house staff, which degrades the performance of the released product or delays its launch.

In view of these problems, the present invention relates to a method and device that make it easy to generate metadata as part of data preprocessing, and more specifically to determine the data types used in generating that metadata, without the involvement of specialists.

Republic of Korea Registered Patent Publication No. 10-2310598 (2021.10.01)

One technical problem to be solved by the present invention is to provide a method for automatically determining the type of data and generating metadata, and a data type determination device using a machine learning/deep learning model for that purpose, which make it easy to generate metadata as part of preprocessing structured data, and more specifically to determine the data types used in generating that metadata, without the involvement of specialists.

Another technical problem to be solved by the present invention is to provide such a method and device that steadily improve the performance of the machine learning/deep learning model by continuously learning from the results of metadata generation for structured data, and more specifically from the results of determining the data types used in generating that metadata.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above technical problems, a method according to an embodiment of the present invention, by which a metadata generation device using a machine learning/deep learning model automatically determines the type of data (the types including numeric, character, categorical, and date) and generates metadata, includes: (a) making a primary determination of whether the field values of received data are of the date type; (b) if the primary determination in step (a) is that the data is not of the date type, applying data type determination conditions included in a metadata generation rule to the field values of the received data to make a secondary determination of whether the data is of either the categorical type or the date type; (c) if the secondary determination in step (b) is that the data is of neither the categorical type nor the date type, applying the field name of the received data to a field mapping table for data type determination, which is included in the metadata generation rule, to make a final determination of whether the data is of the numeric, character, categorical, or date type, and generating metadata; and (d) training a machine learning/deep learning model on the generated metadata to update the metadata generation rule.

According to one embodiment, the determination in step (a) may follow a publicly available programming language.
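As a minimal sketch of step (a), assume that the publicly available language is Python and that a field counts as date type only when every non-empty value parses with the standard library's `datetime.fromisoformat`. The helper name `is_date_field` and the all-values policy are illustrative assumptions and do not come from the specification.

```python
from datetime import datetime


def is_date_field(values):
    """Return True if every non-empty value parses as an ISO 8601 date/time.

    Hypothetical helper: the specification only says step (a) may rely on
    a publicly available programming language; the parsing rule is assumed.
    """
    # Ignore missing and blank values before testing.
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return False
    for text in non_empty:
        try:
            datetime.fromisoformat(text)
        except ValueError:
            return False
    return True
```

Fields that fail this check are not discarded; they would simply move on to the secondary determination of step (b).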

According to one embodiment, the data type determination conditions included in the metadata generation rule of step (b) include a first condition on whether the ratio of the number of unique field values to the total number of field values of the received data is below a predetermined ratio. If, under the first condition, the ratio is below the predetermined ratio, the type of the received data may be categorical.

According to one embodiment, the predetermined ratio may be any value between 0 and 0.5.
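The first condition can be sketched as follows. The function name `is_categorical` and the default threshold of 0.5 (the upper end of the stated range) are assumptions made for illustration; the specification only states that the predetermined ratio lies between 0 and 0.5.

```python
def is_categorical(values, threshold=0.5):
    """First condition: unique-value ratio below a predetermined ratio.

    `threshold` is an assumed default; any value between 0 and 0.5 may be used.
    """
    total = len(values)
    if total == 0:
        return False
    unique_ratio = len(set(values)) / total
    return unique_ratio < threshold
```

For example, a column of eight values drawn from only two labels has a unique-value ratio of 0.25 and would be treated as categorical under this sketch, whereas a column of all-distinct values has a ratio of 1.0 and would not.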

According to one embodiment, the data type determination conditions included in the metadata generation rule of step (b) include a second condition on whether the field values of the received data contain one or more of a year pattern, a month pattern, a day pattern, an hour pattern, a minute pattern, and a second pattern. If, under the second condition, one or more of these patterns are contained, the type of the received data may be the date type.
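The second condition might be approximated with regular expressions, as in the sketch below. The specification does not define the exact year, month, day, hour, minute, or second patterns, so the expressions in `DATE_PATTERNS`, the helper names, and the policy of requiring every value to match are all hypothetical.

```python
import re

# Hypothetical patterns; the specification does not give the exact expressions.
DATE_PATTERNS = {
    # e.g. 2023-10-31, 2023/10/31, 2023.10.31
    "year_month_day": re.compile(
        r"\b(19|20)\d{2}[-/.](0?[1-9]|1[0-2])[-/.](0?[1-9]|[12]\d|3[01])\b"
    ),
    # e.g. 2023-10
    "year_month": re.compile(r"\b(19|20)\d{2}[-/.](0?[1-9]|1[0-2])\b"),
    # e.g. 14:05 or 14:05:59
    "hour_minute_second": re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d(:[0-5]\d)?\b"),
}


def matches_date_pattern(value):
    """Return True if the value contains at least one assumed date/time pattern."""
    text = str(value)
    return any(pattern.search(text) for pattern in DATE_PATTERNS.values())


def is_date_by_pattern(values):
    """Assumed policy: the field is date type if every value matches a pattern."""
    return bool(values) and all(matches_date_pattern(v) for v in values)
```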

According to one embodiment, the field mapping table of step (c) may include a field name/full name ID table, in which one or more (full name ID, reference count) pairs are mapped to each of one or more field names, and a data type mapping table, in which the probability of being numeric, the probability of being character, the probability of being categorical, and the probability of being date are mapped to each of the one or more full name IDs.

According to one embodiment, the field mapping table of step (c) may further include a full name information table that holds the full name, field description, and reference count for each of the one or more full name IDs, and whose entries correspond 1:1 to the full name IDs contained in the data type mapping table.
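The three tables can be pictured as plain dictionaries keyed by field name and full name ID. In the sketch below every entry is made-up illustrative data, and the lookup policy in `resolve_type` (pick the full name ID with the highest reference count, then the type with the highest probability) is an assumption; the specification describes the table contents but not this exact selection rule.

```python
from dataclasses import dataclass


@dataclass
class FullNameInfo:
    full_name: str
    description: str
    reference_count: int


# Field name/full name ID table: field name -> [(full name ID, reference count)].
FIELD_TO_FULL_NAME = {
    "amt": [(1, 12), (2, 3)],
    "dt": [(3, 7)],
}

# Data type mapping table: full name ID -> probability per data type.
TYPE_PROBABILITIES = {
    1: {"numeric": 0.90, "character": 0.05, "categorical": 0.04, "date": 0.01},
    2: {"numeric": 0.10, "character": 0.70, "categorical": 0.15, "date": 0.05},
    3: {"numeric": 0.02, "character": 0.03, "categorical": 0.05, "date": 0.90},
}

# Full name information table: one entry per full name ID (1:1).
FULL_NAME_INFO = {
    1: FullNameInfo("amount", "monetary amount", 12),
    2: FullNameInfo("amendment", "amendment note", 3),
    3: FullNameInfo("date", "calendar date", 7),
}


def resolve_type(field_name):
    """Return the most probable type for a field name, or None if unmapped."""
    candidates = FIELD_TO_FULL_NAME.get(field_name.lower())
    if not candidates:
        return None
    # Assumed policy: prefer the full name that has been referenced most often.
    full_name_id, _ = max(candidates, key=lambda pair: pair[1])
    probabilities = TYPE_PROBABILITIES[full_name_id]
    return max(probabilities, key=probabilities.get)
```

Keeping the reference counts alongside each mapping is what would let a later learning step reweight ambiguous abbreviations such as `amt`.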

According to one embodiment, after step (d), the method may further include: (e) determining whether the data type finally determined in step (c) is either the numeric type or the date type; (f) if step (e) determines that it is, sorting the field values of the received data and determining whether they include duplicate field values; (g) if step (f) determines that duplicate field values are included, calculating the interval between each pair of neighboring field values among the sorted field values; (h) determining whether the proportion of the most frequently occurring interval value among the calculated intervals is at or above a predetermined ratio; and (i) if step (h) determines that it is, determining that the received data is a time series dataset.

According to one embodiment, the method may further include: (f′) if step (e) determines that the type is neither numeric nor date, determining that the received data is a non-time series dataset; (g′) if step (f) determines that no duplicate field values are included, determining that the received data is a time series dataset; and (i′) if step (h) determines that the proportion is below the predetermined ratio, determining that the received data is a non-time series dataset.
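Steps (e) through (i), together with the alternative branches (f′), (g′), and (i′), can be sketched as a single function that follows the branching exactly as worded above. The function name `is_time_series`, the type labels, the default `min_ratio` of 0.5, and the handling of an empty field are assumptions; the specification does not state the value of the predetermined ratio.

```python
from collections import Counter


def is_time_series(values, data_type, min_ratio=0.5):
    """Classify a field as time series (True) or non-time series (False).

    `min_ratio` stands in for the unspecified predetermined ratio.
    """
    # (e) / (f'): only numeric or date fields can be time series.
    if data_type not in ("numeric", "date"):
        return False
    if not values:
        return False  # assumption: an empty field is not a time series
    # (f): sort the values and check for duplicates.
    ordered = sorted(values)
    if len(set(ordered)) == len(ordered):
        return True  # (g'): no duplicates
    # (g): intervals between neighboring sorted values.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # (h): share of the most frequent interval value.
    most_common_count = Counter(gaps).most_common(1)[0][1]
    # (i) / (i'): compare against the predetermined ratio.
    return most_common_count / len(gaps) >= min_ratio
```

Because subtraction is the only arithmetic used, the same sketch works for `datetime` values, whose differences are hashable `timedelta` objects.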

To achieve the above technical problems, a data type determination device using a machine learning/deep learning model according to an embodiment of the present invention includes one or more processors, a network interface, a memory into which a computer program executed by the processors is loaded, and storage that stores large-capacity network data and the computer program. The computer program causes the one or more processors to execute: (A) an operation of making a primary determination of whether the field values of received data are of either the numeric type or the character type; (B) an operation of, if the primary determination in operation (A) is that the data is of either the numeric type or the character type, applying data type determination conditions included in a metadata generation rule to the field values of the received data to make a secondary determination of whether the data is of either the categorical type or the date type; (C) an operation of, if the secondary determination in operation (B) is that the data is of neither the categorical type nor the date type, applying the field name of the received data to a field mapping table for data type determination, which is included in the metadata generation rule, to make a final determination of whether the data is of the numeric, character, categorical, or date type, and generating metadata; and (D) an operation of training a machine learning/deep learning model on the generated metadata to update the metadata generation rule.

According to the present invention as described above, the data type determination device using a machine learning/deep learning model can automatically classify received data as numeric, character, categorical, or date through a three-stage determination process. Data type determination can therefore be carried out easily without the involvement of specialists, which avoids unnecessary expenditure of time and money and simplifies the data preprocessing process.

In addition, because the metadata generation rule is updated by continuously training the machine learning/deep learning model on the metadata generation results, the accuracy of subsequent metadata generation, and more specifically of data type determination, can be improved.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram showing the overall configuration of a data type determination device using a machine learning/deep learning model according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing representative steps of a method for automatically determining the type of data and generating metadata according to a second embodiment of the present invention.
FIG. 3 is a diagram showing an example of the field mapping table.
FIG. 4 is another diagram showing an example of the field mapping table.
FIG. 5 is a flowchart showing, for the method according to the second embodiment, the steps that are performed first in two cases: when the field name of the received data is not included in the field name/full name mapping table but the full name it denotes is included in the full name information table, and when the field name is not included in the field name/full name mapping table and its full name is not included in the full name information table either.
FIG. 6 is a first flowchart showing a portion separated from the flowchart of FIG. 5.
FIG. 7 is a second flowchart showing a portion separated from the flowchart of FIG. 5.
FIG. 8 is a diagram showing an example of how the reference count is updated in the field mapping table according to steps S300-1 to S300-5.
FIG. 9 is a diagram showing an example of how the reference count is updated in the field mapping table according to steps S300-2′ to S300-6′.
FIG. 10 is a diagram showing an example of how a new field name is added to the field name/full name mapping table according to steps S300-2′ to S300-6′.
FIG. 11 is a diagram showing an example of how the data type determination device 100 using a machine learning/deep learning model receives data, applies the metadata generation rule to generate metadata, and then learns from the result to update the metadata generation rule.
FIG. 12 is a flowchart that adds, to the method according to the second embodiment, a method performed after step S240 for distinguishing time series datasets from non-time series datasets.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) are used with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically indicates otherwise.

As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, or operations in addition to those mentioned.

FIG. 1 is a diagram showing the overall configuration of the data type determination device 100 using a machine learning/deep learning model according to the first embodiment of the present invention.

However, this is only a preferred embodiment for achieving the purpose of the present invention. Components may be added or removed as needed, and the role of one component may of course be performed jointly by another.

The data type determination device 100 using a machine learning/deep learning model according to the first embodiment of the present invention may include a processor 10, a network interface 20, a memory 30, storage 40, and a data bus 50 connecting them. It may be implemented as a standalone device, or in the form of a tangible physical service server, such as an in-house system or a space rental system, or an intangible cloud service server.

The processor 10 controls the overall operation of each component. The processor 10 may be a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), or any other type of processor widely known in the technical field to which the present invention pertains, and may be implemented as an AI model processor such as a machine learning model processor or a deep learning model processor. In addition, the processor 10 may perform computation for at least one application or program that carries out the method for automatically determining the type of data and generating metadata according to the second embodiment of the present invention.

The network interface 20 supports wired and wireless Internet communication for the data type determination device 100 according to the first embodiment of the present invention, and may also support other known communication methods. Accordingly, the network interface 20 may include the corresponding communication modules.

The memory 30 stores various data, instructions, and/or information, and may load one or more computer programs 41 from the storage 40 in order to carry out the method for automatically determining the type of data and generating metadata according to the second embodiment of the present invention. Although FIG. 1 shows RAM as one example of the memory 30, various other storage media may of course be used as the memory 30.

The storage 40 may non-temporarily store one or more computer programs 41 and large-capacity network information 42. The storage 40 may be a non-volatile memory such as ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory, a hard disk, a removable disk, or any other form of computer-readable recording medium widely known in the technical field to which the present invention pertains.

The computer program 41 is loaded into the memory 30 and may cause the one or more processors 10 to execute: (A) an operation of making a primary determination of whether the field values of received data are of the date type; (B) an operation of, if the primary determination in operation (A) is that the data is not of the date type, applying data type determination conditions included in a metadata generation rule to the field values of the received data to make a secondary determination of whether the data is of either the categorical type or the date type; (C) an operation of, if the secondary determination in operation (B) is that the data is of neither the categorical type nor the date type, applying the field name of the received data to a field mapping table for data type determination, which is included in the metadata generation rule, to make a final determination of whether the data is of the numeric, character, categorical, or date type, and generating metadata; and (D) an operation of training a machine learning/deep learning model on the generated metadata to update the metadata generation rule.
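The control flow of operations (A) through (C) can be sketched as a short cascade in which each stage runs only if the previous one did not settle the type. The function name `determine_type`, its parameters, and the `"character"` fallback for unmapped field names are hypothetical; operation (D), the learning-based rule update, is outside the scope of this sketch.

```python
def determine_type(field_name, values, is_date, rule_conditions, lookup_by_name,
                   default="character"):
    """Three-stage type determination (illustrative sketch).

    is_date:         callable(values) -> bool, the primary check of (A)
    rule_conditions: list of (type label, callable(values) -> bool) for (B)
    lookup_by_name:  callable(field_name) -> type label or None, for (C)
    default:         assumed fallback when the field name is not mapped
    """
    # (A) Primary determination on the field values.
    if is_date(values):
        return "date"
    # (B) Secondary determination using the rule's type conditions.
    for data_type, condition in rule_conditions:
        if condition(values):
            return data_type
    # (C) Final determination from the field name via the mapping table.
    mapped = lookup_by_name(field_name)
    return mapped if mapped is not None else default
```

Passing the checks in as callables keeps the cascade independent of how each condition is implemented, so an updated metadata generation rule could swap in new conditions or a new lookup without changing the flow.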

The operations performed by the computer program 41, briefly mentioned above, can be regarded as functions of the computer program 41. They are described in more detail below, in the description of the method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention.

The data bus 50 serves as the path along which instructions and/or information move between the processor 10, the network interface 20, the memory 30, and the storage 40 described above.

When the data type determination device 100 using a machine learning/deep learning model according to the first embodiment of the present invention described above is implemented as a standalone device, a user terminal (not shown) may transmit data to the device over a network and receive the generated metadata from it. When the device is implemented as a server, it may provide the user terminal (not shown) with the automatic metadata generation function offered by a dedicated application linked to the server over the network.

Hereinafter, a method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention is described with reference to FIGS. 2 to 10.

FIG. 2 is a flowchart showing the representative steps of the method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention.

However, this is merely a preferred embodiment for achieving the object of the present invention. Steps may of course be added or omitted as needed, and any one step may be performed as part of another step.

In addition, each step is assumed to be performed by the data type determination device 100 using a machine learning/deep learning model according to the first embodiment of the present invention, implemented in the form of a server. The user terminal (not shown) may be any terminal having a network function, for example a smartphone, smart watch, smart glasses, laptop computer, tablet computer, PDA, PDP, or PMP, but the following description assumes that the user terminal (not shown) is a desktop PC.

First, the data type determination device 100 using a machine learning/deep learning model primarily determines whether the field values of the received data are of the date type (S210).

Here, the entity that transmits the data to the data type determination device 100 is typically a user terminal (not shown). Alternatively, the data type determination device 100 may itself, or under an administrator's operation, crawl the data from the Internet or connect to a specific database and download it. The entity that transmits the data is therefore not particularly limited.

Having received the data, the data type determination device 100 may store it in an internal storage space such as the storage 40 or in an external storage space connected over a network, or it may proceed directly, without storing the data, to primarily determine whether the field values of the data are of the date type.

Step S210 is performed by the data type determination device 100, but more specifically it can be regarded as being performed by a publicly available programming language such as Python installed on the data type determination device 100. Accordingly, the primary determination of the data type in step S210 may concern the data types that the programming language in question, such as Python, can recognize.

In this case, the data types that Python can recognize are broadly the numeric type, the character type, and the date type. The numeric type (also called the numerical type) consists of numbers, including integers and real numbers; the character type (also called the text type or string type) consists of text; and the date type consists of values in a format that the programming language can recognize as representing a date.

Based on Python, the data type determination device 100 primarily determines the type to be numeric when the field values of the data consist entirely of numbers, primarily determines it to be the date type when the field values consist solely of a date format such as YYYY-MM-DD or YYYY/MM/DD, and determines everything else to be the character type.
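The primary determination of step S210 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the device: the function names `is_number`, `is_date`, and `primary_type` are hypothetical, and the list of recognized date formats is an assumption limited to the two formats the text gives as examples.

```python
from datetime import datetime

# Assumption: only the two formats named in the text are recognized as dates.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")


def is_number(value: str) -> bool:
    """Return True if the value parses as an integer or a real number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_date(value: str) -> bool:
    """Return True if the value matches one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def primary_type(values: list[str]) -> str:
    """First-pass type of a field: 'date', 'numeric', or 'string'."""
    if values and all(is_date(v) for v in values):
        return "date"
    if values and all(is_number(v) for v in values):
        return "numeric"
    return "string"


print(primary_type(["2021-11-11", "2021/11/12"]))
print(primary_type(["20211111"]))
print(primary_type(["very good", "bad"]))
```

Under these assumptions a value such as 20211111 is classified as numeric and a value such as "very good" as a string, which are exactly the cases the later steps are meant to refine.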

If step S210 primarily determines that the field values of the received data are not of the date type, the data type determination device 100 applies the data type determination conditions included in the metadata generation rule to the field values of the received data and secondarily determines whether they are of either the categorical type or the date type (S220).

The secondary determination in step S220 is performed only when step S210 primarily determines that the field values of the received data are not of the date type. When step S210 primarily determines that they are of the date type, step S220 is not performed; the data type is finally determined to be the date type and the process ends.

Step S220 is performed only in that case because a field that is not of the date type, that is, a field of the numeric or character type, may consist solely of numbers or solely of characters and still be, in its true meaning, not a numeric or character type but a categorical or date type.

For example, if the field values of data called customer grade are 1, 2, 3, 4, and 5, they consist only of numbers and will be determined to be numeric in step S210, yet in their true meaning they are field values that correspond to categories representing a customer grade. Likewise, if the field values of data called quality are very good, good, normal, bad, and very bad, they are neither numeric nor date values and will be determined to be of the character type in step S210, yet in their true meaning they are field values that correspond to categories representing quality. Further, if a field value of some data is 20211111, it is not in a date format that the programming language can recognize and will be determined in step S210 to be numeric rather than a date, yet in its true meaning it is a field value corresponding to a date, namely November 11, 2021. Similarly, if a field value of some data is 2021년 11월 11일 ("November 11, 2021" written in Korean), it will be determined in step S210 to be of the character type rather than the date type unless the date format is specified separately, yet in its true meaning it is a field value representing a date. The cases that require a more precise, detailed determination of the data type can therefore be limited to the numeric and character types.

To secondarily determine whether the field values of the received data are of either the categorical type or the date type, the data type determination device 100 uses the data type determination conditions included in the metadata generation rule, which are described below.

One of the data type determination conditions included in the metadata generation rule is a first condition: whether the ratio of the number of unique field values to the total number of field values of the received data is less than a predetermined ratio. If the ratio is less than the predetermined ratio, the type of the received data may be categorical. The first condition is thus the condition for identifying which numeric or character fields are in fact categorical.

Consider an example. In a dataset of 1,000 samples, suppose that a field called "customer grade", initially determined to be numeric by the programming language, has 10 unique field values. The ratio of the number of unique field values (10) to the total number of field values (1,000) is 10/1,000 = 0.01, a very small value close to 0 that can be regarded as statistically negligible. The 1,000 samples therefore contain almost no unique field values and instead consist of several groups, each sharing the same field value while the groups differ from one another, for example 100 samples with field value 1, 200 with field value 2, 500 with field value 3, 100 with field value 4, 99 with field value 5, and 1 with field value 6 (a unique field value). The field can accordingly be determined to be categorical rather than numeric, in this example a categorical type having 1, 2, 3, 4, and 5 as its categories.

The predetermined ratio used as the threshold may be any value between 0 and 0.5. The larger the predetermined ratio, the more fields with a high number of unique field values are still determined to be categorical, which can reduce the accuracy of the determination. It is therefore preferable for the administrator of the data type determination device 100 to set the predetermined ratio close to 0. When the metadata generated is learned by the machine learning/deep learning model in step S240, described later, to update the metadata generation rule, the predetermined ratio may also be adjusted according to the learning result.
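The first condition can be sketched as a short Python function. The sketch makes two assumptions that the text does not settle: "unique field values" is read as the number of distinct values in the field, and the default threshold of 0.05 is merely one illustrative value near 0 within the stated 0 to 0.5 range. The name `is_categorical` is hypothetical.

```python
def is_categorical(values: list[str], threshold: float = 0.05) -> bool:
    """First condition: distinct-to-total ratio below the predetermined ratio.

    Assumption: 'unique field values' means distinct values, and 0.05 is an
    illustrative threshold, not a value given in the text.
    """
    if not values:
        return False
    ratio = len(set(values)) / len(values)
    return ratio < threshold


# The grouping used in the example: 1,000 samples spread over six values.
grades = ["1"] * 100 + ["2"] * 200 + ["3"] * 500 + ["4"] * 100 + ["5"] * 99 + ["6"]
print(is_categorical(grades))

# A field in which every value is different is not categorical.
ids = [str(i) for i in range(1000)]
print(is_categorical(ids))
```

Raising `threshold` makes the check more permissive, which mirrors the accuracy trade-off described above.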

Another of the data type determination conditions included in the metadata generation rule is a second condition: whether the field values of the received data contain one or more of a year pattern, a two-character month pattern, a day pattern, an hour pattern, a minute pattern, and a second pattern. If they contain one or more of these patterns, the type of the received data may be the date type. The second condition is thus the condition for identifying which numeric or character fields are in fact dates.

In the earlier example, the field values 20211111 and 2021년 11월 11일 would be determined to be numeric and character values respectively, although both are dates in their true meaning. The second condition exists so that such values are determined to be of the date type, making the data type determination accurate.

More specifically, the year pattern that the second condition may include is a pattern containing a two-digit or four-digit number and characters denoting a year, for example 년, Y, or year. The month pattern is a pattern containing a one-digit or two-digit number and characters denoting a month, for example 월, M, or Month, or containing characters that denote a month by themselves without a number, for example a month abbreviation such as Jan or Feb or a full month name such as January or February. The day pattern is a pattern containing a one-digit or two-digit number and characters denoting a day, for example 일, D, or Day. The minute pattern is a pattern containing a one-digit or two-digit number and characters denoting a minute, for example 분, m, min, or minute. The second pattern is a pattern containing a one-digit, two-digit, or fractional number and characters denoting a second, for example 초, s, or second. These patterns are only examples, and any number of other patterns may of course be added as conditions for determining whether a value is of the date type.
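One way to express the second condition is with regular expressions, sketched below. The regular expressions are an illustrative reading of the patterns listed above, not the patterns of the invention. The `compact` entry (an eight-digit YYYYMMDD value such as 20211111) is an added assumption, motivated by the earlier example, and the names `DATE_PATTERNS`, `matched_date_patterns`, and `looks_like_date` are hypothetical. The hour pattern is omitted because the text does not spell it out.

```python
import re

MONTH_NAMES = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)

# Matching is case-sensitive so that 'M' (month) and 'm' (minute) stay distinct.
DATE_PATTERNS = {
    "year": re.compile(r"(?:\d{4}|\d{2})\s*(?:년|year|Y)"),
    "month": re.compile(rf"\d{{1,2}}\s*(?:월|Month|M)|\b(?:{MONTH_NAMES})\b"),
    "day": re.compile(r"\d{1,2}\s*(?:일|Day|D)"),
    "minute": re.compile(r"\d{1,2}\s*(?:분|minute|min|m)"),
    "second": re.compile(r"\d+(?:\.\d+)?\s*(?:초|second|s)"),
    # Assumption: a bare eight-digit YYYYMMDD value also counts as a date.
    "compact": re.compile(r"^(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])$"),
}


def matched_date_patterns(value: str) -> list[str]:
    """Return the names of all patterns found in the value."""
    return [name for name, pattern in DATE_PATTERNS.items() if pattern.search(value)]


def looks_like_date(value: str) -> bool:
    """Second condition: the value contains at least one date pattern."""
    return bool(matched_date_patterns(value))


print(matched_date_patterns("2021년 11월 11일"))
print(matched_date_patterns("20211111"))
print(looks_like_date("very good"))
```

These expressions are deliberately loose (for instance, "5 m" would match the minute pattern), so a practical rule set would likely need tighter patterns or a requirement that several patterns match together.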

If the secondary determination of step S220 finds that the field values are of neither the categorical type nor the date type, that is, that they are of either the numeric type or the character type, the data type determination device 100 applies the field name of the received data to the field mapping table for data type determination included in the metadata generation rule, finally determines which of the numeric, character, categorical, and date types applies, and generates metadata (S230).

Whereas steps S210 and S220 use the field values of the data, step S230 uses the field name of the data. The metadata generation rule may include, in addition to the data type determination conditions described above, a field mapping table for data type determination, and step S230 uses this field mapping table to make the final determination of the data type.

FIG. 3 illustrates an example of the field mapping table. Referring to FIG. 3, the field mapping table may include a field name/full name mapping table, shown on the left of FIG. 3, in which one or more (full name ID, reference count) pairs are mapped to each of one or more field names, and a data type mapping table, shown on the right, in which the probability of being numeric, the probability of being a character type, the probability of being categorical, and the probability of being a date type are mapped to each of one or more full name IDs.

In the field name/full name mapping table, the field names CUSNUM, PRCS_DT, PRCSDT, and CSTMRGRAD are mapped to (2,1); (1,3)/(4,2)/(3,1); (1,3)/(3,1); and (5,2)/(2,1), respectively. A single field name may thus be mapped to several full name IDs. Because the table lists the full name IDs for each field name in descending order of reference count, the data type for a field name is found by referring to the full name ID with the highest reference count.

Consider an example. If the field name of the received data is CSTMRGRAD, the device looks up CSTMRGRAD in the field name/full name mapping table and finds full name ID 5, the full name ID with the highest reference count (2). It then looks up full name ID 5 in the data type mapping table and determines the type with the highest probability (0.75), the categorical type, to be the final type of the data. If the field name is PRCS_DT, the device looks up PRCS_DT in the field name/full name mapping table and finds full name ID 1, the full name ID with the highest reference count (3). It then looks up full name ID 1 in the data type mapping table and determines the type with the highest probability (0.90), the date type, to be the final type of the data.
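The two-table lookup of step S230 can be sketched with plain dictionaries. The (full name ID, reference count) pairs below are those quoted from FIG. 3. In the data type mapping table, only the highest probabilities (0.75 categorical for ID 5, 0.90 date for ID 1) come from the text; every other probability is a placeholder invented for illustration, and the names `FIELD_TO_FULL_NAME`, `TYPE_PROBABILITIES`, and `final_type` are hypothetical.

```python
from typing import Optional

# Field name -> list of (full name ID, reference count), as quoted from FIG. 3.
FIELD_TO_FULL_NAME = {
    "CUSNUM": [(2, 1)],
    "PRCS_DT": [(1, 3), (4, 2), (3, 1)],
    "PRCSDT": [(1, 3), (3, 1)],
    "CSTMRGRAD": [(5, 2), (2, 1)],
}

# Full name ID -> probability per type. Only 0.90 (date, ID 1) and
# 0.75 (categorical, ID 5) are from the text; the rest are placeholders.
TYPE_PROBABILITIES = {
    1: {"numeric": 0.03, "string": 0.05, "categorical": 0.02, "date": 0.90},
    5: {"numeric": 0.15, "string": 0.05, "categorical": 0.75, "date": 0.05},
}


def final_type(field_name: str) -> Optional[str]:
    """Return the most probable type for a field name, or None if unknown."""
    candidates = FIELD_TO_FULL_NAME.get(field_name)
    if not candidates:
        return None
    # Pick the full name ID with the highest reference count.
    full_name_id, _ = max(candidates, key=lambda pair: pair[1])
    probabilities = TYPE_PROBABILITIES.get(full_name_id)
    if probabilities is None:
        return None
    # Pick the type with the highest probability for that full name ID.
    return max(probabilities, key=probabilities.get)


print(final_type("CSTMRGRAD"))
print(final_type("PRCS_DT"))
```

Returning `None` for a field name that is missing from either table leaves room for the fallback handling that the description turns to next.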

The description above assumes that the field name of the received data is included in the field mapping table, more specifically in the field name/full name mapping table. It can easily happen that the field name of the received data is not included in the field name/full name mapping table, and for such cases the field mapping table may further include a full name information table.

FIG. 4 illustrates another example of the field mapping table. Referring to FIG. 4, the table additionally includes, at the upper right compared with FIG. 3, a full name information table containing the full name, field description, and reference count for each of one or more full name IDs. The full name IDs in the full name information table may correspond one-to-one with the full name IDs in the data type mapping table at the lower right.

Two cases then raise the question of how the data type should be determined: the case where the field name of the received data is not included in the field name/full name mapping table but the full name that the field name denotes is included in the full name information table, and the case where the field name is not included in the field name/full name mapping table and the full name is not included in the full name information table either. In these cases, before the data type is determined, the full name of the data field name must be set and recommended and the field name must be added to the field name/full name mapping table, as described below.

FIG. 5 is a flowchart showing the steps that precede the data type determination, in the method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention, in the case where the field name of the received data is not included in the field name/full name mapping table but the full name it denotes is included in the full name information table, and in the case where neither the field name is included in the field name/full name mapping table nor the full name in the full name information table. FIGS. 6 and 7 show the flowchart of FIG. 5 split into two parts because of the size constraints of the accompanying drawings.

First, the data type determination device 100 determines whether the field name/full name mapping table, which is included in the field mapping table governing the rules for setting and recommending full names of data field names, contains the field name of the received data (S300-1).

Here a field name is an abbreviation, and the description continues using the case where the field name of the received data is PRCS_DT as an example.

Referring again to FIG. 4, the left-hand table, the field name/full name mapping table, lists various field names. Because the field name of the received data is PRCS_DT, the data type determination device 100 determines whether the field name/full name mapping table contains the field name PRCS_DT and, since it appears in column 1, row 3, determines that it does.

If step S300-1 determines that the field name is included, the data type determination device 100 searches for the full name ID with the highest reference count for that field name (S300-2).

Referring again to FIG. 4, each field name in the left-hand table, the field name/full name mapping table, is matched with one or more (full name ID, reference count) pairs. The full name ID is an ID denoting a particular full name in the upper-right table, the full name information table, and the reference count is the number of times the field name has been used with that particular full name within the data type determination device 100. The higher the reference count for a particular full name ID, the more often the field name has been used with the full name corresponding to that ID. Searching for the full name ID with the highest reference count therefore serves to recommend the full name that, based on experience so far, the field name is most likely to denote.

Applied to the example above, in which the field name of the received data is PRCS_DT, step S300-2 searches for the highest reference count and selects full name ID 1, whose reference count is 3.

The data type determination device 100 then looks up the full name ID with the highest reference count in the full name information table included in the field mapping table and sets the full name mapped to it as the full name of the metadata field name for the received data (S300-3).

Because the full name ID with the highest reference count was found in step S300-2, looking that ID up in the full name information table yields the matching full name, which can be set as the full name of the metadata field name for the data.

Applied to the example above, full name ID 1 was found to be the full name ID with the highest reference count for the field name PRCS_DT. Full name ID 1 is therefore looked up in the full name information table shown at the upper right of FIG. 4, and the full name mapped to it, Process Date, can be set as the full name of the metadata field name PRCS_DT for the received data.

The setting made here also has the character of a recommendation. When it is not known exactly which full name an abbreviated field name denotes, the data type determination device 100 sets the full name most likely to apply on the basis of the reference count. Because this is only a recommendation, the user is not obliged to select the recommended full name as the full name of the metadata field name for the data and may of course enter one directly.

Once the full name of the metadata field name for the received data has been set, the data type determination device 100 increases by 1 the reference count, in the field name/full name mapping table, of the full name ID with the highest reference count for the field name of the data (S300-4), and increases by 1 the reference count, in the full name information table, of that same full name ID (S300-5).

Because the full name of the field name was set using the field mapping table, step S300-5 can be seen as bringing the field mapping table, more specifically the field name/full name mapping table and the full name information table, up to date. Applied to the earlier example, as shown in FIG. 8, in the field name/full name mapping table on the left, the reference count for full name ID 1, which has the highest reference count for the field name PRCS_DT, is increased by 1 from 3 to 4, and in the full name information table on the right, the reference count for full name ID 1 is increased by 1 from 11 to 12.
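The lookup and update in steps S300-2 to S300-5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: the two tables are modeled as plain dictionaries, the function name `recommend_and_update` is hypothetical, and the second candidate (full name ID 2) is placeholder data added only so there is something to compare against. The counts for PRCS_DT and Process Date follow the example above.

```python
def recommend_and_update(field_name, mapping, full_names):
    """Return the most-referenced full name for field_name and bump both counts.

    mapping:    {field_name: {full_name_id: reference_count}}   (assumed layout)
    full_names: {full_name_id: {"name": str, "refs": int}}      (assumed layout)
    """
    candidates = mapping.get(field_name)
    if not candidates:
        return None  # unknown field name: the S300-2' branch would apply instead
    best_id = max(candidates, key=candidates.get)   # S300-2: highest reference count
    full_name = full_names[best_id]["name"]         # look up the matching full name
    candidates[best_id] += 1                        # S300-4: mapping table count + 1
    full_names[best_id]["refs"] += 1                # S300-5: information table count + 1
    return full_name


mapping = {"PRCS_DT": {1: 3, 2: 1}}
full_names = {
    1: {"name": "Process Date", "refs": 11},
    2: {"name": "Placeholder Full Name", "refs": 1},  # hypothetical entry
}
print(recommend_and_update("PRCS_DT", mapping, full_names))  # Process Date
print(mapping["PRCS_DT"][1], full_names[1]["refs"])          # 4 12
```

The function returns `None` for a field name that is not in the mapping table, which leaves the caller free to fall back to asking the user.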

Steps S300-1 to S300-5 described above concern the case in which the field name of the received data is included in the field name/full name mapping table. The case in which it is not included is described below.

Referring again to FIG. 4 and following the NO branch at step S300-2, that is, if the determination in step S300-1 is that the field name is not included, the data type determination device 100 using a machine learning/deep learning model receives from the user one or more characters of the full name of the metadata field name for the received data (S300-2′).

Because the preset field name/full name mapping table does not contain the abbreviation corresponding to the field name, one or more characters are received directly from the user as a clue to the full name, so that a matching full name can be recommended from the full name information table. The description continues with the example in which the characters received from the user are the two letters Cu.

Thereafter, the data type determination device 100 using a machine learning/deep learning model predicts the characters or words that follow the one or more characters of the full name of the metadata field name received from the user, and recommends a completed full name (S300-3′).

Because the characters received from the user in step S300-2′ are Cu, the data type determination device 100 using a machine learning/deep learning model can recommend Customer Number and Customer Grade, which are the full names beginning with Cu among those in the full name information table.
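The recommendation in step S300-3′ can be approximated by simple prefix matching, sketched below in Python. The specification attributes the prediction to a machine learning/deep learning model; this sketch deliberately swaps in plain case-insensitive prefix matching to show the input and output only. The function name `recommend_by_prefix`, the dictionary layout, the ordering by reference count, and the reference count for Customer Number are all assumptions made for illustration.

```python
def recommend_by_prefix(typed, full_names):
    """Return full names starting with the typed characters, most referenced first.

    full_names: {full_name_id: {"name": str, "refs": int}}   (assumed layout)
    """
    prefix = typed.casefold()  # case-insensitive match is an assumption
    hits = [
        entry for entry in full_names.values()
        if entry["name"].casefold().startswith(prefix)
    ]
    # Ordering by reference count is an assumption; the text only lists the matches.
    hits.sort(key=lambda entry: entry["refs"], reverse=True)
    return [entry["name"] for entry in hits]


full_names = {
    1: {"name": "Process Date", "refs": 11},
    4: {"name": "Customer Number", "refs": 4},  # hypothetical ID and count
    5: {"name": "Customer Grade", "refs": 2},
}
print(recommend_by_prefix("Cu", full_names))  # ['Customer Number', 'Customer Grade']
```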

Once completed full names have been recommended, the data type determination device 100 using a machine learning/deep learning model determines whether the user selects a recommended completed full name (S300-4′). If one is selected, the device increases by 1 the reference count of the selected full name in the full name information table (S300-5′), and increases by 1 the reference count of the field name matched to the selected full name among the one or more field names in the field name/full name mapping table (S300-6′).

That the user selected a completed full name recommended in step S300-4′ means that the full name is included in the full name information table, which in turn implies that a field name matched to that full name is included in the field name/full name mapping table. Increasing each reference count by 1 therefore brings the field mapping table up to date.

Applied to the earlier example, if the user selects Customer Grade from the recommended full names Customer Number and Customer Grade, then, as shown in FIG. 9, the reference count of the full name Customer Grade in the full name information table on the right is increased by 1 from 2 to 3, and in the field name/full name mapping table on the left, the reference count of the field name CSTMRGRAD for full name ID 5, the full name ID of Customer Grade, is increased by 1 from 2 to 3.

Steps S300-2′ to S300-6′ described above concern the case in which no field name exactly matching the field name of the data is included in the field name/full name mapping table, but the full name for that field name is included in the full name information table, so that the mapping table contains a field name with substantially the same meaning. For example, if the field name of the data is CUGRAD, it is not in the field name/full name mapping table, so step S300-2′ is performed. If the characters received from the user for the full name of CUGRAD are CU and Customer Grade is selected from the recommended full names, CSTMRGRAD in the field name/full name mapping table is regarded as having been referenced in substance.

Separately, as shown in FIG. 10, the data type determination device 100 using a machine learning/deep learning model may also register a new field name by adding a field name not included in the field name/full name mapping table, for example CUGRAD, together with the matching full name ID, 5, and a reference count of 1. This presupposes that one full name can be mapped to multiple field names; conversely, one field name may of course be matched to multiple full names. Whether to increase by 1 the reference count of an existing field name with substantially the same meaning as the newly added one, for example CSTMRGRAD, may be left to an administrator setting; FIG. 10 illustrates the state in which it is not increased.
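The two update variants described for a selected recommendation (incrementing the existing field name as in FIG. 9, and registering the new field name as in FIG. 10) can be sketched together in Python. This is an illustrative sketch only: the function name `accept_recommendation`, the dictionary layout, and the two boolean flags standing in for the administrator setting are assumptions, not names from the specification.

```python
def accept_recommendation(field_name, chosen_id, mapping, full_names,
                          bump_existing=True, register_new_field=False):
    """Apply the count updates after the user picks a recommended full name.

    mapping:    {field_name: {full_name_id: reference_count}}   (assumed layout)
    full_names: {full_name_id: {"name": str, "refs": int}}      (assumed layout)
    """
    full_names[chosen_id]["refs"] += 1                 # S300-5'
    if bump_existing:                                  # S300-6' (FIG. 9)
        for other_name, ids in mapping.items():
            if other_name != field_name and chosen_id in ids:
                ids[chosen_id] += 1
    if register_new_field:                             # FIG. 10 variant
        # Add the unseen field name with a reference count of 1.
        mapping.setdefault(field_name, {}).setdefault(chosen_id, 1)


# FIG. 9 style: CSTMRGRAD is treated as referenced in substance.
mapping = {"CSTMRGRAD": {5: 2}}
full_names = {5: {"name": "Customer Grade", "refs": 2}}
accept_recommendation("CUGRAD", 5, mapping, full_names)
print(mapping, full_names[5]["refs"])  # {'CSTMRGRAD': {5: 3}} 3
```

Passing `bump_existing=False, register_new_field=True` instead reproduces the FIG. 10 state, in which CUGRAD is added with a count of 1 and the count for CSTMRGRAD is left unchanged.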

Next, the case in which the user does not select a completed full name recommended in step S300-4′ is described.

If it is determined in step S300-4′ that the user has not selected a recommended completed full name, the data type determination device 100 using a machine learning/deep learning model receives from the user the entire full name of the metadata field name for the received data (S300-5″), and determines whether the full name information table contains the full name received in its entirety from the user (S300-6″).

That the user did not select a recommended completed full name most likely means that the full name information table does not contain the full name the user intends for the field name. However, some users may deliberately not select a recommended full name even when it is contained, so for greater accuracy the full name received from the user is searched for once more in the full name information table. If it is determined to be contained, the data type determination device 100 using a machine learning/deep learning model increases by 1 the reference count of that full name in the full name information table (S300-7″). This is the same as described for step S300-5′, so a detailed description is omitted to avoid repetition. Step S300-8″, described below, is then performed.

If, on the other hand, it is determined in step S300-6″ that the full name is not contained, the data type determination device 100 using a machine learning/deep learning model adds the full name received in its entirety from the user, together with a full name ID for it, to the full name information table and sets its reference count to 1 (S300-7′″).

Thereafter, the data type determination device 100 using a machine learning/deep learning model determines whether the field name/full name mapping table contains the field name of the data mapped to the full name received in its entirety from the user (S300-8″). If it is determined not to be contained, the device adds to the field name/full name mapping table the full name received from the user or an abbreviation of it, adds the full name ID that the full name information table holds for that full name, and sets the reference count to 1 (S300-9″). If it is determined to be contained, the device increases by 1, for that field name, the reference count of the full name ID whose reference count was increased by 1 in step S300-7″ (S300-10″).

Step S300-9″ above brings the field name/full name mapping table up to date when the full name is included in the full name information table but its field name is not included in the field name/full name mapping table. Step S300-7′″ brings both the full name information table and the field name/full name mapping table up to date when the entry is included in neither. Updating to the latest state is as described above, so a detailed description is omitted to avoid repetition.
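The branch in which the user types the whole full name (steps S300-5″ to S300-10″) can be sketched in Python as follows. This is one possible reading of the steps, not a definitive implementation: the function name `register_typed_full_name`, the dictionary layout, the "next integer" ID allocation, and the exact-match lookup are assumptions, and the sample field name and full name are hypothetical.

```python
def register_typed_full_name(field_name, full_name, mapping, full_names):
    """Record a full name typed in full by the user and return its full name ID.

    mapping:    {field_name: {full_name_id: reference_count}}   (assumed layout)
    full_names: {full_name_id: {"name": str, "refs": int}}      (assumed layout)
    """
    # S300-6'': is the typed full name already in the full name information table?
    full_id = next(
        (fid for fid, entry in full_names.items() if entry["name"] == full_name),
        None,
    )
    if full_id is not None:
        full_names[full_id]["refs"] += 1                      # S300-7''
    else:
        full_id = max(full_names, default=0) + 1              # hypothetical ID scheme
        full_names[full_id] = {"name": full_name, "refs": 1}  # S300-7'''

    # S300-8'': does the mapping table already pair this field name with that ID?
    ids = mapping.get(field_name)
    if ids is None or full_id not in ids:
        mapping.setdefault(field_name, {})[full_id] = 1       # S300-9''
    else:
        ids[full_id] += 1                                     # S300-10''
    return full_id


mapping = {"PRCS_DT": {1: 3}}
full_names = {1: {"name": "Process Date", "refs": 11}}
new_id = register_typed_full_name("ORD_NO", "Order Number", mapping, full_names)
print(new_id, mapping["ORD_NO"], full_names[new_id])
```

Calling the function again with the same arguments takes the other two branches, incrementing both counts instead of creating new entries.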

After all of the processes in steps S300-1 to S300-5, S300-2′ to S300-6′, and S300-5″ to S300-10″ described above have completed, the data type determination device 100 using a machine learning/deep learning model rebuilds the field mapping table based on the field names and their reference counts and the full names and their reference counts. For the field name/full name mapping table, the mapping list for a given field name is re-sorted according to the updated reference counts.
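Re-sorting the mapping list by updated reference count can be sketched in a few lines of Python. The function name `reorder_mapping`, the dictionary layout, and the sample counts are assumptions for illustration; the sketch relies on Python dictionaries preserving insertion order.

```python
def reorder_mapping(mapping):
    """Return a copy of the mapping with each field's IDs sorted by count, descending.

    mapping: {field_name: {full_name_id: reference_count}}   (assumed layout)
    """
    return {
        field_name: dict(sorted(ids.items(), key=lambda kv: kv[1], reverse=True))
        for field_name, ids in mapping.items()
    }


mapping = {"PRCS_DT": {2: 1, 1: 4}}  # hypothetical counts after updates
print(reorder_mapping(mapping))      # {'PRCS_DT': {1: 4, 2: 1}}
```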

Once all of the processes in steps S300-1 to S300-5, S300-2′ to S300-6′, and S300-5″ to S300-10″ described above with reference to FIGS. 4 to 10 have completed, the field name and the full name of the received data can each be contained in the field name/full name mapping table and the full name information table, respectively. This holds both when the field name of the received data was not in the field name/full name mapping table but the full name it denotes was in the full name information table, and when neither the field name was in the field name/full name mapping table nor the full name was in the full name information table. The full name ID can therefore be looked up and applied to the data type mapping table to find the data type with the highest probability value, which is determined to be the final type of the data.
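The final lookup, in which a full name ID is applied to the data type mapping table and the type with the highest probability value is chosen, can be sketched as below. The layout of the data type mapping table is not spelled out in this passage, so the dictionary shape, the function name `final_data_type`, and the probability values are all assumptions.

```python
def final_data_type(full_name_id, type_table):
    """Return the data type with the highest probability for a full name ID.

    type_table: {full_name_id: {data_type: probability}}   (assumed layout)
    """
    probabilities = type_table[full_name_id]
    return max(probabilities, key=probabilities.get)


# Hypothetical probabilities for full name ID 1 (Process Date).
type_table = {1: {"date": 0.9, "character": 0.07, "numeric": 0.02, "categorical": 0.01}}
print(final_data_type(1, type_table))  # date
```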

The description now returns to FIG. 2.

Once the type of the received data has been finally determined and the metadata generated, the generated metadata is learned by the machine learning/deep learning model to update the metadata generation rule (S240).

As noted earlier, the metadata generation rule includes both the data type determination conditions and the field mapping table, which in turn comprises the field name/full name ID mapping table, the data type mapping table, and the full name information table. Step S240 can thus be regarded as the step of updating all of these according to the metadata generation result to bring them to the latest learned state. FIG. 11 illustrates, by way of example, the data type determination device 100 using a machine learning/deep learning model receiving data, generating metadata by applying the metadata generation rule, and then updating the metadata generation rule by learning the result.

The method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention has been described above. According to the present invention, the data type determination device 100 using a machine learning/deep learning model can determine the type of received data to be one of the numeric, character, categorical, and date types through a three-stage determination process. Because data types can be determined easily and conveniently without handling by specialist personnel, unnecessary expenditure of time and cost can be avoided, and the preprocessing of the data can likewise be carried out easily. In addition, because the metadata generation results are continuously learned by the machine learning/deep learning model to update the metadata generation rule, the accuracy of subsequent metadata generation, and more specifically of data type determination, can be improved.

Although not described separately, in the method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention described above, for the metadata generation rule to include the data type determination conditions, the artificial intelligence model needs to be a machine learning/deep learning-based NLP model trained in advance on natural language sentences.

Furthermore, as shown in FIG. 12, the method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention may further include, after step S240: a step of determining whether the type of data finally determined in step S230 is either a numeric type (in particular, a field indicating an order, such as an integer type) or a date type (S250); a step of, if step S250 determines that it is either the numeric type or the date type, sorting the field values of the received data and determining whether they include duplicate field values (S260); a step of, if step S260 determines that duplicate field values are included, calculating the interval value between each pair of neighboring field values among the sorted field values (S270); a step of determining whether the proportion of the most frequent interval value among the calculated interval values is equal to or greater than a predetermined ratio (S280); and a step of, if step S280 determines that the proportion is equal to or greater than the predetermined ratio, determining that the received data is a time-series dataset (S290). The method may further include: a step of determining that the received data is a non-time-series dataset if step S250 determines that the type is neither the numeric type nor the date type; a step of determining that the received data is a time-series dataset if step S260 determines that no duplicate field values are included; and a step of determining that the received data is a non-time-series dataset if step S280 determines that the proportion is below the predetermined ratio. In this way, it can also be determined whether the dataset is a time-series dataset or a non-time-series dataset.
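The decision flow of steps S250 to S290 can be sketched in Python as follows. This is a minimal sketch that follows the branches exactly as stated in the text; the function name `is_time_series`, the string labels for the data types, and the default threshold of 0.8 for the "predetermined ratio" are assumptions, since the specification does not give a concrete value.

```python
from collections import Counter


def is_time_series(values, data_type, min_ratio=0.8):
    """Classify a field as time-series (True) or non-time-series (False).

    values:    field values that support sorting and subtraction
    data_type: final type label; "numeric" and "date" are assumed labels
    min_ratio: the "predetermined ratio"; 0.8 is a hypothetical default
    """
    if data_type not in ("numeric", "date"):               # S250
        return False                                       # non-time-series
    ordered = sorted(values)                               # S260
    if len(set(ordered)) == len(ordered):
        return True                                        # no duplicates: time-series
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]   # S270
    top_count = Counter(gaps).most_common(1)[0][1]         # most frequent interval
    return top_count / len(gaps) >= min_ratio              # S280 / S290


print(is_time_series([3, 1, 2, 4], "numeric"))           # True  (no duplicates)
print(is_time_series([1, 1, 2, 2, 3, 3], "numeric"))     # False (top interval 3 of 5)
print(is_time_series(["a", "b"], "character"))           # False (not numeric or date)
```

Date values work the same way as long as subtracting two of them yields a hashable difference, as `datetime.date` does.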

The method of automatically determining the type of data and generating metadata according to the second embodiment of the present invention described above can be implemented as a computer program stored on a medium according to a third embodiment of the present invention, which includes all of the same technical features. Although not described in detail to avoid repetition, all technical features applied to the method according to the second embodiment can of course be applied equally to the computer program stored on a medium according to the third embodiment.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

100: Data type determination device using a machine learning/deep learning model
10: Processor
20: Network interface
30: Memory
40: Storage
41: Computer program
50: Data bus

Claims

A method by which a data type determination device using a machine learning/deep learning model automatically determines the type of data, the type of data including numeric, character, categorical, and date types, and generates metadata, the method comprising:
(a) primarily determining whether a field value of received data is of the date type;
(b) if it is primarily determined in step (a) that the field value is not of the date type, secondarily determining whether it is of either the categorical type or the date type by applying, to the field value of the received data, a data type determination condition included in a metadata generation rule;
(c) if it is secondarily determined in step (b) that the field value is of neither the categorical type nor the date type, finally determining which of the numeric, character, categorical, and date types it is by applying the field name of the received data to a field mapping table for data type determination included in the metadata generation rule, and generating metadata; and
(d) updating the metadata generation rule by learning the generated metadata with a machine learning/deep learning model,
the method further comprising, after step (d):
(e) determining whether the data type finally determined in step (c) is either the numeric type or the date type;
(f) if it is determined in step (e) that the data type is either the numeric type or the date type, sorting the field values of the received data and determining whether they include duplicate field values;
(g) if it is determined in step (f) that duplicate field values are included, calculating an interval value between each two neighboring field values among the sorted field values;
(h) determining whether the proportion of the most frequent interval value among the calculated interval values is equal to or greater than a predetermined ratio; and
(i) if it is determined in step (h) that the proportion is equal to or greater than the predetermined ratio, determining that the received data is a time-series dataset.

The method of claim 1,
wherein the determination in step (a) is based on a published programming language.

The method of claim 1,
wherein the data type determination condition included in the metadata generation rule in step (b) comprises
a second condition as to whether the field value of the received data includes one or more of a year pattern, a month pattern, a day pattern, an hour pattern, a minute pattern, and a second pattern, and
wherein, if the field value includes one or more patterns according to the second condition, the type of the received data is the date type.

The method of claim 1, further comprising:
(f´) if it is determined in step (e) that the data type is neither the numeric type nor the date type, determining that the received data is a non-time-series dataset;
(g´) if it is determined in step (f) that no duplicate field values are included, determining that the received data is a time-series dataset; and
(i´) if it is determined in step (h) that the proportion is below the predetermined ratio, determining that the received data is a non-time-series dataset.

A data type determination device using a machine learning/deep learning model, comprising:
one or more processors;
a network interface;
a memory into which a computer program executed by the one or more processors is loaded; and
a storage that stores large-capacity network data and the computer program,
wherein the computer program, when executed by the one or more processors, performs:
(A) an operation of primarily determining whether a field value of received data is of a date type;
(B) an operation of, if it is primarily determined in operation (A) that the field value is not of the date type, secondarily determining whether it is of either a categorical type or the date type by applying, to the field value of the received data, a data type determination condition included in a metadata generation rule;
(C) an operation of, if it is secondarily determined in operation (B) that the field value is of neither the categorical type nor the date type, finally determining which of a numeric type, a character type, the categorical type, and the date type it is by applying the field name of the received data to a field mapping table for data type determination included in the metadata generation rule, and generating metadata; and
(D) an operation of updating the metadata generation rule by learning the generated metadata with a machine learning/deep learning model,
and further performs, after operation (D):
(E) an operation of determining whether the data type finally determined in operation (C) is either the numeric type or the date type;
(F) an operation of, if it is determined in operation (E) that the data type is either the numeric type or the date type, sorting the field values of the received data and determining whether they include duplicate field values;
(G) an operation of, if it is determined in operation (F) that duplicate field values are included, calculating an interval value between each two neighboring field values among the sorted field values;
(H) an operation of determining whether the proportion of the most frequent interval value among the calculated interval values is equal to or greater than a predetermined ratio; and
(I) an operation of, if it is determined in operation (H) that the proportion is equal to or greater than the predetermined ratio, determining that the received data is a time-series dataset.