KR102590576B1

KR102590576B1 - Dynamic data structure search method using data semantic classification

Info

Publication number: KR102590576B1
Application number: KR1020230051064A
Authority: KR
Inventors: 오경조; 이종규; 박희성; 이진연
Original assignee: 주식회사 에이오디컨설팅
Priority date: 2023-04-19
Filing date: 2023-04-19
Publication date: 2023-10-24

Abstract

The present invention relates to a dynamic data structure search method using data semantic classification, comprising: a first step of parsing input data based on basic delimiters; a second step of classifying the data classified through the first step into domains according to semantic classification; a third step of scoring the structure of the input data by applying the range of the data parsed in the first step and the domain range classified in the second step; and a fourth step of repeatedly combining or separating data structures for the data scored in the third step until the score is highest, thereby obtaining the structure with the highest score. The second step comprises a step 2-1 of determining a column semantic classification by assigning columns to the classified data according to semantic categories, and a step 2-2 of determining a language model, based on artificial intelligence, according to the column semantic classification determined in step 2-1.

Description

Dynamic data structure search method using data semantic classification

The present invention relates to a dynamic data structure search method using data semantic classification and, more specifically, to a dynamic data structure search method that makes it easy to identify the structure of various data received from frequently changing dynamic (unstructured) data sources.

In a typical computer-based classification system, the statistical characteristics of given data are analyzed to determine the type of new data judged to share those characteristics. To improve classification performance, the given data must first be processed to select highly reliable data. This process is called data preprocessing: a preliminary operation that selects the partial data of current interest, or standardizes the data to separate out unnecessary information, so that essential information can be extracted easily from the given input data.

Preprocessing includes operations such as data normalization, data selection, and noise removal. After preprocessing, the original data needs to be converted into new, lower-dimensional data in order to reduce the amount of computation to an appropriate level and improve data quality. The new data obtained through dimension reduction are called features of the original data, and the process of extracting them is called feature extraction. Once features have been extracted, a classifier is used to finally determine the type of the given data.

What must be considered in feature extraction for data classification is that the extracted features contain enough information to classify the data by type. Feature extraction is broadly divided into global and local methods. A method in which all dimensions of the original data are considered when extracting features is called a global method; in contrast, a method that considers only some dimensions of the original data and the geometric relationships among them is called a local method.

Meanwhile, unstructured data refers to data, such as text, images, and video, that does not follow a predefined structure. Unstructured data takes many forms, including news, comments, SNS data, emails, and reports, and arrives through many channels.

Companies, institutions, and individuals produce unstructured data every hour of every day. However, most of it is never classified and goes unused. Analysis is essential for such unstructured data to become meaningful and valuable information.

The first method of analyzing unstructured data is classification analysis or clustering analysis, and the second is categorizing the data into specific categories. Both methods have so far relied on manual and automated processing, but applying them to individual industrial fields remains difficult.

In general, the performance of an automatic classification system for text documents tends to depend more on the feature selection algorithm than on the learning algorithm itself. Feature selection is a technique that picks out, from the features (or words) present in the training documents, only those that contribute to distinguishing between categories.

However, as the volume of information and documents to be processed grows larger and more complex, manual processing not only slows the delivery of news that must be delivered quickly but also raises costs because of the human resources required. The need for automated document classification is therefore increasing.

To automate document classification, previous work has used statistical classification methods that assign a suitable category simply from the frequency of words appearing in a document, or has extracted the key words needed for classification and applied data mining algorithms such as KNN, decision trees, Bayesian networks, and artificial neural networks to the extracted words.

However, at various industrial sites, the many kinds of unstructured data generated in specific industries across facilities, factories, and companies are still processed manually by workers for standardization, which causes time and cost problems.

KR Registered Patent No. 10-1247307
KR Registered Patent No. 10-1408345
KR Registered Patent No. 10-1588431
KR Registered Patent No. 10-2367859
KR Registered Patent No. 10-1843066
KR Registered Patent No. 10-2175176
KR Registered Patent No. 10-2008845
KR Registered Patent No. 10-2461857
KR Registered Patent No. 10-2496030
KR Registered Patent No. 10-2459971
KR Registered Patent No. 10-2465571

To solve the above problems, an object of the present invention is to automatically apply data quality rules appropriate to the characteristics of the data according to the results of data structure analysis.

In particular, the present invention is a method of analyzing the meaning and structure of data by identifying the characteristics of data values. Its object is to automate tasks such as domain classification and the setting of quality evaluation rules for data subject to data quality evaluation, which have until now been configured manually on the basis of deliverables such as the data dictionary, table definitions, column definitions, and code definitions.

To achieve the above object, the present invention comprises: a first step of parsing input data based on basic delimiters; a second step of classifying the data classified through the first step into domains according to semantic classification; a third step of scoring the structure of the input data by applying the range of the data parsed in the first step and the domain range classified in the second step; and a fourth step of repeatedly combining or separating data structures for the data scored in the third step until the score is highest, thereby obtaining the structure with the highest score. The second step comprises a step 2-1 of determining a column semantic classification by assigning columns to the classified data according to semantic categories, and a step 2-2 of determining a language model, based on artificial intelligence, according to the column semantic classification determined in step 2-1.

In addition, step 2-1 further comprises, before the column semantic classification is determined, a term separation and term standardization step for previously input data and a word collection and word standardization step for new words, wherein the term standardization step and the new-word standardization step perform standardization covering both Korean and English.

In addition, the data processing system comprises a data input unit that receives and stores externally input training data or new data, a natural language processing unit that determines the language model of the input data, a column analysis unit that analyzes the column semantic classification of the input data, a domain classification unit that classifies domains according to semantic classification, and a structure processing unit that processes an optimized structure through data structure scoring.

The present invention, configured and operating as described above, has the advantage that the structure of unstructured data from sources arising in various industries, such as IoT, sensor, log, XML, and JSON data, can be identified automatically using data semantic classification.

FIG. 1 is a flowchart of the dynamic data structure search method using data semantic classification according to the present invention;
FIG. 2 is a detailed flowchart for semantic classification in the dynamic data structure search method using data semantic classification according to the present invention;
FIG. 3 is a detailed configuration diagram of the data processing system in the dynamic data structure search method using data semantic classification according to the present invention;
FIG. 4 is a diagram showing an example of semantic classification in the dynamic data structure search method using data semantic classification according to the present invention.

Hereinafter, the dynamic data structure search method using data semantic classification according to the present invention will be described in detail with reference to the accompanying drawings.

In this specification, a 'unit' includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware. A 'unit' is not limited to software or hardware; it may be configured to reside in an addressable storage medium or configured to run on one or more processors. Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within components and 'units' may be combined into a smaller number of components and 'units' or further separated into additional components and 'units'. In addition, components and 'units' may be implemented to run on one or more CPUs within a device or a secure multimedia card.

The "terminal" mentioned below may be implemented as a computer or portable terminal capable of connecting to a server or another terminal through a network. Here, the computer may include, for example, a notebook, desktop, or laptop equipped with a web browser, or a VR HMD (for example, HTC VIVE, Oculus Rift, GearVR, DayDream, PSVR).

Here, VR HMDs include models for PC (for example, HTC VIVE, Oculus Rift, FOVE, Deepon), for mobile (for example, GearVR, DayDream, Baofeng Mojing, Google Cardboard), and for console (PSVR), as well as standalone models implemented independently (for example, Deepon, PICO). A portable terminal is, for example, a wireless communication device that ensures portability and mobility, and may include not only smartphones, tablet PCs, and wearable devices but also various devices equipped with communication modules such as Bluetooth (BLE, Bluetooth Low Energy), NFC, RFID, ultrasonic, infrared, Wi-Fi, and LiFi.

In addition, a "network" refers to a connection structure that allows information to be exchanged between nodes such as terminals and servers, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The dynamic data structure search method using data semantic classification according to the present invention comprises: a first step of parsing input data based on basic delimiters; a second step of classifying the data classified through the first step into domains according to semantic classification; a third step of scoring the structure of the input data by applying the range of the data parsed in the first step and the domain range classified in the second step; and a fourth step of repeatedly combining or separating data structures for the data scored in the third step until the score is highest, thereby obtaining the structure with the highest score. The second step comprises a step 2-1 of determining a column semantic classification by assigning columns to the classified data according to semantic categories, and a step 2-2 of determining a language model, based on artificial intelligence, according to the column semantic classification determined in step 2-1.

The dynamic data structure search method using data semantic classification according to the present invention has as its main technical object a data classification method that identifies the structure of various data (IoT, sensor, log, etc.) received from frequently changing dynamic (unstructured) data sources in continuous process industries. It does so through data structure analysis based on column semantic classification, using a deep-learning-based natural language processing (NLP) algorithm that applies semantic classification and an information-linking algorithm so that the structure of unstructured data can be identified automatically.

FIG. 1 is a flowchart of the dynamic data structure search method using data semantic classification according to the present invention.

The present invention broadly consists of a first step (S100) of parsing input data, a second step (S200) of classifying the parsed data by domain, a third step (S300) of scoring the classified data structure, and a fourth step (S400) of repeating the process of combining and separating data structures until the score for the data as a whole is highest, finally obtaining the most suitable structure.

The main technical feature of the present invention is that classification criteria for dynamic data are defined on the basis of column semantic classification, so that unstructured data sources carrying the various kinds of dynamic data used at industrial sites (IoT, sensor, log, XML, JSON, etc.) are received and the structure of the unstructured data is interpreted automatically using data semantic classification.

The first step (S100) parses the input data based on basic delimiters. The input data is preferably dynamic (unstructured) data acquired from various industrial sites or facilities. The dynamic data is classified through semantic classification and then used to interpret the data structure; the semantic classification method is described in detail below.
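As an illustration only, the first step could be sketched as follows. The specification does not list which delimiters count as "basic", so the candidate set `CANDIDATE_DELIMITERS` and the function names `guess_delimiter` and `parse` are assumptions introduced for this sketch.

```python
from collections import Counter

# Hypothetical candidate delimiters; the specification does not enumerate them.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|", " "]


def guess_delimiter(lines):
    """Pick the candidate that most consistently splits lines into 2+ fields."""
    if not lines:
        return None
    best, best_key = None, (0.0, 0)
    for delim in CANDIDATE_DELIMITERS:
        counts = Counter(len(line.split(delim)) for line in lines)
        width, freq = counts.most_common(1)[0]
        if width < 2:
            continue  # this delimiter does not split the typical line at all
        key = (freq / len(lines), width)  # consistency first, then field count
        if key > best_key:
            best, best_key = delim, key
    return best


def parse(lines):
    """Split each non-empty line into fields using the guessed delimiter."""
    rows = [line.strip() for line in lines if line.strip()]
    delim = guess_delimiter(rows)
    if delim is None:
        return [[row] for row in rows]
    return [[field.strip() for field in row.split(delim)] for row in rows]
```

For a log line such as `2023-04-10 10:00:01|sensor-1|23.5`, both the space and the pipe split every line consistently; the sketch prefers the pipe because it yields more fields, which is a design choice of this example rather than something stated in the specification.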

The second step (S200) classifies the data into domains according to the classification. In domain classification, a domain value is assigned to each item of data classified by semantic classification, so that the data has table attribute values.

As described above, semantic classification according to the present invention classifies columns that share the same category value. Examples of columns, that is, of classification criteria, include phone number, date, product number, email, name, address, company name, task name, and customer information. In other words, the column semantic classification is determined by classifying the column meanings of the received training data, so that column meanings from column A to column Z can be classified according to the received training data. This process corresponds to step 2-1 (S210), as shown in FIG. 2.

To determine the column semantic classification, the method further comprises a term separation and term standardization step for previously input training data and a word collection and word standardization step for new words. The term standardization step and the new-word standardization step preferably cover both Korean and English.
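One possible reading of term separation, term standardization, and new-word collection is sketched below. The dictionary `STANDARD_WORDS` and the functions `split_terms` and `standardize` are hypothetical; the specification does not describe how terms are separated or which standard words are used, and real Korean compound words would need morphological analysis that this sketch does not attempt.

```python
import re

# Hypothetical standard-word dictionary mapping Korean and English variants
# to one standard term; not taken from the specification.
STANDARD_WORDS = {
    "tel": "phone", "phone": "phone", "전화": "phone",
    "no": "number", "num": "number", "number": "number", "번호": "number",
    "cust": "customer", "customer": "customer", "고객": "customer",
}


def split_terms(name):
    """Split a name on underscores, hyphens, spaces, and camelCase boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]


def standardize(name):
    """Return (standardized terms, terms missing from the dictionary)."""
    terms = split_terms(name)
    standardized = [STANDARD_WORDS.get(t, t) for t in terms]
    new_words = [t for t in terms if t not in STANDARD_WORDS]
    return standardized, new_words
```

The second return value lists terms that are not yet in the dictionary, which is where a word collection step for new words could pick them up.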

Once the column semantic classification is determined, step 2-2 (S220) is performed, in which a language model is determined, based on artificial intelligence, according to the column semantic classification criteria. The language model corresponds to classification criteria for various languages, so that the large volume of input training data can be classified according to the column semantic classification criteria. In other words, it is a language classification criterion forming a subgroup of the column semantic classification. For example, the date April 10, 2023 may be entered in various forms such as "20230410", "230410", "23-4-10", or "23/04/10"; processing these so that they all fall under the date column semantic classification is what is meant by determining the language model.
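The date example above can be illustrated with a small rule-based stand-in. This is not the artificial-intelligence-based language model the specification refers to; it only shows how the four notations could be mapped to one value. The list `DATE_FORMATS` and the function `to_date` are assumptions, and the sketch assumes year-month-day order with two-digit years interpreted by Python's `%y` rule.

```python
from datetime import date, datetime

# Assumed year-month-day layouts, tried in order; not from the specification.
DATE_FORMATS = ["%Y%m%d", "%y%m%d", "%Y-%m-%d", "%y-%m-%d", "%Y/%m/%d", "%y/%m/%d"]


def to_date(text):
    """Return a date if the text matches one of the assumed layouts, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # layout did not fit; try the next one
    return None


# All four notations from the text resolve to the same date.
samples = ["20230410", "230410", "23-4-10", "23/04/10"]
assert all(to_date(s) == date(2023, 4, 10) for s in samples)
```

A value that matches none of the layouts returns `None`, which a caller could treat as "not a date" when deciding the column semantic classification.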

Natural language processing techniques are one example of such a language model: to handle various terms according to column meaning, the language model is determined through natural language processing, which allows the column semantic classification to be carried out.

Next, the third step (S300) scores the data structure by applying the data range and the domain range to the data classified into domains in the second step. The fourth step (S400) then applies transformations such as combining and separating data structures to the structure scored in the third step until the score is highest, finally obtaining the most suitable structure.
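The scoring and search of the third and fourth steps can be sketched as follows, purely as an illustration. The specification does not give the scoring formula, so the `score` function here (mean column purity against a few hypothetical regular-expression classes) is an assumption, as are `PATTERNS`, `neighbors`, and `best_structure`. The search shown is plain greedy hill climbing over non-empty rectangular rows, and it can stop at a local optimum.

```python
import re

# Hypothetical semantic classes; the specification does not define them.
PATTERNS = {
    "datetime": re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
    "number": re.compile(r"-?\d+(\.\d+)?"),
    "identifier": re.compile(r"[A-Za-z][\w-]*"),
}


def column_purity(values):
    """Largest fraction of values in one column that fully match a single class."""
    return max(
        sum(1 for v in values if p.fullmatch(v)) / len(values)
        for p in PATTERNS.values()
    )


def score(rows):
    """Assumed score: mean column purity (not the patent's scoring formula)."""
    columns = list(zip(*rows))
    return sum(column_purity(col) for col in columns) / len(columns)


def neighbors(rows):
    """Yield structures reachable by one merge of adjacent columns or one split."""
    width = len(rows[0])
    for i in range(width - 1):  # combine columns i and i + 1
        yield [r[:i] + [r[i] + " " + r[i + 1]] + r[i + 2:] for r in rows]
    for i in range(width):  # separate column i at its first space
        if all(" " in r[i] for r in rows):
            yield [r[:i] + r[i].split(" ", 1) + r[i + 1:] for r in rows]


def best_structure(rows):
    """Greedy hill climbing: move to the best neighbor while the score improves."""
    current, current_score = rows, score(rows)
    while True:
        candidates = [(score(n), n) for n in neighbors(current)]
        if not candidates:
            return current, current_score
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score <= current_score:
            return current, current_score
        current, current_score = top, top_score
```

Given rows that were split on every space, such as `["2023-04-10", "10:00:01", "sensor-1", "23.5"]`, the date and time fragments match no class on their own, so merging them into a single datetime column raises the score and the search settles on three columns.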

FIG. 3 is a configuration diagram of a system for the dynamic data structure search method using data semantic classification according to the present invention.

The data processing system (100) that implements the dynamic data structure search method according to the present invention comprises a data input unit (110) that receives and stores externally input training data or new data, a natural language processing unit (120) that determines the language model of the input data, a column analysis unit (130) that analyzes the column semantic classification of the input data, a domain classification unit (140) that classifies domains according to semantic classification, and a structure processing unit (150) that processes an optimized structure through data structure scoring.

The natural language processing unit (120) determines a language model for the training data and/or new data. Natural language processing broadly proceeds through preprocessing, tokenizing, lexical analysis, syntactic analysis, and semantic analysis.

The column analysis unit (130) determines the column semantic classification by classifying the input data into columns by meaning, based on a predefined column structure.

The domain classification unit (140) classifies the classified data into domains by semantic classification. The domain to which data belongs can be determined from the data type, data length, and similar properties applied to the data.
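A minimal sketch of deriving a domain from data type and data length is given below. The three type names, the patterns, and the function `infer_domain` are assumptions for illustration; the specification does not define the set of domains or how type and length are combined.

```python
import re

# Assumed value patterns; the specification does not define the type set.
INTEGER = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?\d+\.\d+")


def value_type(value):
    """Classify one string value as integer, decimal, or string."""
    if INTEGER.fullmatch(value):
        return "integer"
    if DECIMAL.fullmatch(value):
        return "decimal"
    return "string"


def infer_domain(values):
    """Return (data type, maximum length) for a non-empty column of strings."""
    types = {value_type(v) for v in values}
    if types == {"integer"}:
        data_type = "integer"
    elif types <= {"integer", "decimal"}:
        data_type = "decimal"  # mixed integers and decimals widen to decimal
    else:
        data_type = "string"
    return data_type, max(len(v) for v in values)
```

For example, a column holding `"1.5"` and `"20"` is widened to the decimal type with length 3, while a single non-numeric value makes the whole column a string domain.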

The structure processing unit (150) scores the structure of the input data by applying the range of the data parsed in the first step and the domain range classified in the second step, and repeatedly combines or separates data structures for the data scored in the third step until the score is highest, thereby processing the data so as to obtain the structure with the highest score.

Meanwhile, the data processing system (100) includes an artificial intelligence processing unit. The artificial intelligence processing unit uses a neural network to process, on an artificial intelligence basis, the data handled by the column analysis unit and the natural language processing unit, and may be implemented through an external system or the system itself.

FIG. 4 shows an example of column semantic classification in the dynamic data structure search method using data semantic classification according to the present invention. As shown, per-meaning column criteria of phone number, person name, company name, and email address are set as the criteria for classifying the meaning of input columns, and externally supplied input data is assigned to columns according to these column semantic classification criteria.
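The four criteria named for FIG. 4 can be illustrated with crude pattern rules. These rules are a stand-in for the trained model the specification describes, not a reproduction of it: `RULES`, `classify_value`, and `classify_column` are hypothetical, the patterns are deliberately simple, and they will misclassify many real values (for instance, a hyphenated date also fits the phone pattern).

```python
import re
from collections import Counter

# Hypothetical rules standing in for a trained model; illustrative only.
RULES = [
    ("email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("phone_number", re.compile(r"\+?\d[\d\- ]{7,}\d")),
    ("company_name", re.compile(r".+\b(Inc|Corp|Ltd|LLC|Co)\.?", re.IGNORECASE)),
    ("person_name", re.compile(r"[A-Z][a-z]+( [A-Z][a-z]+){1,2}")),
]


def classify_value(value):
    """Return the first rule label whose pattern matches the whole value."""
    for label, pattern in RULES:
        if pattern.fullmatch(value.strip()):
            return label
    return "unknown"


def classify_column(values):
    """Assign a column the label that most of its values receive."""
    return Counter(classify_value(v) for v in values).most_common(1)[0][0]
```

Classifying by majority vote over a column, rather than value by value, lets a few irregular entries be absorbed without changing the column's semantic class.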

Column semantic classification can use a variety of criteria for each meaning. As described above, input data is assigned to columns according to the column semantic classification criteria, following the classification of the language model, and the input data is thereby standardized. In particular, by classifying unstructured data according to column semantic classification, the present invention can standardize the unstructured data generated at various industrial sites and improve data quality.

The present invention configured as described above has the advantage of using data semantic classification to automatically identify the structure of unstructured data from sources found in various industries, such as IoT, sensor, log, XML, and JSON data.

Although the present invention has been described and illustrated in connection with preferred embodiments for illustrating its principles, the present invention is not limited to the exact construction and operation so shown and described. Rather, those skilled in the art will appreciate that many changes and modifications can be made to the present invention without departing from the spirit and scope of the appended claims. Accordingly, all such appropriate changes, modifications, and equivalents shall be considered to fall within the scope of the present invention.

100: Data processing system
110: Data input unit
120: Natural language processing unit
130: Column analysis unit
140: Domain classification unit
150: Structure processing unit

Claims

A dynamic data structure search method using data semantic classification, performed by a data processing system comprising a data input unit that receives and stores training data or new data input from outside, a natural language processing unit that determines the language model of the input data, a column analysis unit that analyzes the column semantic classification of the input data, a domain classification unit that classifies domains according to the semantic classification, and a structure processing unit that processes an optimized structure through data structure scoring, the method comprising:
A first step of parsing based on the basic delimiter of the input data;
A second step of classifying the data classified through the first step into each domain according to semantic classification;
A third step of scoring the structure of the input data by applying the range of data parsed in the first step and the domain range classified in the second step; and
A fourth step of obtaining the structure with the highest score by repeatedly combining or separating data structures for the data scored in the third step until the score is the highest,
wherein the second step includes step 2-1 of determining the column semantic classification by assigning columns according to semantic categories in order to classify the classified data into columns by meaning, and step 2-2 of determining a language model based on artificial intelligence according to the column semantic classification determined in step 2-1,
wherein step 2-1 further includes, before determining the column semantic classification, a term separation and term standardization step for the input data, and a word collection and word standardization step for new words,
wherein the term standardization step and the new-word standardization step perform standardization covering both Korean and English, and
wherein the data processing system includes a data input unit that receives and stores training data or new data input from outside, a natural language processing unit that determines the language model of the input data, a column analysis unit that analyzes the column semantic classification of the input data, a domain classification unit that classifies domains according to the semantic classification, and a structure processing unit that processes an optimized structure through data structure scoring.

delete