KR20190021877A - Method and system for ontology-based big data harness for accelerated machine learning and big data analytics

Info

Publication number: KR20190021877A
Application number: KR1020170107245A
Authority: KR
Inventors: 백옥기
Original assignee: 백옥기
Priority date: 2017-08-24
Filing date: 2017-08-24
Publication date: 2019-03-06

Abstract

The present invention relates to a method that dynamically generates and manages metadata based on a data taxonomy and an ontology, so that the semantic properties of data are represented automatically. The method also provides a smart data framework, or intelligent data framework, that accelerates machine learning or deep learning by removing the need to develop separate data access and preprocessing software for each big data source.

Description

The present invention relates to an ontology-based method and system for accessing and utilizing big data in order to accelerate machine learning and big data analytics.

The present invention relates to information technology and related fields, and more particularly to a data management and utilization method and system that accelerates machine learning and cognitive analytics by making the access, ingestion, and utilization of the big data used in machine learning or "deep learning", cognitive computing, or cognitive analytics more effective.

Because standardization of the semantic properties (content and meaning) of big data is insufficient or absent, and standardization of the syntactic aspects of data is also insufficient, machine learning that uses vast, multidimensional, and heterogeneous data has been held back by persistent challenges, and this remains a technical gap.

The present invention aims to solve the problems of accessing multidimensional, heterogeneous big data by systematically presenting the semantic properties of data through the dynamic generation and management of metadata based on a data taxonomy and an ontology.

The present invention enables machine learning algorithms to access, ingest, and analyze big data without separate software having to be developed for each of the many data adapters and preprocessors, and without prior knowledge of the individual data. Using a "smart data interface", data analysis software or machine learning algorithms can systematically discover the semantic and syntactic properties of big data at access and ingestion time, without prior knowledge of that data.

By eliminating the need to develop software for individual data adapters and preprocessors in order to use big data, the present invention provides a framework for accelerating machine learning or deep learning and improving its efficiency.

The present invention is used to automatically generate and manage, at data collection time, metadata that describes the semantic properties of the data based on a data taxonomy and an ontology. In other words, when data is collected, metadata on its semantic and syntactic properties is generated automatically, and that metadata is loosely and logically coupled to the big data by means of a universal resource locator (URL). A "smart data interface" is provided so that analysis software or machine learning algorithms can discover the semantic and syntactic properties of the data without prior knowledge of it.
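The loose coupling described above can be illustrated with a minimal Python sketch. The `Metadata` class, its field names, and the example URL are assumptions made for illustration; the patent does not prescribe a concrete record layout.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical metadata record; all field names are illustrative."""
    semantic: dict = field(default_factory=dict)   # content and meaning
    syntactic: dict = field(default_factory=dict)  # structure and format
    url: str = ""                                  # loose link to the payload


def generate_metadata(semantic: dict, syntactic: dict, url: str) -> Metadata:
    # Generated at collection time; the payload itself is never embedded,
    # only referenced through its URL.
    return Metadata(semantic=dict(semantic), syntactic=dict(syntactic), url=url)


record = generate_metadata(
    {"category": "chest X-ray", "owner": "radiology"},
    {"format": "image", "compressed": False},
    "file:///data/xray/0001.img",  # assumed example location
)
```

Because the record holds only a URL, the payload can move between storage systems without the metadata structure changing.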

The present invention realizes the concept of a "smart data framework" or "intelligent data framework", in which the semantic and syntactic properties of various kinds of big data are discovered automatically at data access time through a predefined "smart data interface", without prior knowledge of the data.

The method and system of the present invention can be implemented and operated independently of the particular hardware systems, operating systems, network infrastructure, or implementation language used, and regardless of their development and operating environments.

A considerable number of prior-art techniques for automatically generating and managing metadata have been published to address the technical problems of machine learning or big data analytics described above. Many of them are called methods and systems for "self-describing data" or "smart data", and claim to have solved the technical problems of accessing and utilizing the various kinds of big data described above.

However, such prior art has addressed only part of the syntactic aspects of data (its organization and format) and offers no solution to the problems concerning the semantic properties of data (its content and meaning).

The present invention relates to a novel method and system that, by solving the above problems in the semantic as well as the syntactic aspects of data, enables data analysis software or machine learning algorithms to discover the semantic properties of data through a predefined "smart data interface" without prior knowledge of that data.

In conclusion, the present invention relates to a method and system entirely different from any prior art that includes or describes "self-describing data", "smart data", or similar names; the core of the invention is a new method and system for solving the problems concerning the semantic properties (content and meaning) of big data.

The drawings attached below for use in describing embodiments of the present invention show only some of those embodiments; a person of ordinary skill in the art to which the invention pertains (hereinafter, "a person skilled in the art") can obtain other drawings from them without inventive effort.
FIG. 1 shows the overall system architecture for ontology-based big data utilization that accelerates incremental machine learning or deep learning.
FIG. 2 shows a method and overall process for automatically generating metadata that can be accessed and used through a predefined "smart data interface", without prior knowledge of the data and without developing individual data access and preprocessing software for that data.
FIG. 3 shows the process flows and sequences for systematically generating data that can be accessed and used through a predefined "smart data interface" without prior knowledge of the data.

The following detailed description refers to the accompanying drawings, which show by way of example specific embodiments in which the invention may be practiced, in order to make its objects, technical solutions, and advantages clear. These embodiments are described in sufficient detail to enable a person skilled in the art to practice the invention.

Throughout the detailed description and claims, 'learning' refers to performing machine learning, for example deep learning, according to a procedure; a person skilled in the art will understand that it is not intended to refer to mental activity such as human education.

Throughout the detailed description and claims, 'data' refers to all kinds of data, including raw, filtered, preprocessed, curated, analyzed, manipulated, synthesized, and fabricated data. A 'data entity' refers to a useful collection of data that has undergone at least minimal filtering and preprocessing and is organized according to a specific domain or purpose.

Next, throughout the detailed description and claims, the term 'capture data' and its variants indicate that the captured data is the initial raw data generated by a particular machine. The term 'data payload', by contrast, refers to a data entity that has been produced by removing noise and applying initial filtering and has then been formatted to a particular specification. A data payload may include a data header or trailer, and may also be encrypted or compressed. The filtering mentioned here is distinct from the preprocessing performed at data ingestion in the present invention.
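As an illustration of the distinction between captured raw data and a data payload, the following Python sketch filters raw readings, formats them, and prepends a small header. The header layout, the use of JSON and zlib, and the rule that `None` counts as noise are all assumptions for this example, not part of the patent.

```python
import json
import zlib


def build_payload(raw: list, compress: bool = False) -> bytes:
    # "Initial filtering": dropping None readings stands in for noise removal.
    filtered = [x for x in raw if x is not None]
    body = json.dumps(filtered).encode("utf-8")
    if compress:
        body = zlib.compress(body)
    header = json.dumps({"compressed": compress, "length": len(body)})
    # The JSON header contains no newline, so the first newline ends it.
    return header.encode("utf-8") + b"\n" + body


def read_payload(payload: bytes) -> list:
    header_bytes, body = payload.split(b"\n", 1)
    header = json.loads(header_bytes)
    if header["compressed"]:
        body = zlib.decompress(body)
    return json.loads(body)
```

A reader needs only the header to know whether decompression is required, which mirrors how the payload's own framing stays separate from the later ingestion-time preprocessing.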

In addition, 'metadata' as used in the present invention refers to 'semantic and syntactic information about the data payload of a data entity', as described in more detail below.

Throughout the detailed description and claims, the term 'comprise' and its variants are not intended to exclude other technical features, additions, components, or steps. Other objects, advantages, and features of the invention will become apparent to a person skilled in the art partly from this description and partly from practice of the invention. The examples and drawings below are provided by way of illustration and are not intended to limit the invention.

Moreover, the present invention covers all possible combinations of the embodiments presented in this specification. It should be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled.

Unless otherwise indicated herein or clearly contradicted by context, an item referred to in the singular encompasses the plural unless the context requires otherwise. In describing the invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention.

The present invention provides a new method and system for automatically generating and managing metadata of big data for machine learning, based on a data taxonomy and an ontology. It greatly improves the efficiency and effectiveness of machine learning by enabling data analysis software or machine learning algorithms to discover for themselves the content and meaning of big data that is diverse, heterogeneous, distributed, and of low veracity. Here, 'low veracity' means that the quality and/or credibility of big data obtained from the Internet, the Internet of Things (IoT), or cyber-physical system (CPS) devices is itself in question, which implies that additional preprocessing, validation, and curation of that data are required.

The present invention may be practiced in various embodiments. Throughout the detailed description and claims, the "computing entity" mentioned in such embodiments refers to a data processing apparatus, that is, a machine capable of processing large amounts of data at high speed.

For example, a computing entity may include existing von Neumann machines and their accompanying software platforms, and may be a computer, workstation, or portable or mobile computing device that can also be used for general purposes. Such a computing entity may be connected to the Internet or to other networks, for example a local area network (LAN) or a wide area network (WAN). The connection of computing entities to the Internet, an intranet, or an internal network may be wireless or wired. LANs, WANs, intranets, internal networks, and Internet connections need not be described in detail because they are well known to a person skilled in the art. It will be understood that a computing entity may be referred to as a node that is part of a data network, and that such a data network may include a number of nodes interconnected by communication paths.

In other words, a computing entity operates as a node that can exchange data with other nodes in a data network over a wireless or wired communication link, and may be a computer platform that includes a processor executing software. A computing entity may be a mobile device or a stationary device.

The computing entity is not limited to the examples above; a person skilled in the art will understand that the method, program, and system of the invention may also be used with neuromorphic processors, molecular DNA computers, photonic computers, quantum computers, and similar machines developed in the future.

In addition, the computing entities and networks referred to in this specification not only exchange data between nodes; software may also itself visit various nodes in order to carry out the specific purpose assigned to it (mobile software migrating to various system nodes based on its itinerary) and perform specific tasks at each node.

In some other embodiments, the computing entity may be implemented by at least one portable or non-portable computer having one or more databases stored in computer-readable memory. The at least one portable or non-portable computer may also have at least one computing unit or processor programmed with software, and when that software is executed, the sequence of steps contained in it is carried out.

However, a person skilled in the art will understand that such a "computing entity" is not limited to a von Neumann computer, that any apparatus capable of executing all or at least part of the software according to the invention may be a "computing entity", and that the case in which the software as a whole is executed through the interoperation of one or more such computing entities is included as an embodiment of the invention. It is evident that such interoperation takes place over the wireless or wired connections described above.

The present invention implements "smart data" or "intelligent data", whose semantic as well as syntactic properties a machine learning algorithm or data analysis software can discover automatically when ingesting the data without prior knowledge.

The present invention augments prior art that uses metadata to describe the syntactic properties of various kinds of data.

The present invention provides a method that enables machine learning algorithms or data analysis software to access and ingest all kinds of big data transparently, without prior knowledge of the data's content, meaning, structure, or format.

The invention implements a framework that establishes a data taxonomy and an ontology for the semantic categories of big data (semantic properties of data; for example, protein structures, human gene sequences, the three-dimensional structure of rocks brought back from the Moon, or diagnostic images of the brain) based on semantic information provided by domain experts, and that automatically generates and manages metadata based on that ontology.

The "hyper-ontology" referred to in the present invention is an integrated holistic ontology that brings together not only the technical terms and synonyms of a particular domain but also the technical terms and synonyms of other fields related to that domain. For example, a medical hyper-ontology includes technical terms and synonyms not only from medicine and pathology but also from related fields such as biotechnology, genetics, biology, biochemistry, food and nutrition, physics, psychology, philosophy, environmental engineering, theology, sociology, and economics.
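One way to picture a hyper-ontology is as a term index that spans several fields and maps every synonym to a canonical term. The sketch below is a deliberately small Python model; the `HyperOntology` class, its methods, and the sample terms are hypothetical and cover only synonym resolution, not the relationships a full ontology would carry.

```python
class HyperOntology:
    """Toy cross-domain term index; names and structure are illustrative."""

    def __init__(self):
        self._canonical = {}  # lowercase term or synonym -> canonical term
        self._domain = {}     # canonical term -> field it belongs to

    def add_term(self, term, domain, synonyms=()):
        self._domain[term] = domain
        for name in (term, *synonyms):
            self._canonical[name.lower()] = term

    def resolve(self, name):
        # Returns the canonical term, or None if the name is unknown.
        return self._canonical.get(name.lower())

    def domain_of(self, name):
        term = self.resolve(name)
        return self._domain.get(term) if term else None


onto = HyperOntology()
onto.add_term("myocardial infarction", "medicine", ["heart attack", "MI"])
onto.add_term("gene expression", "genetics", ["expression profile"])
```

With such an index, metadata written with any registered synonym can be normalized to one canonical term before it is stored or searched.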

The semantic categories of data correspond to the content and meaning of the data. Examples include what the data represents (such as a chest X-ray, an EEG, heart sounds, gene expression results, a protein structure, a human gene sequence, the three-dimensional structure of a rock brought back from the Moon, or a diagnostic image of the brain), the data owner and version, the date and time the data was created, and access restrictions for data security.

The syntactic properties of big data include data dimensions such as binary, string, numeric sequence, image, XML, tabular, and CSV, and standard data notations such as BSML, MAGE-ML, CDA/HL7, PSI, and DFDL. When data is encrypted for protection and security, the software for decrypting it can be specified as part of the syntactic properties. Likewise, when data is compressed, the software for decompressing it can be specified as part of the syntactic properties.

In this case, depending on the syntactic properties of the big data, the software for data ingestion and preprocessing, for ingesting data in a standard data notation, for decryption, and for decompression can be used dynamically as needed (dynamic software loading and binding on demand).
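The idea of selecting software from the syntactic properties can be sketched in Python as a lookup keyed by those properties. Here a plain dictionary registry stands in for on-demand loading; the property keys `format` and `compression`, and the registry contents, are assumptions for the example.

```python
import csv
import gzip
import io
import json

# Hypothetical registries: syntactic property value -> handler.
PARSERS = {
    "csv": lambda text: list(csv.reader(io.StringIO(text))),
    "json": json.loads,
}
DECOMPRESSORS = {"gzip": gzip.decompress}


def ingest(payload: bytes, syntactic: dict):
    # Pick the decompressor and parser named by the metadata, not by the caller.
    compression = syntactic.get("compression")
    if compression:
        payload = DECOMPRESSORS[compression](payload)
    return PARSERS[syntactic["format"]](payload.decode("utf-8"))
```

The caller never names a parser; it passes only the payload and the syntactic properties read from the metadata, so supporting a new format means registering a new handler rather than changing the consumer.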

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person skilled in the art can readily practice the invention.

As depicted in FIG. 1, applications that use large volumes of heterogeneous data, such as a machine learning algorithm (101) or big data analysis software (102), can access and use a variety of data through a predefined "smart data interface" (103), without separate data access and preprocessing software having to be developed for each kind of data.

The metadata generation and management framework (107), which allows the machine learning algorithm (101) or big data analysis software (102) to access various heterogeneous data through the predefined "smart data interface", consists of a hyper-ontology (104) together with domain-specific controlled vocabularies (105) and a synonym collection (106) that complement the hyper-ontology.

The metadata management framework (107) automatically generates metadata (108) at the same time as it collects data entered by a person or generated by a machine. The metadata (108) contains the semantic and syntactic properties of the corresponding data (110).

Examples of the semantic properties and the syntactic properties of a data entity (110) are shown in table (109).

A data scientist may be involved in defining the hyper-ontology (104), the related vocabularies (105), and the synonym collection (106), and a domain expert may be involved in defining the metadata attributes (109) based on that hyper-ontology.

In an actual implementation, the metadata (108) may contain far more information than the table (109) illustrated in FIG. 1.

The data path or URL for accessing the actual data entity (110) is included in the metadata (108), as shown at the very end of table (109). Because access to a data entity automatically locates its actual storage system and location from the URL, the data entity (110) can be stored anywhere.
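How a URL alone can identify both the storage system and the location within it can be shown with Python's standard `urllib.parse`. The `locate` helper and the example URLs are assumptions for illustration; the patent does not specify how the URL is resolved.

```python
from urllib.parse import urlparse


def locate(url: str) -> dict:
    # The scheme and host identify the storage system; the path identifies
    # the data entity within it.
    parts = urlparse(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


remote = locate("https://storage.example.org/mri/scan42.dat")
local = locate("file:///data/scan42.dat")
```

A consumer can dispatch on the `scheme` value to choose a storage client, so moving a data entity only requires updating the URL in its metadata.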

As depicted in FIG. 2, the method and process for systematically generating and managing metadata for big data that can be accessed, ingested, and used through the "smart data interface" (216) are described as two separate methods and processes: one for data generated automatically by a machine (201) and one for data entered manually by a person (207).

For data generated automatically by a machine (201), the machine's configuration and runtime parameters are provided through a human-machine interface (202). Examples of machine configuration and runtime parameters are shown in table (109) of FIG. 1.

As an example, the machine configuration may be set by a domain expert, and the runtime parameters may be entered by the operator of the machine. More concretely, for a functional magnetic resonance imaging (functional MRI) scanner used in a hospital, a radiology specialist in the radiology department may perform the initial setup and configuration (that is, set the machine configuration), while the MRI operator may enter the runtime parameters needed for a scan (for example, patient information, the date and time of the scan, information about the physician who requested the scan, the purpose of the scan, and the operator's ID).

When new data is generated (203) by such a machine (for example, a functional MRI scanner), metadata for the big data generated by the machine is created automatically (204) according to the machine configuration and runtime parameters.

The data (205) generated by the machine undergoes noise removal and formatting and passes through the big data memory container (212); once a storage system is designated (213), the data path or URL of that storage system is added to the metadata, producing the completed metadata (214).
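The two-stage life of machine-generated metadata, drafted from the machine configuration and runtime parameters and completed once storage is designated, can be sketched as follows. The function names, dictionary keys, and sample values are hypothetical.

```python
def draft_metadata(machine_config: dict, runtime_params: dict) -> dict:
    # Created when the machine emits new data; the storage URL is not known yet.
    return {
        "config": dict(machine_config),
        "runtime": dict(runtime_params),
        "url": None,
    }


def complete_metadata(draft: dict, storage_url: str) -> dict:
    # Called once a storage system has been designated for the payload.
    completed = dict(draft)
    completed["url"] = storage_url
    return completed


draft = draft_metadata(
    {"modality": "functional MRI"},
    {"operator_id": "op-17", "purpose": "follow-up"},
)
completed = complete_metadata(draft, "https://storage.example.org/fmri/scan-0001")
```

Returning a new dictionary from `complete_metadata` leaves the draft's `url` unset, which keeps the two stages distinguishable in this sketch.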

The completed metadata (214) is stored by assigning it a place (215) in the data hierarchy according to the data taxonomy of the corresponding domain. The hierarchy of metadata and data entities is defined by data scientists and data modellers based on the taxonomy of each domain, and is then implemented in the system by data modellers and database specialists based on that hierarchy.

If newly generated data is a new version of data that already exists, the earlier version of the data and its metadata are retained in the system as separate versions, and the latest version is placed at the top of the cascade in the data hierarchy. When several versions of the data exist, access through the smart data interface (216) reaches the latest version. An older version can be accessed through the smart data interface (216) by specifying a version number or a creation time.
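The versioning behaviour described above, latest by default and older versions on request, can be modelled with a small Python class. `VersionedStore` and its sequential version numbers are assumptions; lookup by creation time is omitted to keep the sketch short.

```python
class VersionedStore:
    """Keeps every version of a metadata record; the newest is the default."""

    def __init__(self):
        self._versions = {}  # name -> list of (version, metadata), oldest first

    def add(self, name, metadata):
        versions = self._versions.setdefault(name, [])
        version = len(versions) + 1  # assumed numbering scheme: 1, 2, 3, ...
        versions.append((version, metadata))
        return version

    def get(self, name, version=None):
        versions = self._versions[name]
        if version is None:
            return versions[-1][1]  # latest version sits at the top
        for number, metadata in versions:
            if number == version:
                return metadata
        raise KeyError(f"{name} has no version {version}")


store = VersionedStore()
store.add("scan-0001", {"note": "first acquisition"})
store.add("scan-0001", {"note": "re-processed"})
```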

For human-entered data (207), when data entry (210) begins, the system captures (209) information about the semantic properties and syntactic attributes of the data being entered by requesting it through the human-machine interface (208). Based on this human-supplied information (209), metadata (212) for the manually entered data is generated automatically.

The human-entered data (210) is formatted according to the input device and then captured as input data (211).

The subsequent process flow (212 to 219) is the same as the flow described above for machine-generated data, so its description is omitted.

Data analysis software or a machine learning algorithm (219) accesses data through the smart data interface (216). It may locate metadata with a conventional keyword search, but in the system according to the present invention it may also access the relevant metadata directly by following (215) the taxonomy-based hierarchical tree structure.
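The two access routes mentioned above, keyword search and direct navigation along the taxonomy tree, can be contrasted in a short sketch. The tree class, the function names, and the example taxonomy paths are hypothetical and chosen only for illustration.

```python
class TaxonomyNode:
    """Hypothetical node of a taxonomy-based hierarchical tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}   # child name -> TaxonomyNode
        self.metadata = []   # metadata records placed at this node

    def child(self, name):
        """Return the named child, creating it if needed."""
        if name not in self.children:
            self.children[name] = TaxonomyNode(name)
        return self.children[name]


def place(root, path, record):
    """Store a metadata record at the position given by a taxonomy path."""
    node = root
    for part in path:
        node = node.child(part)
    node.metadata.append(record)


def lookup_by_path(root, path):
    """Direct access: follow the taxonomy path and return the records found there."""
    node = root
    for part in path:
        node = node.children.get(part)
        if node is None:
            return []
    return list(node.metadata)


def keyword_search(root, keyword):
    """Conventional access: scan every record in the tree for a keyword."""
    keyword = keyword.lower()
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        for record in node.metadata:
            text = " ".join(str(v) for v in record.values()).lower()
            if keyword in text:
                hits.append(record)
        stack.extend(node.children.values())
    return hits


root = TaxonomyNode("root")
place(root, ["medical imaging", "brain", "PET scan"], {"title": "Brain PET scan 0001"})
place(root, ["medical imaging", "chest", "X-ray"], {"title": "Chest X-ray 0042"})
```

In this sketch the path lookup touches only the nodes along one branch, whereas the keyword search visits every node; that difference is the practical motivation for offering both routes.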

A data consumer, such as data analysis software or a machine learning algorithm, analyzes the metadata to retrieve the semantic properties of the data (217).

The syntactic attributes of the data are retrieved from the metadata, and the corresponding syntax analyzer (218) and data parser are automatically loaded from a library in order to access and acquire the required data (213).
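One way to picture the automatic parser selection described above is a registry keyed by the format named in the metadata. The sketch below is an assumption-laden illustration: the registry, the `acquire` function, the metadata layout, and the `mem://` locations are hypothetical, and JSON and CSV parsers from the Python standard library stand in for the domain-specific parsers the description has in mind.

```python
import csv
import io
import json

# Hypothetical parser library: format name -> parsing function.
PARSERS = {}


def register_parser(fmt):
    """Decorator that adds a parsing function to the library under a format name."""
    def decorator(func):
        PARSERS[fmt] = func
        return func
    return decorator


@register_parser("json")
def parse_json(raw: bytes):
    return json.loads(raw.decode("utf-8"))


@register_parser("csv")
def parse_csv(raw: bytes):
    return list(csv.reader(io.StringIO(raw.decode("utf-8"))))


def acquire(metadata, fetch):
    """Read the format from the metadata, pick the parser, then fetch and parse the data.

    `fetch` is any callable that maps a location string to raw bytes.
    """
    fmt = metadata["syntactic"]["format"]
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise LookupError(f"no parser registered for format {fmt!r}") from None
    return parser(fetch(metadata["location"]))


# In-memory stand-in for a storage system, for illustration only.
store = {
    "mem://a": b'{"x": 1}',
    "mem://b": b"a,b\n1,2\n",
}
json_data = acquire({"syntactic": {"format": "json"}, "location": "mem://a"}, store.__getitem__)
csv_data = acquire({"syntactic": {"format": "csv"}, "location": "mem://b"}, store.__getitem__)
```

The caller never names a parser; it only hands over the metadata, which is the point of the mechanism described in the text.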

The detailed process flow, together with the individual steps and sequences for systematically generating and managing metadata, is depicted in FIG. 3. Because the process flow and the individual steps were already described in detail above with reference to FIG. 2, they are not repeated here. Instead, the key technical innovations of the present invention are summarized below.

The essence of the present invention is to enable data analysis software or machine learning algorithms, without prior knowledge of the data, to programmatically discover both the syntactic attributes and the semantic properties of the data at the time of data access.

The present invention uses a unified indexing mechanism to provide transparent access to diverse, heterogeneous, high-dimensional big data. Combined with the data classification scheme and ontology-based dynamic metadata management, this unified indexing mechanism enables not only transparent access to and use of multidimensional, heterogeneous big data but also dynamic generation and management of metadata.

In addition, the present invention systematically manages the semantic properties of data, such as its semantic category (e.g., protein structure, human gene sequence, crystallographic properties of lunar rock, brain PET scan, chest X-ray, EEG, heart-sound audio, gene expression data) and its owner and version (e.g., creation date and time, institution), by generating ontology-based metadata. Furthermore, so that syntactic attributes such as syntactic characteristics (e.g., binary, string, image, XML) and data formats (e.g., BSML, MAGE-ML, CDA/HL7, PSI, DFDL) can be discovered, the invention dynamically invokes the appropriate data parsers as needed. Data analysis software and machine learning algorithms therefore need no dedicated access and preprocessing software for each kind of data, which improves their efficiency and effectiveness.

The metadata may be expressed in XML or a similar notation so that it is readable not only by data analysis software or machine learning algorithms but also directly by humans.
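As a rough illustration of metadata that is readable by both programs and people, the sketch below writes and reads a small XML record with Python's standard `xml.etree.ElementTree` module. The element names (`metadata`, `semantic`, `syntactic`, `location`) and the sample values are assumptions; the description does not define a schema. The sketch also assumes that every key is a valid XML tag name and every value is a string.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(semantic: dict, syntactic: dict, location: str) -> str:
    """Serialize a metadata record to an XML string (hypothetical element names)."""
    root = ET.Element("metadata")
    for section_name, section in (("semantic", semantic), ("syntactic", syntactic)):
        section_el = ET.SubElement(root, section_name)
        for key, value in section.items():
            ET.SubElement(section_el, key).text = str(value)
    ET.SubElement(root, "location").text = location
    return ET.tostring(root, encoding="unicode")


def xml_to_metadata(text: str) -> dict:
    """Parse the XML string back into plain dictionaries."""
    root = ET.fromstring(text)
    return {
        "semantic": {el.tag: el.text for el in root.find("semantic")},
        "syntactic": {el.tag: el.text for el in root.find("syntactic")},
        "location": root.findtext("location"),
    }


# Sample values are illustrative only.
xml_text = metadata_to_xml(
    {"category": "brain PET scan", "owner": "Example Institute"},
    {"type": "binary", "format": "example-format"},
    "https://storage.example.org/pet/0001",
)
record = xml_to_metadata(xml_text)
```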

It should be evident that the technical idea of the present invention described so far is not to be construed as limited to practical use cases such as the examples given above.

Claims

A method in which at least one computing entity automatically and systematically generates and manages, at the time of data generation, metadata for the semantic properties and syntactic attributes of big data based on a data classification scheme (taxonomy) and an ontology, wherein:
(a) metadata for machine-generated data is automatically generated and managed based on the machine's configuration and operating parameters; and
(b) metadata for human-generated data is generated and managed based on human input obtained through human-machine interaction.

The method according to claim 1,
characterized in that an integrated, holistic hyper-ontology is used that fuses:
(i) jargon and synonyms of a particular domain; and
(ii) jargon and synonyms of other fields related to the particular domain.

The method of claim 2,
wherein the metadata includes:
(a) metadata defining the semantic properties of the big data based on the hyper-ontology associated with the domain; and
(b) metadata defining the syntactic attributes of the big data.

The method according to claim 1, wherein the computing entity stores and manages the big data and the corresponding metadata in a hierarchical structure based on the data classification scheme, the method comprising:
(a) the computing entity performing at least one of a process of searching for the metadata using a conventional search mechanism based on a keyword or phrase, and a process of searching for the metadata directly along the hierarchical tree structure;
(b) when multiple versions of data exist, the computing entity managing all versions of the data and the associated metadata in a common stack and placing the latest version of the data and its metadata at the top of the stack, so that the latest version is accessed automatically unless a specific version is requested; and
(c) the computing entity addressing the data directly via a universal resource locator, regardless of the actual physical location of the data.

The method according to claim 1,
wherein a smart data interface is implemented that allows machine learning or data analysis software to access big data without developing separate access and preprocessing software for each specific kind of data and without prior knowledge of the data.

The method according to claim 1,
wherein a human-machine interface is implemented for use when accessing diverse big data without prior knowledge.

The method according to claim 1,
wherein an integrated, independent software system for accessing and utilizing big data is implemented regardless of the hardware platform, operating system, or programming model used as the computing entity.

A program recorded on a machine-readable non-volatile storage medium, comprising instructions that direct the computing entity to perform the method of any one of claims 1-7.