KR102100346B1

KR102100346B1 - Apparatus and method for managing dataset

Info

Publication number: KR102100346B1
Application number: KR1020190106619A
Authority: KR
Inventors: 전종훈
Original assignee: (주)프람트테크놀로지
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2020-04-14

Abstract

According to one embodiment of the present invention, an apparatus for managing datasets comprises: a metadata extraction unit that extracts metadata from each of a plurality of datasets; a similarity calculation unit that calculates the similarity between the plurality of datasets based on the degree of similarity between the metadata; an interlinking calculation unit that calculates the interlinking between the plurality of datasets based on whether the datasets refer to one another; a semantic association calculation unit that calculates the semantic association between the plurality of datasets based on the semantic information of each item of metadata; and a relation map generation unit that generates a relation map displaying the relationships between the plurality of datasets based on at least one of the calculated similarity, the calculated interlinking, and the calculated semantic association. A researcher may view the plurality of datasets at a glance through the relation map.

Description

Data set management device and method {APPARATUS AND METHOD FOR MANAGING DATASET}

The present invention relates to a dataset management apparatus and method.

With the development of the Internet, various types of data are being produced in large quantities, and technologies that make use of such data are also advancing. Examples include data mining, which involves refining or classifying data on the basis of big data, and machine learning techniques, which train a module on data so that it performs a predetermined function.

For the data utilization technologies described above to advance, it is desirable that a researcher be able to recognize at a glance what kinds of data have been secured. Furthermore, it is desirable that the researcher recognize at a glance not only which kinds of data are available but also how the various kinds of data relate to one another.

Accordingly, technologies that collect and refine various types of data and then list them or make them viewable at a glance are being developed and researched.

Korean Patent Laid-Open Publication No. 10-2005-0084866 (published August 29, 2005)

The problem to be solved by the present invention is to propose a technique for managing a plurality of bundles of data having related topics (that is, datasets).

For example, the problem to be solved by the present invention may include providing a technique that extracts the similarity, interlinking degree, and semantic association that a plurality of such datasets have with one another, generates a relation map displaying the relationships between the datasets based on at least one of the extracted similarity, interlinking degree, and semantic association, and provides the relation map to a researcher or the like.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned here will be clearly understood from the following description by a person of ordinary skill in the art to which the present invention pertains.

A dataset management apparatus according to one embodiment includes: a metadata extraction unit that extracts metadata from each of a plurality of datasets; a similarity calculation unit that calculates the similarity between the plurality of datasets based on the degree of similarity between the metadata; an interlinking calculation unit that calculates the degree of interlinking between the plurality of datasets based on whether the datasets refer to one another; a semantic association calculation unit that calculates the semantic association between the plurality of datasets based on the semantic information of each item of metadata; and a relation map generation unit that generates a relation map displaying the relationships between the plurality of datasets based on at least one of the calculated similarity, the calculated interlinking degree, and the calculated semantic association.

In addition, each of the plurality of datasets may include a plurality of pieces of data whose topics are related.

In addition, the plurality of pieces of data may include content corresponding to each of at least two items, and the metadata extraction unit may extract at least some of the at least two items, together with at least some of the content corresponding to each of them, as metadata for the corresponding dataset.

In addition, the similarity between the metadata may be calculated as the cosine similarity or the Jaccard similarity between vectors computed for each item of metadata.

In addition, whether the plurality of datasets refer to one another may be determined by whether they are used together in the process of deriving a predetermined result.

In addition, the dataset management apparatus may further include an interface unit that receives a selection of which of the calculated similarity, the calculated interlinking degree, and the calculated semantic association is to serve as the basis for generating the relation map, and the relation map may be generated based on whichever of these is selected through the interface unit.

In addition, the dataset management apparatus may further include an interface unit that receives at least one of a first threshold for the similarity, a second threshold for the interlinking degree, and a third threshold for the semantic association. The relation map generation unit may then generate the relation map based on at least one of: the calculated similarities that are equal to or greater than the first threshold, the calculated interlinking degrees that are equal to or greater than the second threshold, and the calculated semantic associations that are equal to or greater than the third threshold.

In addition, the relation map may have a concentric structure in which a plurality of circles of different diameters share a common center, with one of the plurality of datasets placed at the center of the concentric circles and the remaining datasets placed on the circles surrounding that center.

In addition, the apparatus may further include an interface unit that receives the number of circles constituting the concentric circles.

In addition, the dataset management apparatus may further include a keyword extraction unit that extracts keywords from each of the plurality of datasets. In that case, the similarity calculation unit calculates the similarity between the keywords, the interlinking calculation unit calculates the interlinking degree between the keywords, the semantic association calculation unit calculates the semantic association between the keywords, and the relation map generation unit may generate a relation map displaying the relationships between the keywords based on at least one of the similarity, interlinking degree, and semantic association calculated between the keywords.

A dataset management method performed by a dataset management apparatus according to one embodiment includes: extracting metadata from each of a plurality of datasets; calculating the similarity between the plurality of datasets based on the degree of similarity between the metadata; calculating the degree of interlinking between the plurality of datasets based on whether the datasets refer to one another; calculating the semantic association between the plurality of datasets based on the semantic information of each item of metadata; and generating a relation map displaying the relationships between the plurality of datasets based on at least one of the calculated similarity, the calculated interlinking degree, and the calculated semantic association.

According to one embodiment, a relation map covering a plurality of datasets can be provided to a researcher or the like, so the researcher can survey the plurality of datasets at a glance through the relation map.

FIG. 1 is a system configuration diagram showing the overall configuration of a network in which a dataset management apparatus according to one embodiment operates.
FIG. 2 is an exemplary conceptual diagram of a relation map that the dataset management apparatus may provide according to one embodiment.
FIG. 3 is an exemplary configuration diagram of the dataset management apparatus according to one embodiment.
FIG. 4 is an exemplary conceptual diagram of a screen that the dataset management apparatus according to one embodiment displays to a researcher or the like.
FIG. 5 is an exemplary flowchart of a dataset management method according to one embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains is fully informed of the scope of the invention; the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, a detailed description of a known function or configuration is omitted where it is judged that it could unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 is a system configuration diagram showing the overall configuration of a network in which the dataset management apparatus 100 according to one embodiment operates. Before FIG. 1 is described, note that the dataset management apparatus 100 may be implemented on a device such as a personal computer, a smart device, a server, or a cloud.

Referring to FIG. 1, the dataset management apparatus 100 may be connected to various types of servers 200 and 210 through a network 300 and may receive various types of datasets from these servers 200 and 210.

Here, the network 300 means a known network that can be implemented in a wired or wireless manner.

In addition, the various types of servers 200 and 210 described above may disclose the datasets they hold to the public, or may disclose them only to parties with specific authority. In the following, a dataset disclosed to the public is referred to as an open dataset, and a dataset disclosed only to parties with specific authority rather than to the public is referred to as a non-open dataset.

Meanwhile, a dataset refers to a conceptual group or set containing a plurality of pieces of data that have related topics. For example, one dataset may contain data related to forest fires in 2018, and another dataset may contain data on the average height and weight of adolescents in the 2000s.

According to one embodiment, the dataset management apparatus 100 may obtain a plurality of datasets of various types from the servers 200 and 210, calculate at least one of the similarity, the interlinking degree, and the semantic association between these datasets (the interlinking degree and the semantic association are described later), generate a relation map based on at least one of the calculated values, and provide the relation map to various parties.

Here, the relation map refers to a kind of graph that shows how similar the plurality of datasets are to one another, which other datasets each dataset refers to in order to be used for a given purpose, and to what degree the plurality of datasets are semantically (linguistically) associated.

Through this relation map, a researcher or the like can survey at a glance the various types of datasets scattered across many places and, on that basis, can efficiently and effectively select the datasets needed for his or her own research. In general, data is meaningful in itself, but it becomes more meaningful and more useful for research when it is combined with or related to other data. According to one embodiment, the degree to which the various scattered datasets are combined with or related to one another can be grasped at a glance, which can be regarded as an effect of the embodiment of the present invention.

FIG. 2 conceptually illustrates a relation map according to this embodiment and the elements on which the relation map is based.

Referring to FIG. 2, there are three planes 20, 30, and 40. The first plane 20 is the plane in which a plurality of datasets 21 exist. The datasets 21 are connected to one another by similarity, interlinking degree, and semantic association, and in FIG. 2 these three relations are distinguished by the shape of the line connecting the datasets 21. For example, similarity is drawn as a solid line, the interlinking degree as a dotted line, and semantic association as a spiral line.

The second plane 30 is the plane in which a plurality of items of metadata 31 exist. The items of metadata 31 are likewise connected to one another by the similarity, interlinking degree, and semantic association described above, and in FIG. 2 these relations are distinguished by the shape of the line connecting the metadata 31. For example, similarity is drawn as a solid line, the interlinking degree as a dotted line, and semantic association as a spiral line.

Here, each item of metadata 31 may be extracted from a corresponding dataset 21 in the first plane 20. The definition of the metadata 31 and the specific method of extracting it from a dataset 21 are described later.

The third plane 40 is the plane containing the relation map according to one embodiment, in which a plurality of datasets 41 are arranged on concentric circles. Unlike the datasets 21 in the first plane 20, each dataset 41 in the third plane 40 may indicate in a predetermined manner how much data it contains, and may also indicate in a predetermined manner whether it is an open dataset or a non-open dataset. In the third plane 40 as well, the datasets 41 are connected to one another by similarity, interlinking degree, and semantic association, and in FIG. 2 these relations are distinguished by the shape of the line connecting the datasets 41. For example, similarity is drawn as a solid line, the interlinking degree as a dotted line, and semantic association as a spiral line.

The process by which the relation map is generated is briefly as follows. First, a plurality of datasets 21 are prepared as in the first plane 20. Next, metadata 31 is extracted from each dataset 21 and arranged as in the second plane 30. The similarity, interlinking degree, and semantic association described above are then calculated between the extracted items of metadata 31. Based on these calculated values, the items of metadata 31 are connected to one another as shown in FIG. 2. Finally, based on how the items of metadata 31 are connected, the datasets 21 in the first plane 20 are connected to one another as shown in FIG. 2.

Lastly, when a request is received from a researcher or user, a dataset matching the request is selected, and a relation map with the selected dataset at the center of the concentric circles is generated as shown in the third plane 40 of FIG. 2. The datasets 41 arranged on the concentric circles are connected so that their similarity, interlinking degree, and semantic association are displayed, based on how the datasets 21 are connected to one another in the first plane 20.

What is displayed or provided to the researcher or user may be the relation map shown in the third plane 40 of FIG. 2. Through the displayed relation map, the researcher or user can survey a plurality of datasets at a glance and derive meaning from them.

Hereinafter, the specific configuration of the dataset management apparatus 100, the component that generates the relation map discussed so far, is described in more detail.

FIG. 3 is a schematic configuration diagram of the dataset management apparatus 100 according to one embodiment. Referring to FIG. 3, the dataset management apparatus 100 may include a storage unit 110, a metadata extraction unit 120, a similarity calculation unit 130, an interlinking calculation unit 140, a semantic association calculation unit 150, a relation map generation unit 160, an interface unit 170, a control unit 180, and a display unit 190. Depending on the embodiment, however, it may omit at least one of these components or may additionally include components not mentioned here.

First, the dataset management apparatus 100 and each component included in it can be implemented by a memory that stores instructions programmed to perform the functions described below and a microprocessor that executes those instructions.

Among these components, the storage unit 110 may store various types of datasets. The stored datasets may have been obtained from outside, for example by the dataset management apparatus 100 from the first server 200 or the second server 210 shown in FIG. 1. The datasets stored in the storage unit 110 can be updated.

The metadata extraction unit 120 may extract metadata from each of the plurality of datasets stored in the storage unit 110.

Here, metadata refers to information that describes the attributes of each dataset. Examples include the name of the organization that provided the dataset, the name of the managing department, the registration date, and the modification date, as well as the number of files in the dataset, their extensions, and the number of views or downloads.

The metadata extraction unit 120 may employ various methods to extract such metadata from a dataset. For example, suppose a dataset contains a plurality of pieces of data, one of which is a list, in Excel file format, of the remaining pieces of data. The list enumerates the remaining data, provides items that describe the attributes of each piece of data, and contains the content corresponding to each item. The metadata extraction unit 120 may then extract at least some of these items, and at least some of the content corresponding to each item, as metadata. Since the technique of extracting items or content from an Excel file is itself well known, its description is omitted.
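The extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the listing has been exported to CSV text rather than read from an Excel file, and the function name `extract_metadata`, the column names, and the sample rows are all hypothetical.

```python
import csv
import io

def extract_metadata(listing_csv, wanted_items):
    """For each row of the listing, keep only the wanted items and their content.

    Assumption: the listing is CSV text whose header row names the items.
    """
    reader = csv.DictReader(io.StringIO(listing_csv))
    return [
        {item: row[item] for item in wanted_items if item in row}
        for row in reader
    ]

# Hypothetical listing: one row per piece of data in the dataset.
listing = """file,provider,registered,format
fires_2018.csv,Agency A,2019-01-15,csv
fires_map.json,Agency A,2019-02-01,json
"""

# Extract only some of the items, as the text allows ("at least some").
metadata = extract_metadata(listing, ["provider", "format"])
print(metadata)
```

Selecting a subset of columns mirrors the statement that only some of the items and some of their content need to be taken as metadata.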

The similarity calculation unit 130 calculates the similarity between datasets based on the degree of similarity between their metadata. Being based on the degree of similarity between metadata may mean that the metadata of each dataset is expressed as a vector and the similarity between these vectors is measured, although it is not limited to this. The degree of similarity between metadata may be determined by, for example, the cosine similarity or the Jaccard similarity between the vectors representing the metadata, but is not limited to these.
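The two measures named above can be sketched as follows. The text does not say how metadata is turned into a vector, so this sketch assumes each dataset's metadata is reduced to a set of terms and encoded as a binary vector over the shared vocabulary; the function names and sample terms are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical metadata terms for two datasets.
meta_a = {"seoul", "subway", "csv"}
meta_b = {"seoul", "restaurant", "csv"}

# Assumed encoding: binary vector over the sorted shared vocabulary.
vocabulary = sorted(meta_a | meta_b)
vec_a = [1 if term in meta_a else 0 for term in vocabulary]
vec_b = [1 if term in meta_b else 0 for term in vocabulary]

cos = cosine_similarity(vec_a, vec_b)
jac = jaccard_similarity(meta_a, meta_b)
print(jac)  # 0.5 (2 shared terms out of 4 distinct terms)
```

Jaccard similarity works directly on sets, while cosine similarity also accepts weighted vectors, so the choice depends on whether term weights are available.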

The interlinking calculation unit 140 calculates the interlinking degree based on whether the datasets refer to one another or have previously referred to one another. For example, if a first dataset is a subway map of the Seoul area and a second dataset is a list of restaurants in the Seoul area, then finding a "list of restaurants in the Gangnam area of Seoul" requires the first and second datasets to be used with reference to each other. The interlinking degree can be calculated based on whether such referencing occurs. To calculate it, information on which of the plurality of datasets have previously been used together to achieve a given purpose may be collected and used.
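One simple way to turn such co-usage information into a score is sketched below. The text does not define a formula, so this sketch assumes the interlinking degree of a pair is the number of past tasks in which both datasets were used; the log format, function name, and dataset names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def interlinking_degrees(usage_log):
    """Count, for every pair of datasets, how many past tasks used both.

    Assumption: usage_log is a list of tasks, each a list of dataset names.
    Pairs are stored with names in sorted order so (a, b) and (b, a) match.
    """
    counts = Counter()
    for datasets_used in usage_log:
        for pair in combinations(sorted(set(datasets_used)), 2):
            counts[pair] += 1
    return counts

# Hypothetical log of past tasks and the datasets each one drew on.
usage_log = [
    ["subway_map", "restaurants"],
    ["subway_map", "restaurants", "population"],
    ["population"],
]

degrees = interlinking_degrees(usage_log)
print(degrees[("restaurants", "subway_map")])  # 2
```

Because `Counter` returns 0 for missing keys, pairs that were never used together naturally get an interlinking degree of 0.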

The semantic association calculation unit 150 calculates the semantic association between datasets based on the semantic information of each item of metadata. The semantic association is not based simply on whether the metadata is identical; it is a measure of how identical or similar the meanings of the metadata, that is, their semantic information, are. The semantic association may be calculated by a known module designed to extract semantic information from metadata.

The relation map generation unit 160 generates a relation map that displays the relationships between at least some of the plurality of datasets. Here, "relationship" means at least one of the similarity, interlinking degree, and semantic association calculated as described above. Since the process of generating the relation map was described with reference to FIG. 2, it is not repeated here.

The interface unit 170 is a component that accepts and processes various kinds of requests or inputs from a user or researcher of the dataset management apparatus 100 or from an external party. For example, the interface unit 170 may refer not only to a touch screen, keyboard, microphone, or mouse but also to a device or component that processes the signals from such input devices to determine their meaning.

Through the interface unit 170, various functions related to the relation map may be performed in the dataset management apparatus 100 according to one embodiment.

For example, the interface unit 170 may receive input specifying which of the similarity, interlinking degree, and semantic association described above is to be used in generating the relation map.

In addition, the interface unit 170 may receive a criterion so that only data meeting or exceeding that criterion are selected for generating the relationship map. For example, the interface unit 170 may receive a first threshold, a second threshold, and a third threshold for the similarity, the linkage, and the semantic association, respectively. In this case, rather than generating the relationship map from all of the plurality of datasets, the relationship map generation unit 160 may generate it using only those datasets that are connected with a similarity, linkage, and semantic association at or above the above-described thresholds.
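The threshold-based selection can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Relation` record, its field names, and the `require_all` switch are assumptions introduced here. The claims speak of "at least one of" the thresholded measures, so the sketch keeps a pair of datasets when any supplied threshold is met by default, and `require_all=True` shows the stricter reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Scores between two datasets (hypothetical record, for illustration only)."""
    source: str
    target: str
    similarity: float
    linkage: float
    semantic: float


def filter_relations(relations, first=None, second=None, third=None, require_all=False):
    """Keep relations whose scores reach the supplied thresholds.

    first/second/third are the thresholds for similarity, linkage, and
    semantic association; a threshold left as None is ignored.
    """
    checks = [(first, "similarity"), (second, "linkage"), (third, "semantic")]
    active = [(t, field) for t, field in checks if t is not None]
    if not active:
        # No threshold was entered, so every relation is used.
        return list(relations)
    combine = all if require_all else any
    return [
        r for r in relations
        if combine(getattr(r, field) >= t for t, field in active)
    ]


relations = [
    Relation("A", "B", similarity=0.9, linkage=0.1, semantic=0.2),
    Relation("A", "C", similarity=0.3, linkage=0.0, semantic=0.1),
]
kept = filter_relations(relations, first=0.5)  # only the A-B relation remains
```

Only the relations that survive the filter would then be handed to the relationship map generation step.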

Furthermore, when the relationship map has a concentric-circle structure, the interface unit 170 may receive the maximum number of circles forming the concentric circles. Comparing a maximum of 10 circles with a maximum of 5, the relationship map is larger in the former case, and a greater volume or number of datasets can be tied into the relationship map and displayed. Of course, in some cases the researcher or user may want a concentric relationship map with fewer circles; in such a case, the researcher or user can enter the desired number of circles through the interface unit 170.
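A concentric layout bounded by a maximum number of circles might look like the sketch below. The text does not state how a dataset is assigned to a particular circle, so the rule used here is an assumption: scores are taken to lie in [0, 1], and a stronger relationship to the central dataset places a dataset on a circle closer to the center. The function name `concentric_layout` and the `ring_gap` parameter are hypothetical.

```python
import math


def concentric_layout(center, scores, max_circles, ring_gap=1.0):
    """Return {dataset name: (x, y)} with `center` at the origin.

    scores maps each other dataset to its relationship score with the
    center, assumed to lie in [0, 1]. Higher scores land on inner circles.
    """
    if max_circles < 1:
        raise ValueError("max_circles must be at least 1")
    positions = {center: (0.0, 0.0)}
    rings = {}
    for name, score in scores.items():
        clamped = min(max(score, 0.0), 1.0)
        # Map the score to a ring index in 1..max_circles (assumed rule).
        ring = min(max_circles, 1 + int((1.0 - clamped) * max_circles))
        rings.setdefault(ring, []).append(name)
    for ring, names in rings.items():
        radius = ring * ring_gap
        # Spread the datasets of one ring evenly around its circle.
        for i, name in enumerate(sorted(names)):
            angle = 2 * math.pi * i / len(names)
            positions[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


layout = concentric_layout("A", {"B": 1.0, "C": 0.5, "D": 0.0}, max_circles=2)
```

With a larger `max_circles`, weakly related datasets are pushed onto outer circles, which matches the observation that a map with more circles is larger.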

The controller 180 may be used for various kinds of control related to the relationship map. For example, the dataset management apparatus 100 may operate while its operating mode is toggled between a data mode and a keyword mode (keywords are described later). When the researcher or user selects the data mode as the operating mode through the interface unit 170, the controller 180 controls the respective components (110 to 190, etc.) so that a relationship map is generated accordingly; when the researcher or user instead selects the keyword mode through the interface unit 170, the controller 180 controls the respective components (110 to 190, etc.) so that a relationship map is generated for that mode.
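The mode toggling described above can be pictured with a small sketch. The `Mode` enum and the `Controller` class are hypothetical names introduced for illustration; the sketch only shows the controller choosing between dataset nodes and keyword nodes, and leaves the actual map generation out.

```python
from enum import Enum


class Mode(Enum):
    DATA = "data"
    KEYWORD = "keyword"


class Controller:
    """Hypothetical stand-in for the controller's mode handling."""

    def __init__(self, dataset_nodes, keyword_nodes):
        self.mode = Mode.DATA
        self._nodes = {
            Mode.DATA: list(dataset_nodes),
            Mode.KEYWORD: list(keyword_nodes),
        }

    def toggle(self):
        """Switch between the data mode and the keyword mode."""
        self.mode = Mode.KEYWORD if self.mode is Mode.DATA else Mode.DATA
        return self.mode

    def nodes_for_map(self):
        """Return the nodes the relationship map should be built from."""
        return self._nodes[self.mode]


controller = Controller(["wildfire-2019"], ["wildfire", "2019"])
controller.toggle()  # the map is now built from keywords
```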

In addition, the controller 180 may cause various kinds of services to be provided in connection with the relationship map according to the position of the user's mouse pointer. For example, it may cause a legend or the like to be provided at a specific position, may cause a window for searching keywords or datasets to be displayed, and may generate and provide an area in which search results are displayed, although the services are not limited to these. Examples of such services provided by the controller 180 are described in more detail with reference to FIG. 4.

The display unit 190 is a device that displays the relationship map and the various relationship-map-related components provided by the controller 180, and may be implemented as a monitor, an LCD panel, or the like.

The keyword extraction unit 191 extracts keywords from the data included in a dataset. Here, a keyword refers to the information that the data included in the dataset substantially carry. For example, if a dataset concerns wildfire information for Gyeongsangbuk-do in 2019, the keywords may be 'wildfire', '2019', 'Gyeongsangbuk-do', and the like, although they are not limited to these. Since the technique of extracting keywords from data is itself known, a detailed description is omitted.

Once the keyword extraction unit 191 has extracted keywords, the similarity calculation unit 130, the linkage calculation unit 140, and the semantic association calculation unit 150 may calculate the similarity, linkage, and semantic association between the keywords, respectively; the specific method may be the same as that applied to the metadata or datasets.

In addition, the relationship map generation unit 160 may generate and provide a relationship map for these keywords rather than for the datasets. That is, the relationship map generation unit 160 may generate and provide a relationship map showing the similarity, linkage, semantic association, and the like between keywords.

As described above, a relationship map representing the relationships among a plurality of datasets according to an embodiment may be generated and provided to a researcher or user, who can control or manipulate the relationship map into a desired form and obtain the desired information from it.

Hereinafter, an example of a screen displaying a relationship map and information related to it is described with reference to FIG. 4. However, FIG. 4 is merely an example, and the spirit of the present invention is not to be construed as limited to what is shown in FIG. 4.

Referring to FIG. 4, a relationship map is displayed. This relationship map may be the relationship map described so far. Here, each dataset may be represented in the relationship map by a circle, and the size and type of the circle may indicate whether the dataset is an open dataset or a non-open dataset, or how much data the dataset contains.
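One way to derive a circle's appearance from a dataset is sketched below. The concrete encoding is an assumption made for illustration: the text only says that the size and type of the circle may indicate openness or data volume, so the logarithmic radius, the solid or dashed outline, and the names `DatasetInfo` and `node_style` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    name: str
    is_open: bool      # open dataset versus non-open dataset
    size_bytes: int    # amount of data the dataset contains


def node_style(info, min_radius=4.0, scale=2.0):
    """Return drawing attributes for the circle representing a dataset."""
    # Logarithmic scaling keeps very large datasets from dominating the map.
    radius = min_radius + scale * math.log10(1 + info.size_bytes)
    return {"radius": radius, "line": "solid" if info.is_open else "dashed"}


small = node_style(DatasetInfo("rainfall", is_open=True, size_bytes=0))
large = node_style(DatasetInfo("wildfire-2019", is_open=False, size_bytes=5_000_000))
```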

In addition, when a mouse pointer or the like, moved via the interface unit 170, comes to rest over a specific dataset, the controller 180 may display the metadata of that dataset in a pop-up; FIG. 4 shows, by way of example, where on the screen the metadata pop-up appears.

The components of the screen are now described one by one, starting from the upper left.

The 'mode toggle button' is a means for setting the operating mode of the dataset management apparatus 100, for example between the data mode and the keyword mode (keywords were described above), and may be implemented through the interface unit 170 described above. When the researcher or user selects the data mode as the operating mode, the respective components (110 to 190, etc.) are controlled so that a relationship map is generated accordingly; when the researcher or user instead selects the keyword mode through the interface unit 170, the respective components (110 to 190, etc.) may be controlled so that a relationship map is generated for that mode.

The 'Dataset' button is provided for performing dataset-related functions. For example, clicking 'Dataset' expands it to list information about the datasets, and the user can use this list to select a specific dataset.

'Classification system', 'Year', 'File extension', 'Region', 'Scope of permitted use', and the like are likewise buttons provided so that predetermined functions related to the datasets or the relationship map are performed.

Meanwhile, the 'search field' is a component provided to receive a desired dataset or keyword from the researcher or user, and is implemented by the interface unit 170. When the researcher or user enters a dataset or keyword, the controller 180 selects, from among the datasets stored in the storage unit 110 and the keywords they contain, the datasets or keywords corresponding to the input, and the relationship map generation unit 160 may generate a relationship map based on the selected datasets or keywords.
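The selection step behind the search field can be sketched as a simple lookup. The matching rule is an assumption, since the text does not define what "corresponding to the input" means: here a dataset is selected when the query appears in its name or equals one of its keywords, ignoring case. The function name `select_for_query` and the dictionary layout are hypothetical.

```python
def select_for_query(datasets, query):
    """Return the names of stored datasets that correspond to the query.

    datasets maps a dataset name to the list of keywords it contains.
    """
    q = query.strip().lower()
    if not q:
        return []
    selected = []
    for name, keywords in datasets.items():
        name_hit = q in name.lower()
        keyword_hit = any(q == keyword.lower() for keyword in keywords)
        if name_hit or keyword_hit:
            selected.append(name)
    return selected


stored = {
    "Gyeongbuk wildfire 2019": ["wildfire", "2019", "Gyeongsangbuk-do"],
    "Seoul traffic": ["traffic", "Seoul"],
}
hits = select_for_query(stored, "wildfire")
```

The selected names would then serve as the input from which the relationship map is generated.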

The 'data basket' is a conceptual store that temporarily holds the datasets and the like that a researcher or user has selected for data analysis. Through the data basket, the researcher or user can download the selected datasets all at once, which can shorten the time required to combine and analyze them.
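A data basket of this kind could be sketched as below. The class name `DataBasket`, the use of a ZIP archive for the combined download, and the `.csv` file suffix are assumptions made for illustration; the text only states that the selected datasets can be downloaded together.

```python
import io
import zipfile


class DataBasket:
    """Temporarily holds the names of datasets selected for analysis."""

    def __init__(self):
        self._names = []

    def add(self, name):
        # Adding the same dataset twice has no further effect.
        if name not in self._names:
            self._names.append(name)

    def download_all(self, storage):
        """Bundle every selected dataset into one in-memory ZIP archive.

        storage maps a dataset name to its raw bytes (hypothetical store).
        """
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
            for name in self._names:
                archive.writestr(f"{name}.csv", storage[name])
        return buffer.getvalue()


storage = {"fires": b"region,year\nGyeongbuk,2019\n", "rain": b"region,mm\nGyeongbuk,12\n"}
basket = DataBasket()
for chosen in ("fires", "rain", "fires"):
    basket.add(chosen)
bundle = basket.download_all(storage)
```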

'Multidimensional map view' provides a function for selecting which of the similarity, linkage, and semantic association is to be reflected in the relationship map, and may be implemented by the interface unit 170. Here, the similarity may be displayed as a similarity map, the linkage as a linkage map, and the semantic association as a semantic map.

The 'data filter' is a component that the researcher or user manipulates to view the relationship map in a desired form, and may likewise be implemented by the interface unit 170. The data filter may be used to receive the above-described thresholds or the number of circles forming the concentric structure; since these have already been described, the description is not repeated.

The 'search list' is a means or button that stores and shows the history of searches that the researcher or user has made through the search field, and may also be implemented by the interface unit 170.

FIG. 5 is a flowchart illustrating the procedure of a dataset management method according to an embodiment; the method may be performed by the dataset management apparatus 100 described above. The flowchart shown in FIG. 5 is merely an example, so FIG. 5 does not exclude embodiments in which, for instance, the steps are performed in a different order from that shown.

Referring to FIG. 5, the method may include: a step (S100) of extracting metadata from each of a plurality of datasets; a step (S110) of calculating the similarity between the plurality of datasets based on the degree of similarity between the metadata, calculating the degree of linkage (interlinking) between the plurality of datasets based on whether the datasets reference one another, and calculating the semantic association between the plurality of datasets based on the semantic information of each piece of metadata; and a step (S120) of generating a relationship map displaying the relationships among the plurality of datasets based on at least one of the calculated similarity, the calculated linkage, and the calculated semantic association.
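Steps S100 to S120 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation. Jaccard similarity is used because the claims name it as one option; the linkage score follows the claimed rule that datasets used together in deriving a result are treated as referencing one another, with that co-usage information supplied in advance. The metadata are simplified to a set of item names, and `semantic_fn` is a hypothetical placeholder for the known semantic module, which the text does not specify.

```python
from itertools import combinations


def extract_metadata(dataset):
    """S100: use the dataset's item names as its metadata (simplification)."""
    return set(dataset["items"])


def jaccard(a, b):
    """Jaccard similarity between two sets of metadata terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_relation_map(datasets, co_usage, semantic_fn, measures=("similarity",)):
    """Return edges (name_a, name_b, scores) for every pair of datasets.

    co_usage is a set of frozensets naming pairs known in advance to have
    been used together; semantic_fn(meta_a, meta_b) returns a score.
    """
    meta = {name: extract_metadata(ds) for name, ds in datasets.items()}  # S100
    edges = []
    for a, b in combinations(sorted(datasets), 2):
        scores = {  # S110
            "similarity": jaccard(meta[a], meta[b]),
            "linkage": 1.0 if frozenset((a, b)) in co_usage else 0.0,
            "semantic": semantic_fn(meta[a], meta[b]),
        }
        # S120: the map carries only the measures selected for display.
        edges.append((a, b, {m: scores[m] for m in measures}))
    return edges


datasets = {
    "fires": {"items": ["region", "year", "area"]},
    "rain": {"items": ["region", "year", "mm"]},
    "roads": {"items": ["id"]},
}
edges = build_relation_map(
    datasets,
    co_usage={frozenset(("fires", "rain"))},
    semantic_fn=lambda a, b: 0.0,  # placeholder semantic module
    measures=("similarity", "linkage"),
)
```

The resulting edge list is the raw material for drawing the map; thresholds and layout would be applied to it afterwards.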

Since this dataset management method is performed by the dataset management apparatus 100, it shares substantially the same technical idea as the apparatus; accordingly, for anything not described here, the description of the dataset management apparatus 100 applies.

Meanwhile, the dataset management method according to an embodiment may be implemented in the form of a computer-readable recording medium storing a computer program programmed to perform each step of the method, or in the form of a computer program stored in a computer-readable recording medium.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of rights of the present invention.

100: dataset management apparatus

Claims

A dataset management apparatus comprising:
a metadata extraction unit that extracts metadata from each of a plurality of datasets;
a similarity calculation unit that calculates a similarity between the plurality of datasets based on a degree of similarity between the metadata;
a linkage calculation unit that calculates a degree of linkage (interlinking) between the plurality of datasets based on whether the plurality of datasets reference one another;
a semantic association calculation unit that calculates a semantic association between the plurality of datasets based on semantic information of each piece of the metadata;
a relationship map generation unit that generates a relationship map displaying relationships among the plurality of datasets based on at least one of the calculated similarity, the calculated linkage, and the calculated semantic association; and
an interface unit that receives at least one of a first threshold for the similarity, a second threshold for the linkage, and a third threshold for the semantic association,
wherein the relationship map generation unit generates the relationship map based on at least one of: similarities at or above the first threshold among the calculated similarities, linkages at or above the second threshold among the calculated linkages, and semantic associations at or above the third threshold among the calculated semantic associations.

The dataset management apparatus of claim 1,
wherein each of the plurality of datasets contains a plurality of data related to a subject.

The dataset management apparatus of claim 2,
wherein the plurality of data include content corresponding to each of at least two items, and
the metadata extraction unit extracts, as metadata for the corresponding dataset, at least some of the at least two items and at least some of the content corresponding to each of the at least two items.

The dataset management apparatus of claim 1,
wherein the degree of similarity between the metadata is calculated as the cosine similarity or Jaccard similarity between vectors calculated for each piece of the metadata.

The dataset management apparatus of claim 1,
wherein, as to whether the plurality of datasets reference one another, datasets among the plurality of datasets that were used together in the process of deriving a predetermined result are determined to reference one another, and datasets that were not used together in the process of deriving the result are determined not to reference one another, and
information on whether some of the plurality of datasets were used together in the process of deriving the result is information obtained in advance.

The dataset management apparatus of claim 1,
wherein the interface unit further receives a selection of which of the calculated similarity, the calculated linkage, and the calculated semantic association is to serve as the basis for generating the relationship map, and
the relationship map is generated based on the selection further received through the interface unit.

(Deleted)

The dataset management apparatus of claim 1,
wherein the relationship map has a concentric-circle structure in which a plurality of circles of different diameters are concentric, and
one of the plurality of datasets is placed at the center of the concentric circles, and the remaining datasets are placed on circles, among the plurality of circles, other than at the center.

The dataset management apparatus of claim 8,
wherein the interface unit further receives the number of circles forming the concentric circles.

The dataset management apparatus of claim 2,
further comprising a keyword extraction unit that extracts keywords from each of the plurality of data,
wherein the similarity calculation unit calculates a similarity between the keywords,
the linkage calculation unit calculates a linkage between the keywords,
the semantic association calculation unit calculates a semantic association between the keywords, and
the relationship map generation unit generates a relationship map displaying relationships among the keywords based on at least one of the similarity, linkage, and semantic association calculated between the keywords.

A dataset management method performed by a dataset management apparatus, the method comprising:
extracting metadata from each of a plurality of datasets;
calculating a similarity between the plurality of datasets based on a degree of similarity between the metadata;
calculating a degree of linkage (interlinking) between the plurality of datasets based on whether the plurality of datasets reference one another;
calculating a semantic association between the plurality of datasets based on semantic information of each piece of the metadata;
generating a relationship map displaying relationships among the plurality of datasets based on at least one of the calculated similarity, the calculated linkage, and the calculated semantic association; and
receiving at least one of a first threshold for the similarity, a second threshold for the linkage, and a third threshold for the semantic association,
wherein the generating of the relationship map generates the relationship map based on at least one of: similarities at or above the first threshold among the calculated similarities, linkages at or above the second threshold among the calculated linkages, and semantic associations at or above the third threshold among the calculated semantic associations.

A computer-readable recording medium storing a computer program,
wherein the computer program comprises instructions for causing a processor to perform the method according to claim 11.

A computer program stored on a computer-readable recording medium,
wherein the computer program comprises instructions for causing a processor to perform the method according to claim 11.