KR102337290B1

KR102337290B1 - Method And Apparatus for Generating Context Category Dataset

Info

Publication number: KR102337290B1
Application number: KR1020200143376A
Authority: KR
Inventors: 김주호; 김현우; 고은영
Original assignee: 한국과학기술원
Priority date: 2020-03-10
Filing date: 2020-10-30
Publication date: 2021-12-08
Also published as: KR20210114324A

Abstract

The present disclosure provides an apparatus and method for generating a context category dataset.
According to one aspect of the present disclosure, there are provided an apparatus and method for generating a context category dataset that predict the context category to which a hashtag entered by a user belongs, receive from the user the context category to which the hashtag belongs, and generate and update the context category dataset accordingly.

Description

Apparatus and method for generating context category dataset {Method And Apparatus for Generating Context Category Dataset}

The present invention relates to an apparatus and method for generating a context category dataset.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

Technologies related to natural language processing and its applications, such as text generation, natural language generation, and intelligent agents, are advancing. However, it remains difficult to provide datasets with which the models used for such natural language processing and applications can be designed and trained.

One way to create a dataset for designing or training a model is crowdsourcing. Crowdsourcing, a compound of "crowd" and "outsourcing", means having the public participate in various production activities, including data collection. By involving many people in a production activity, crowdsourcing can shorten the time that the production process takes.

However, because the cost and time of crowdsourcing grow in proportion to the size of the data to be generated, datasets are increasingly being produced using machine prediction by artificial intelligence (AI) models. Datasets generated by machine prediction, however, have the drawback of being less accurate than datasets that humans classify or label directly.

Accordingly, classification and labeling of data using human-machine collaboration has recently been introduced. In human-machine collaboration, the machine first presents a predicted classification or label, and a human then reviews the prediction. Because it makes use of human feedback, this approach is expected to benefit the technical fields related to natural language processing and its applications described above.

Therefore, a method of generating a dataset that classifies natural language using human-machine collaboration needs to be devised.

US Patent Publication No. 2014-0337257 (2014.11.13)

According to one aspect of the present disclosure, a main object is to provide an apparatus and method for generating a context category dataset that predict the context category to which a hashtag entered by a user belongs, receive from the user the context category to which the hashtag belongs, and generate and update the context category dataset.

According to one aspect of the present disclosure, there is provided an apparatus for generating a context category dataset using a user interface (UI), the apparatus comprising: a list provider that provides a hashtag list for each context category; and a category prediction unit that predicts one or more context categories of hashtag information input from the user interface, using a word embedding vector generated for each context category based on the hashtag list, wherein the user interface provides the predicted context category to a user, receives context category information from the user, and provides it to the list provider.

According to another aspect of the present disclosure, there is provided a method for generating a context category dataset, the method comprising: generating a word embedding vector for each context category based on a hashtag list for each context category; receiving hashtag information from a user interface (UI); predicting one or more context categories of the hashtag information using the word embedding vectors; providing the predicted context category to a user through the user interface; receiving context category information from the user; and newly generating or updating the hashtag list based on the context category information.

According to one aspect of the present disclosure, by providing an apparatus and method that predict the context category to which a hashtag entered by a user belongs, receive from the user the context category to which the hashtag belongs, and generate and update a context category dataset, it is possible to proactively provide a natural language dataset classified in consideration of human situations and contexts.

Such natural language classification results, which take human situations and contexts into account, can be used to design and train machine learning or artificial intelligence models for text generation, which produces text describing a specific situation in consideration of the context of natural language; for intelligent agents, which automatically recognize the user's situation and context and generate appropriate words and sentences; or for natural language generation.

FIG. 1 is a conceptual diagram for explaining the process by which the context category dataset of the present disclosure is generated.
FIG. 2 is a block diagram illustrating an apparatus for generating a context category dataset according to an embodiment of the present disclosure.
FIGS. 3A and 3B are exemplary diagrams of word embedding vectors and an embedding vector of hashtag information according to an embodiment of the present disclosure.
FIG. 4 is an exemplary diagram of a user interface according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating a method for generating a context category dataset according to an embodiment of the present disclosure.

Hereinafter, some embodiments of the present disclosure are described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the present disclosure, a detailed description of a related known configuration or function is omitted when it is judged that it could obscure the gist of the present disclosure.

In describing the components of the present disclosure, terms such as first and second may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'have' a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as '... unit' and 'module' used in the specification refer to a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

The detailed description set forth below in conjunction with the accompanying drawings is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure may be practiced.

In the present disclosure, a context category dataset is a dataset that provides hashtags, or the words of hashtags, that have contextual connectivity with a specific category. The context category dataset is generated by presenting to the user a machine prediction of the context category to which a hashtag belongs and receiving the user's review of that prediction, which improves the accuracy of context category prediction. That is, the context category dataset of the present disclosure is generated and collected through human-machine crowdsourcing.

The description below assumes that the context category dataset is organized as a hashtag list for each context category. However, the data structure of the context category dataset is not limited to a list; any data structure that can store and manage (e.g., create, delete, search, traverse, and reference) the hashtags belonging to a context category, such as a tree or a hash table, may serve as the data structure of the context category dataset in the present disclosure.

In the present disclosure, it is assumed that there may be one or more of each of the following: context categories, elements of a category list, hashtags in the hashtag information, and context categories in the context category information.

FIG. 1 is a conceptual diagram for explaining the process by which the context category dataset of the present disclosure is generated.

When hashtag information consisting of one or more hashtags is received from the user, the context category dataset generating apparatus presents to the user the context category to which each hashtag in the hashtag information is expected to belong. The context category is predicted using a word embedding vector generated from the hashtag list of each context category. Here, the word embedding vector is the position of a context category in the embedding vector space, calculated from the embedding vectors obtained by word embedding the elements of that category's list. Such a word embedding vector may be obtained, for example, by computing the centroid of the cluster formed by the embedding vectors of the hashtag list elements corresponding to each context category.
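The centroid computation mentioned above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy two-dimensional embeddings, and the two example categories are all assumptions made for the example, and a real system would use vectors from a trained word embedding model.

```python
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Return the component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot take the centroid of an empty cluster")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def category_vectors(hashtag_lists: Dict[str, List[str]],
                     embeddings: Dict[str, Vector]) -> Dict[str, Vector]:
    """Map each context category to the centroid of its list elements' embeddings."""
    return {
        category: centroid([embeddings[tag] for tag in tags if tag in embeddings])
        for category, tags in hashtag_lists.items()
    }


# Hypothetical toy embeddings (made-up values, 2-D for readability).
TOY_EMBEDDINGS = {
    "paris": [1.0, 0.0],
    "seoul": [0.8, 0.2],
    "morning": [0.0, 1.0],
    "night": [0.2, 0.8],
}
HASHTAG_LISTS = {"Location": ["paris", "seoul"], "Time": ["morning", "night"]}
CATEGORY_VECTORS = category_vectors(HASHTAG_LISTS, TOY_EMBEDDINGS)
```

With these toy values, the "Location" vector lies near (0.9, 0.1) and the "Time" vector near (0.1, 0.9), so each category ends up positioned in the middle of its own list elements.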

The user selects a context category for each hashtag based on the context categories presented by the context category dataset generating apparatus. That is, by accepting the presented context categories as they are, or by modifying all or some of them, the user provides the context category for each hashtag to the apparatus as context category information. To improve the prediction accuracy of the context categories, the apparatus updates the context category dataset based on the provided context category information and the previously entered hashtag information. Such an update may be made, for example, by adding a hashtag, or each word of a hashtag, as a new element to the hashtag list of each context category according to the context category information, or by replacing an existing element.

FIG. 2 is a block diagram illustrating an apparatus for generating a context category dataset according to an embodiment of the present disclosure.

The context category dataset generating apparatus 200 according to an embodiment of the present disclosure includes all or some of a list provider 210, a category prediction unit 220, and a user interface 230. The apparatus 200 shown in FIG. 2 is according to one embodiment of the present disclosure; not all of the components shown in FIG. 2 are essential, and in other embodiments some components may be added, changed, or removed. For example, in another embodiment the apparatus may further include a performance estimate unit (not shown) that evaluates the performance of context category prediction, the evaluation being used for regenerating the word embedding vector of each context category.

Although FIG. 2 depicts the context category dataset generating apparatus 200 as a device for convenience of description, in other embodiments the apparatus may be implemented as a software module or a processor that performs the functions of the components 210 to 230.

The list provider 210 generates and manages a hashtag list for each context category as the context category dataset and provides it to the category prediction unit 220. Based on the hashtag information and context category information received from the user interface 230, the list provider 210 may generate a hashtag list for a new context category or update the hashtag list of an existing context category.

The list provider 210 may preprocess the hashtag information before generating or updating a hashtag list. Such preprocessing may be, for example, converting every hashtag in the hashtag information to uppercase or lowercase, removing spaces or special characters from each hashtag, or stochastically classifying (separating) the multiple words, multiple characters, or combinations of words and numbers that make up each hashtag, but is not limited to these.
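The case conversion and character removal steps described above might look like the following sketch. The function names are hypothetical, lowercase is chosen arbitrarily over uppercase, and underscores are treated as special characters by assumption; word segmentation is a separate step and is not covered here.

```python
import re
from typing import List


def normalize_hashtag(raw: str) -> str:
    """Lowercase a hashtag and drop '#', whitespace, and other special characters."""
    lowered = raw.lower()
    # In Python 3, \W matches any character that is not a Unicode letter,
    # digit, or underscore, so Hangul and accented letters are preserved.
    return re.sub(r"[\W_]+", "", lowered)


def normalize_all(raw_tags: List[str]) -> List[str]:
    """Normalize every hashtag in the hashtag information."""
    return [normalize_hashtag(tag) for tag in raw_tags]
```

For instance, `normalize_hashtag("#Sunday Morning!")` yields `sundaymorning`, so differently typed variants of the same hashtag collapse to one list element.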

Based on the hashtag list provided by the list provider 210, the category prediction unit 220 uses the word embedding vector generated for each context category to predict one or more context categories of the hashtag information input from the user interface 230. Here, predicting the context category of the hashtag information means predicting, for each hashtag in the hashtag information, one or more context categories to which it belongs. Specifically, the category prediction unit 220 includes all or some of a vector provider 222, which generates or updates all or some of the word embedding vectors of the context categories, and a prediction unit 224, which computes for each context category the probability that the hashtag information belongs to it and predicts one or more context categories of the hashtag information. However, not all of the components shown in FIG. 2 are essential, and some components may be added, changed, or removed.

When the hashtag lists are first created, the vector provider 222 performs word embedding on one or more basic tags preset in each hashtag list and forms, in a predefined embedding vector space, a cluster consisting of the embedding vectors of those basic tags. The dimension of the embedding vector space may be redefined according to a parameter preset in the vector provider 222, the minimum dimension required for word embedding, the prediction performance of the prediction unit 224, and so on. The vector provider 222 sets the center of each cluster as the word embedding vector of the corresponding context category. Each word embedding vector may then be updated to reflect the embedding vectors of elements added to the corresponding hashtag list. For example, each time the number of elements newly added to the hashtag list of a particular context category reaches a preset number, the vector provider 222 may rebuild the cluster, find its new center, and reset that center as the word embedding vector of the context category.
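The count-triggered refresh in the example above can be sketched as a small class. `CategoryCluster`, `refresh_every`, and the other names are hypothetical, and the cluster center is taken to be the arithmetic mean of the member vectors, which is one reasonable reading of "center" rather than something the text fixes.

```python
from typing import List

Vector = List[float]


class CategoryCluster:
    """Embedding cluster of one context category with a lazily refreshed center."""

    def __init__(self, base_vectors: List[Vector], refresh_every: int = 3):
        if not base_vectors:
            raise ValueError("at least one basic tag vector is required")
        self.vectors = [list(v) for v in base_vectors]
        self.refresh_every = refresh_every  # preset number of additions per refresh
        self.pending = 0                    # additions since the last refresh
        self.center = self._centroid()      # current word embedding vector

    def _centroid(self) -> Vector:
        n = len(self.vectors)
        dim = len(self.vectors[0])
        return [sum(v[i] for v in self.vectors) / n for i in range(dim)]

    def add(self, vector: Vector) -> bool:
        """Add a new element vector; return True if the center was recomputed."""
        self.vectors.append(list(vector))
        self.pending += 1
        if self.pending >= self.refresh_every:
            self.center = self._centroid()
            self.pending = 0
            return True
        return False
```

Deferring the recomputation keeps predictions stable between refreshes; a smaller `refresh_every` makes the category vector track user feedback more closely at the cost of more frequent updates.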

The prediction unit 224 generates an embedding vector for each hashtag in the hashtag information and, for each hashtag, computes the distance between the hashtag's embedding vector and the word embedding vector of each context category, thereby computing the probability that the hashtag belongs to each context category. Specifically, if an embedding vector can be extracted for the hashtag, the prediction unit 224 computes the distance between the extracted vector and each word embedding vector; if not, it splits the hashtag into words, sums the embedding vectors of the words to form the hashtag's embedding vector, and computes the distance from that. Based on the computed distances, the prediction unit 224 may provide the N context categories with the shortest distances (N being a natural number of 1 or more) to the user interface 230 as the predicted context categories of the hashtag.
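The top-N selection can be illustrated with a short sketch. The text does not name a distance metric, so Euclidean distance is an assumption here, as are the function names and the example vectors.

```python
import math
from typing import Dict, List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_categories(tag_vector: Vector,
                     category_vectors: Dict[str, Vector],
                     n: int = 1) -> List[str]:
    """Return the n categories whose word embedding vectors are closest to the hashtag."""
    ranked = sorted(category_vectors,
                    key=lambda c: euclidean(tag_vector, category_vectors[c]))
    return ranked[:n]


# Hypothetical category vectors and a hashtag vector lying close to "Time".
EXAMPLE_CATEGORIES = {
    "Location": [0.9, 0.1],
    "Time": [0.1, 0.9],
    "Activity": [0.5, 0.5],
}
EXAMPLE_PREDICTION = top_n_categories([0.2, 0.8], EXAMPLE_CATEGORIES, n=2)
```

Here `EXAMPLE_PREDICTION` is `["Time", "Activity"]`, the two nearest categories in order of increasing distance.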

As another example, the prediction unit 224 may normalize the computed distances for each context category to obtain the probability that the hashtag information belongs to each context category, and provide all or some of the context categories whose probability is at or above a preset threshold to the user interface 230 as the predicted context categories of the hashtag.
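The text leaves the normalization unspecified. One possible choice, shown below purely as an assumption, is a softmax over negated distances, which turns shorter distances into larger probabilities that sum to one. The function names and the example distances are hypothetical.

```python
import math
from typing import Dict, List


def category_probabilities(distances: Dict[str, float]) -> Dict[str, float]:
    """Convert per-category distances into probabilities via softmax(-distance)."""
    weights = {c: math.exp(-d) for c, d in distances.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def categories_above(distances: Dict[str, float], threshold: float) -> List[str]:
    """Return categories with probability >= threshold, most probable first."""
    probs = category_probabilities(distances)
    ranked = sorted(probs.items(), key=lambda item: -item[1])
    return [c for c, p in ranked if p >= threshold]


# Hypothetical distances from one hashtag to three category vectors.
EXAMPLE_DISTANCES = {"Time": 0.1, "Activity": 0.2, "Location": 3.0}
EXAMPLE_SELECTED = categories_above(EXAMPLE_DISTANCES, threshold=0.3)
```

With these distances, "Time" and "Activity" each receive a probability above 0.3 and are selected, while the distant "Location" category falls below the threshold.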

The user interface 230 receives hashtag information from the user and provides it to the list provider 210 and/or the category prediction unit 220. It then presents to the user the predicted context categories received from the category prediction unit 220, receives context category information from the user, and provides that information to the list provider 210. A specific example of the user interface 230 is described below with reference to FIG. 4.

FIGS. 3A and 3B are exemplary diagrams of word embedding vectors and an embedding vector of hashtag information according to an embodiment of the present disclosure.

A, B, and C in FIGS. 3A and 3B are each the word embedding vector of a context category, shown in a three-dimensional embedding vector space as the center of the cluster formed by the embedding vectors of the corresponding hashtag list elements. Although FIGS. 3A and 3B depict the embedding vector space as three-dimensional, this is only an example, and the space may of course have more or fewer than three dimensions.

For the hashtag information entered by the user through the user interface (labeled hashtag in FIGS. 3A and 3B), FIG. 3A shows the case in which an embedding vector of the hashtag information can be extracted in the embedding vector space, and FIG. 3B shows the case in which it cannot. Here, extracting the embedding vector of the hashtag information means extracting an embedding vector for each hashtag in the hashtag information, so whether extraction is possible may differ even among hashtags in the same hashtag information.

When an embedding vector can be extracted, the context category dataset generating apparatus extracts the embedding vector of the hashtag information. When it cannot, the apparatus assumes that the hashtag information consists of multiple words, splits it into words using an algorithm or library for that purpose, and extracts the embedding vector of each resulting word (hashtag#1 and hashtag#2 in FIG. 3B). The apparatus then sums the embedding vectors of the words to obtain the embedding vector of the hashtag information.
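The fallback described above can be sketched as follows. The text only says that some algorithm or library splits the hashtag; the dictionary-based dynamic programming below, which prefers the split with the fewest words, is a stand-in chosen for illustration, and all names and toy vectors are hypothetical.

```python
from typing import Dict, List, Optional

Vector = List[float]


def segment(tag: str, vocabulary) -> Optional[List[str]]:
    """Split tag into vocabulary words using the fewest pieces; None if impossible."""
    best: List[Optional[List[str]]] = [None] * (len(tag) + 1)
    best[0] = []  # the empty prefix needs no words
    for end in range(1, len(tag) + 1):
        for start in range(end):
            piece = tag[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(tag)]


def hashtag_vector(tag: str, embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Look up the hashtag's vector, or sum its words' vectors when lookup fails."""
    if tag in embeddings:
        return list(embeddings[tag])          # directly extractable (FIG. 3A case)
    words = segment(tag, embeddings)
    if not words:
        return None                           # cannot be split into known words
    dim = len(embeddings[words[0]])
    # Sum of the word vectors (FIG. 3B case).
    return [sum(embeddings[w][i] for w in words) for i in range(dim)]


# Hypothetical toy vocabulary with 2-D vectors.
TOY = {"sunday": [1.0, 0.0], "morning": [0.0, 1.0],
       "sun": [0.5, 0.5], "day": [0.25, 0.25]}
```

With this vocabulary, `segment("sundaymorning", TOY)` returns `["sunday", "morning"]` rather than the three-word split, and `hashtag_vector("sundaymorning", TOY)` is the sum `[1.0, 1.0]`.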

The context category dataset generating apparatus computes the distance between the extracted or obtained embedding vector of the hashtag information and the word embedding vector of each context category, and predicts that the shorter the distance, the higher the probability that the hashtag information belongs to that context category.

FIG. 4 is an exemplary diagram of a user interface according to an embodiment of the present disclosure.

In the embodiment of FIG. 4, an image is used to establish connectivity between hashtags and context categories. That is, the user enters an image and hashtag information into the user interface and, once the context categories predicted by the context category dataset generating apparatus are presented, enters context category information into the user interface by accepting or modifying the presented categories. Based on the received context category information and hashtag information, the apparatus regenerates or updates the hashtag lists and/or the word embedding vectors of the context categories, thereby improving prediction accuracy and accumulating the context category dataset.

Referring to (a) of FIG. 4, the context category dataset generating apparatus receives from the user, through the user interface, an image and one or more hashtags about the image as hashtag information. Because the image and hashtags are entered together, the hashtags can be expected to share a context related to the image. To facilitate generation of the context category dataset, the user interface preferably asks the user to enter hashtags of at least two characters, or to enter at least a certain number of such hashtags (e.g., five or more). In addition, to collect hashtags that are valid for generating the context category dataset, the user interface preferably invalidates duplicated hashtag information and asks the user to re-enter it.
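The input checks suggested above could be expressed as a small validation helper. This is only a sketch: the function name, the message wording, and the decision to compare hashtags case-insensitively are assumptions, and the text presents the length and count requirements as alternatives, whereas this sketch applies both.

```python
from typing import List


def validate_hashtags(tags: List[str], min_length: int = 2,
                      min_count: int = 5) -> List[str]:
    """Return a list of problems; an empty list means the input is acceptable."""
    problems = []
    cleaned = [tag.lstrip("#").lower() for tag in tags]
    if any(len(tag) < min_length for tag in cleaned):
        problems.append(f"each hashtag needs at least {min_length} characters")
    if len(cleaned) < min_count:
        problems.append(f"enter at least {min_count} hashtags")
    if len(set(cleaned)) != len(cleaned):
        problems.append("duplicate hashtags are not accepted; please re-enter")
    return problems
```

A user interface could show the returned messages next to the input field and block submission until the list is empty.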

In the example of (a) of FIG. 4, an image of trees and a river is entered into the user interface, together with the hashtags #strasboug, #petitefrance, #sundaymorning, #morningwalk, and #christmasvacation.

Accordingly, the context category dataset generating apparatus extracts or obtains the embedding vector of each hashtag in the hashtag information, predicts the context categories, and presents the predicted context categories to the user through the user interface. To obtain the embedding vector of each hashtag more easily, the apparatus may preprocess each hashtag (e.g., splitting it into words, converting case, or removing spaces or symbols). As (b) and (c) of FIG. 4 show, the context categories include Emotion, Mood, Location, Time, Object, Activity, Event, and Other, but these are examples and the categories are of course not limited to them.

Referring to FIG. 4(b), the context category dataset generating apparatus presents the following predicted context categories for each hashtag: the Location category for #strasboug, the Location category for #petitefrance, the Time category for #sundaymorning, the Time and Activity categories for #morningwalk, and the Time and Event categories for #christmasvacation.

FIG. 4(c) shows the result of the user selecting context categories through the user interface. Of the context categories presented, the user removed the Event category from the hashtag #christmasvacation. The context category dataset generating apparatus updates the hashtag list of each context category based on the context category information, that is, the user's selections. For example, it may add strasboug and petitefrance to the hashtag list of the Location category; sundaymorning, morningwalk, and christmasvacation to the hashtag list of the Time category; and morningwalk to the hashtag list of the Activity category.
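The list update in this example can be sketched as below. The data structures are assumptions made for illustration: `hashtag_lists` maps a category name to a list of hashtags, and `selections` maps each hashtag to the categories the user confirmed. Neither name comes from the disclosure.

```python
from collections import defaultdict


def update_hashtag_lists(hashtag_lists, selections):
    """Add each hashtag to the list of every category the user confirmed."""
    for tag, categories in selections.items():
        for category in categories:
            if tag not in hashtag_lists[category]:
                hashtag_lists[category].append(tag)
    return hashtag_lists


# A defaultdict creates the list for a category the first time it is used.
hashtag_lists = defaultdict(list)
selections = {
    "strasboug": ["Location"],
    "petitefrance": ["Location"],
    "sundaymorning": ["Time"],
    "morningwalk": ["Time", "Activity"],
    "christmasvacation": ["Time"],  # the user removed "Event"
}
update_hashtag_lists(hashtag_lists, selections)
```

After the call, the Location, Time, and Activity lists hold the hashtags named in the example, and no Event list is created because the user rejected that category.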

As another example, the context category dataset generating apparatus may perform preprocessing that splits each hashtag into words, and add the words of the hashtag to each hashtag list. For example, instead of sundaymorning, morningwalk, and christmasvacation, it adds sunday, morning, morningwalk, christmas, and vacation to the hashtag list of the Time category.
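The disclosure does not say how a hashtag is split into words. One possible approach, shown here purely as an assumption, is dictionary-based segmentation: `split_hashtag` and the `vocabulary` set are hypothetical, and a hashtag that cannot be segmented completely is kept whole.

```python
def split_hashtag(tag, vocabulary):
    """Split `tag` into vocabulary words; return [tag] if no full split exists."""
    # best[i] holds one segmentation of tag[:i], or None if there is none.
    best = [None] * (len(tag) + 1)
    best[0] = []
    for end in range(1, len(tag) + 1):
        for start in range(end):
            if best[start] is not None and tag[start:end] in vocabulary:
                best[end] = best[start] + [tag[start:end]]
                break
    return best[-1] if best[-1] else [tag]


# Hypothetical vocabulary; "walk" is deliberately absent.
vocabulary = {"sunday", "morning", "christmas", "vacation"}
words = []
for tag in ["sundaymorning", "morningwalk", "christmasvacation"]:
    for word in split_hashtag(tag, vocabulary):
        if word not in words:
            words.append(word)
```

With this particular vocabulary, `words` ends up matching the list in the example, because morningwalk has no complete segmentation and is therefore retained as a single entry.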

FIG. 5 is a flowchart illustrating a method for generating a context category dataset according to an embodiment of the present disclosure.

The context category dataset generating apparatus generates a word embedding vector for each context category by using the hashtag list of each context category (S500).
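The claims describe this word embedding vector as the centroid of the embedding vectors of the hashtags in the list. A minimal sketch of step S500 under that reading follows; the function names, the `embed` callable, and the two-dimensional toy vectors are made up for illustration and do not come from any real embedding model.

```python
def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def category_vectors(hashtag_lists, embed):
    """Map each context category to the centroid of its hashtags' embeddings."""
    return {
        category: centroid([embed(tag) for tag in tags])
        for category, tags in hashtag_lists.items()
        if tags  # categories with an empty list are skipped
    }


# Toy embeddings (invented values) standing in for a pretrained model.
toy_embeddings = {
    "paris": [1.0, 0.0],
    "seoul": [0.0, 1.0],
    "morning": [0.0, 2.0],
}
vectors = category_vectors(
    {"Location": ["paris", "seoul"], "Time": ["morning"]},
    toy_embeddings.__getitem__,
)
```

In practice `embed` would be replaced by a lookup into whatever embedding model the apparatus uses.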

The context category dataset generating apparatus receives hashtag information through the user interface (S510). This hashtag information may be one or more hashtags related to material (e.g., an image, video, or document) uploaded through the user interface.

The apparatus determines whether an embedding vector of the received hashtag information, that is, its position in the embedding vector space, can be extracted (S520). If it can, the apparatus calculates the distance between the embedding vector of the hashtag information and each word embedding vector, and outputs as the predicted context categories the context categories corresponding to the N word embedding vectors with the shortest distances (N being a natural number greater than or equal to 1) (S530). Alternatively, it outputs as the predicted context categories those corresponding to embedding vectors whose distance is equal to or greater than a preset threshold.
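The top-N selection of step S530 can be sketched as follows. The text does not name a distance metric, so Euclidean distance is assumed here; `predict_categories` and the toy vectors are hypothetical.

```python
import math


def predict_categories(tag_vector, category_vectors, top_n=1):
    """Return the top_n categories whose vectors are closest to tag_vector."""
    ranked = sorted(
        category_vectors,
        key=lambda category: math.dist(tag_vector, category_vectors[category]),
    )
    return ranked[:top_n]


# Toy category vectors (invented values) in a two-dimensional space.
toy_categories = {
    "Location": [1.0, 0.0],
    "Time": [0.0, 1.0],
    "Activity": [0.7, 0.7],
}
predicted = predict_categories([0.1, 0.9], toy_categories, top_n=2)
```

Another metric, such as cosine distance, could be substituted by changing only the `key` function.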

If it determines that the embedding vector cannot be extracted, the context category dataset generating apparatus obtains the embedding vector of the hashtag information from the embedding vectors of the words into which the hashtag information is split (S522). It then predicts the context categories by calculating the distance between the embedding vector of the hashtag information and each word embedding vector (S530).
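Steps S520 and S522 together amount to a lookup with a fallback. The sketch below assumes, following the claims, that the word vectors are summed; the function name `embed_hashtag`, the dictionary-based `embeddings` table, and returning `None` when no word is known are all illustrative choices rather than part of the disclosure.

```python
def embed_hashtag(tag, words, embeddings):
    """Look up `tag` directly; otherwise sum the vectors of its known words."""
    if tag in embeddings:  # S520: the vector can be extracted as-is
        return embeddings[tag]
    known = [embeddings[word] for word in words if word in embeddings]
    if not known:
        return None  # nothing to fall back on
    # S522: combine the word vectors component by component
    return [sum(component) for component in zip(*known)]


# Toy embedding table with invented values.
toy_table = {"sunday": [1.0, 2.0], "morning": [0.5, 0.5]}
vector = embed_hashtag("sundaymorning", ["sunday", "morning"], toy_table)
```

The `words` argument is expected to come from whatever word-splitting preprocessing the apparatus applies to the hashtag.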

The context category dataset generating apparatus presents the predicted context categories to the user through the user interface and receives context category information from the user (S540). The user thus enters the context category information into the user interface with the predicted context categories in view.

Based on the hashtag information and the context category information, the apparatus adds new elements to the hashtag list of each context category, or creates a new context category and adds elements to its hashtag list (S550).

Although FIG. 5 describes the processes as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present disclosure. In other words, a person of ordinary skill in the art to which an embodiment of the present disclosure pertains could change the order shown in FIG. 5, or execute one or more of the processes in parallel, without departing from the essential characteristics of the embodiment. FIG. 5 is therefore not limited to a time-series order.

Various implementations of the apparatuses, units, processes, steps, and the like described herein may be realized in digital electronic circuits, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation as one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special purpose processor or a general purpose processor) coupled to receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for a programmable processor and are stored on a "computer-readable recording medium".

The computer-readable recording medium includes every type of recording device that stores data readable by a computer system. Such a medium may be a non-volatile or non-transitory medium such as a ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, or storage device, and may further include a transitory medium such as a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems, so that computer-readable code is stored and executed in a distributed manner.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another type of storage system, or a combination thereof), and at least one communication interface. For example, the programmable computer may be one of a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment pertains may make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be construed according to the claims below, and all technical ideas within a scope equivalent to them should be construed as falling within the scope of rights of the present embodiment.

200: context category dataset generating apparatus
210: list providing unit
220: category prediction unit
222: vector providing unit
224: prediction unit
230: user interface

Claims

1. An apparatus for generating a context category dataset using a user interface (UI), the apparatus comprising:
a list providing unit that provides a hashtag list for each context category; and
a category prediction unit that predicts one or more context categories of hashtag information input through the user interface, by using a word embedding vector generated for each context category based on the hashtag list,
wherein the user interface provides the predicted context categories to a user, receives context category information from the user, and provides the context category information to the list providing unit.

2. The apparatus of claim 1,
wherein the word embedding vector is a vector indicating the position of its corresponding context category in a predefined embedding vector space.

3. The apparatus of claim 1,
wherein the list providing unit creates or updates the hashtag list by creating a hashtag list for a new context category or updating the hashtag list of an existing context category, based on the hashtag information and the context category information.

4. The apparatus of claim 3,
wherein the list providing unit creates or updates the hashtag list after performing preprocessing that splits the hashtag information into words.

5. The apparatus of claim 1,
wherein the category prediction unit comprises:
a vector providing unit that generates or updates the word embedding vector; and
a prediction unit that predicts one or more context categories to which the hashtag information belongs by calculating, using the word embedding vector, a probability that the hashtag information belongs to the context category.

6. The apparatus of claim 5,
wherein the vector providing unit generates the word embedding vector by assigning, based on one or more basic tags preset in the hashtag list, the centroid of a cluster consisting of the embedding vectors of the basic tags as the word embedding vector of the corresponding context category.

7. The apparatus of claim 5,
wherein, when an element is added to the hashtag list, the vector providing unit updates the word embedding vector by further considering the embedding vector of the added element and reassigning the centroid of a cluster consisting of the embedding vectors of the elements included in the hashtag list as the word embedding vector of the corresponding context category.

8. The apparatus of claim 7,
wherein the reassignment is performed each time the number of hashtags added to the hashtag list reaches a preset number.

9. The apparatus of claim 5,
wherein the prediction unit generates an embedding vector of the hashtag information and calculates its distance from each of the word embedding vectors, thereby calculating the probability that the hashtag information belongs to each of the context categories.

10. The apparatus of claim 9,
wherein the embedding vector of the hashtag information is generated by extracting an embedding vector from the hashtag information and, when an embedding vector cannot be extracted from the hashtag information, by splitting the hashtag information into words and summing the embedding vectors of the words.

11. The apparatus of claim 9,
wherein the prediction unit predicts, as the context category of the hashtag, all or some of the context categories for which the distance between the embedding vector of the hashtag information and the corresponding word embedding vector is equal to or greater than a preset threshold.

12. A method for generating a context category dataset, the method comprising:
generating a word embedding vector for each context category based on a hashtag list for each context category;
receiving hashtag information from a user interface (UI);
predicting one or more context categories of the hashtag information by using the word embedding vector;
providing the predicted context categories to a user through the user interface;
receiving context category information from the user; and
newly creating or updating the hashtag list based on the context category information.

13. The method of claim 12,
wherein generating the word embedding vector comprises assigning the centroid of a cluster consisting of the embedding vectors of the elements included in the hashtag list as the word embedding vector of the corresponding context category.

14. The method of claim 12,
wherein predicting the one or more context categories comprises generating an embedding vector of the hashtag information and predicting based on a calculated distance from each of the word embedding vectors.

15. The method of claim 14,
wherein the predicting is performed by normalizing the distance from each of the word embedding vectors for the corresponding context category to calculate a probability that the hashtag information belongs to each context category, and selecting the top N context categories with the highest calculated probabilities (N being a natural number greater than or equal to 1).

16. A computer program stored in a computer-readable recording medium to execute each process included in the method for generating a context category dataset according to any one of claims 12 to 15.