KR102479381B1

KR102479381B1 - System for content clustering based on thesaurus and method therefor

Info

Publication number: KR102479381B1
Application number: KR1020210013444A
Authority: KR
Inventors: 장준도; 김상원; 이종영
Original assignee: 인하대학교 산학협력단
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2022-12-19
Also published as: KR20220109886A

Abstract

The present invention relates to a content clustering system based on synonyms, and in particular to a synonym-based content clustering system and method that build a database by classifying content elements based on synonyms. According to the present invention, a database can be created for expressions of an equal level that have similar meanings, and a separately classified database can be created for expressions that belong to different categories and can be arranged hierarchically.

Description

System for content clustering based on thesaurus and method therefor

The present invention relates to a content clustering system based on synonyms, and more particularly, to a synonym-based content clustering system and method that build a database by classifying content elements based on synonyms.

With the development of the Internet, the number of users is growing rapidly and the web service environment is changing in various ways. Whereas conventional web services were static and passive, web services are gradually becoming dynamic and active, and Web 2.0 was introduced to reflect this trend.

Web 2.0 refers to a series of movements developed to encourage information sharing and participation among Internet users through open information, and to continuously increase the value of the information created. In other words, in Web 2.0, users can freely participate and can produce, recreate, and share content on the basis of an open web environment.

In Web 2.0, information is produced by users and organized by the tags that users attach. Users can easily share this information, so various resources become interrelated. The Web 2.0 phenomenon has thus become an essential strategy for all Internet sites, and various techniques have been introduced to implement Web 2.0 successfully.

One of these techniques is tagging. Tagging is widely used, from web documents such as blogs to multimedia content such as images and videos. Users attach tags to the content they create so that it can be searched and classified easily.

According to the conventional tag clustering apparatus based on related tags disclosed in Patent Publication No. 10-2010-0013157, the occurrence frequency of identical tags is extracted in a tag mapping process, only those related tags whose frequency is at or above a threshold are extracted from the related tag pairs to generate a tag cluster, and tags representing synonyms are regarded as the same tag.
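By way of illustration only, the frequency-threshold idea described for the prior art can be sketched as follows. This is a minimal sketch and not the implementation of the cited publication: the function name `related_tag_pairs`, the sample data, and the choice of counting co-occurrence per content item are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def related_tag_pairs(tagged_items, threshold):
    """Count how often two tags appear on the same item and keep frequent pairs.

    Hypothetical helper: the cited publication does not specify this interface.
    """
    pair_counts = Counter()
    for tags in tagged_items:
        # Sort the de-duplicated tags so (a, b) and (b, a) count as one pair.
        for pair in combinations(sorted(set(tags)), 2):
            pair_counts[pair] += 1
    # Keep only pairs whose co-occurrence count reaches the threshold.
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}


# Invented sample data: each inner list holds the tags of one content item.
items = [
    ["car", "automobile", "road"],
    ["car", "automobile"],
    ["car", "road"],
    ["automobile", "garage"],
]
frequent = related_tag_pairs(items, threshold=2)
```

With this sample, only the pairs that co-occur at least twice survive, which also shows the limitation discussed next: a pair that is rare in the collected data is dropped even if the two tags are close in meaning.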

Statistical inference such as frequency-based text analysis could, in theory, solve this problem very well if records of all language used in human activity were available. In practice, however, the text that can be collected is limited. Moreover, the particular language habits of subgroups of language users, which can be divided by region, age, field of interest, occupation, and so on, have a linguistic diversity that such general analysis can miss. The problem is therefore that the language database used by statistical inference to predict the similarity of expressions cannot be the same in every situation.

Patent Publication No. 10-2010-0013157 (2010.02.09)

The present invention has been devised to solve the above problem. Its object is to provide a synonym-based content clustering system and method that define similar data with various linguistic metadata and recognize query data comprehensively, so that content can be semantically classified for various purposes and situations.

A content clustering system based on synonyms according to one aspect of the present invention includes: a tag extraction unit that extracts tags related to the same content from among content included in a predetermined population; an arrangement unit that, in the tag extraction process, arranges the tags in a hierarchical structure having a semantic hierarchy; an anchor unit that, in the process of arranging the tags, constructs anchors connecting semantically close tags; and a classification unit that clusters content by assigning weights to the connection adjacency of tag concepts for the connections made through the anchor unit and classifying the concepts according to connection strength.

Preferably, the tag extraction unit extracts tags related to the same content from among content that includes predefined tagging information in text form, so that search results can be shown based on a user's text query.

Preferably, the classification unit causes metadata, including tags or descriptions that linguistically explain the content of the data, to be stored in association with the original data of the content.

Preferably, the arrangement unit arranges each concept, in the tag extraction process, in a hierarchical structure having a semantic hierarchy, and stores inclusion relationships and synonym relationships in a database.

Preferably, the anchor unit, in the process of arranging the tags, constructs anchors that horizontally connect semantically close tags, thereby connecting adjacent concepts.

Meanwhile, a content clustering method based on synonyms includes: (a) extracting tags related to the same content from among content included in a predetermined population; (b) arranging the tags, in the tag extraction process, in a hierarchical structure having a semantic hierarchy; (c) constructing, in the process of arranging the tags, anchors that connect semantically close tags; and (d) a classification step of clustering content by assigning weights to the connection adjacency of tag concepts for the connections made through the anchors and classifying the concepts according to connection strength.

According to the synonym-based content clustering system of an embodiment of the present invention, similar data is defined with various linguistic metadata and query data is recognized comprehensively, so that content can be semantically classified for various purposes and situations.

FIG. 1 is a diagram showing the configuration of a content clustering system based on synonyms according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating text information tagging in a content clustering system based on synonyms according to an embodiment of the present invention.
FIG. 3 is a diagram showing the categories of an image displayed in a video.
FIG. 4 is a diagram illustrating a situation in which expressions are synonyms but their notation in the database differs.
FIG. 5 is a diagram showing associations between words.
FIG. 6 is a diagram for explaining the hierarchical arrangement structure and anchor connections of a content clustering system based on synonyms according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a content clustering method based on synonyms according to an embodiment of the present invention.

The specific structural and functional descriptions presented in the embodiments of the present invention are given only to explain embodiments according to the concept of the present invention, and such embodiments may be implemented in various forms. The invention should not be construed as limited to the embodiments described in this specification, but should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Meanwhile, terms such as first and/or second may be used in the present invention to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In describing the embodiments, descriptions of related well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the present invention.

FIG. 1 is a diagram showing the configuration of a content clustering system based on synonyms according to an embodiment of the present invention. As shown in FIG. 1, the synonym-based content clustering system 10 includes a tag extraction unit 100, an arrangement unit 200, an anchor unit 300, and a classification unit 400.

The tag extraction unit 100 extracts tags related to the same content from among content included in a predetermined population. Specifically, it extracts tags related to the same content from among content that includes predefined tagging information in text form, so that search results can be shown based on a user's text query.

The arrangement unit 200 arranges the tags, in the tag extraction process, in a hierarchical structure having a semantic hierarchy. It arranges each concept in this hierarchical structure and stores inclusion relationships and synonym relationships in a database.

The anchor unit 300 constructs, in the process of arranging the tags, anchors that connect semantically close tags. These anchors connect semantically close tags horizontally, thereby linking adjacent concepts.

The classification unit 400 clusters content by assigning weights to the connection adjacency of tag concepts for the connections made through the anchor unit and classifying the concepts according to connection strength. The classification unit causes metadata, including tags or descriptions that linguistically explain the content of the data, to be stored in association with the original data of the content. In addition, the arrangement unit arranges each concept, in the tag extraction process, in a hierarchical structure having a semantic hierarchy and stores inclusion relationships and synonym relationships in the database.

The synonym-based content clustering system according to an embodiment of the present invention classifies the content elements of narrative stories based on synonyms and stores them in a database. To build an efficient database in this way, the following conditions must be met. When constructing a library of audio, video, interactive content, and the like, predefined tagging information in text form is needed in order to show search results based on a user's text query. To describe such non-linguistic data linguistically, metadata such as tags or descriptions that explain the content of the data must be stored in association with the original data. The linguistic tag and description information must express the content of the non-linguistic data in as much detail as possible, and appropriate classification criteria are needed for this purpose.
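By way of illustration only, the requirement that linguistic metadata be stored together with the original non-linguistic data can be sketched as a simple record type. This is a minimal sketch under stated assumptions: the names `ContentRecord`, `source_uri`, and `search_by_tag`, the file paths, and the sample records are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentRecord:
    """Original data reference plus the linguistic metadata linked to it."""
    content_id: str
    source_uri: str                     # hypothetical location of the original media
    tags: List[str] = field(default_factory=list)
    description: str = ""


def search_by_tag(records, query):
    """Return the IDs of records whose tags match the text query exactly."""
    wanted = query.strip().lower()
    return [r.content_id for r in records if wanted in {t.lower() for t in r.tags}]


# Invented sample library of non-linguistic content with text metadata.
library = [
    ContentRecord("clip-001", "videos/clip-001.mp4", ["dog", "park"],
                  "A dog runs across a park."),
    ContentRecord("clip-002", "audio/clip-002.wav", ["rain"],
                  "Rain falling on a roof."),
]
hits = search_by_tag(library, "Dog")
```

The sketch only matches a query against tags literally. A query such as "puppy" would return nothing here, which is the gap that the synonym and hierarchy information described later is meant to close.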

FIG. 2 is a diagram illustrating text information tagging in a content clustering system based on synonyms according to an embodiment of the present invention. FIG. 2 shows an example of text information tagging for a video. To build an efficient database, predefined tagging information in text form is needed, metadata must be stored in association with the original data, and appropriate classification criteria are needed.

However, even if linguistic metadata is written in as much detail as possible for these various classification conditions, a problem arises in that similar information can still be described by linguistic metadata in many different ways.

FIG. 3 shows examples of words that can be used alternatively or simultaneously when composing the metadata of video information. As shown in FIG. 3, an image displayed in a video can be expressed by several words that belong to different categories. When tagging differs depending on the context, it becomes difficult to express such hierarchical information adequately with an ordinary SQL query.

FIG. 4 is a diagram illustrating a situation in which expressions have the same meaning, that is, they are synonyms, but their notation in the database differs, and FIG. 5 is a diagram showing associations between words. FIG. 5 gives an example of a situation in which the concept of a word cannot be expressed in horizontal and vertical categories. The association between words should be expressed not as a simple vertical category but as quantitative values on several scales.

As shown in FIGS. 4 and 5, metadata can be written manually by a person or generated automatically by a machine, but either way, describing a specific situation in the video information linguistically requires choosing one or some of several possible expressions. This problem arises from the following causes.

The causes include: cases in which different expressions are semantically similar and can be used interchangeably; cases in which expressions sit semantically higher or lower than one another, so that they refer to the same object but the category of expression under which they can be interpreted differs; and cases in which several words can refer to the same object but the range of meaning that each word defines is different.

When similar data is defined with different linguistic metadata in this way, the basic search query of a general-purpose database such as SQL cannot, on its own, return comprehensive results covering all of the similar data that the person making the query intends. This is a major obstacle to semantically classifying a video database for various purposes and situations, and it means that a way of recognizing query data comprehensively is needed.
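By way of illustration only, the limitation of a basic SQL query and the effect of expanding the query with related expressions can be sketched with Python's standard `sqlite3` module. This is a minimal sketch: the table name `content_tag`, the helper `find_by_tags`, the `RELATED_TERMS` mapping, and the sample rows are assumptions made for this example, not part of the specification.

```python
import sqlite3

# In-memory database with one row per (content item, tag) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content_tag (content_id TEXT, tag TEXT)")
conn.executemany(
    "INSERT INTO content_tag VALUES (?, ?)",
    [("clip-1", "car"), ("clip-2", "automobile"),
     ("clip-3", "vehicle"), ("clip-4", "tree")],
)


def find_by_tags(conn, tags):
    """Return IDs of content carrying any of the given tags (exact match)."""
    placeholders = ",".join("?" for _ in tags)
    rows = conn.execute(
        "SELECT DISTINCT content_id FROM content_tag "
        f"WHERE tag IN ({placeholders}) ORDER BY content_id",
        list(tags),
    )
    return [row[0] for row in rows]


# A basic query only finds content tagged with the literal query word.
exact_hits = find_by_tags(conn, ["car"])

# Hypothetical related-expression table used to expand the query first.
RELATED_TERMS = {"car": ["car", "automobile", "vehicle"]}
expanded_hits = find_by_tags(conn, RELATED_TERMS["car"])
```

The basic query returns only the item tagged "car", while the expanded query also returns the items tagged with a synonym or a broader term. Where the expansion table comes from is exactly what the hierarchical structure and anchors of the embodiment address.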

Statistical inference such as frequency-based text analysis could, in theory, solve this problem very well if records of all language used in human activity were available. In practice, however, the text that can be collected is limited. Moreover, the particular language habits of subgroups of language users, which can be divided by region, age, field of interest, occupation, and so on, have a linguistic diversity that such general analysis can miss. This means that the language database used by statistical inference to predict the similarity of expressions cannot be the same in every situation.

FIG. 6 is a diagram for explaining the hierarchical arrangement structure and anchor connections of a content clustering system based on synonyms according to an embodiment of the present invention. The database may be configured as shown in FIG. 6. Each concept can be placed in a hierarchical structure having a semantic hierarchy, through which basic inclusion relationships and synonym relationships can be stored in the database. For adjacent concepts that are hard to define hierarchically, because they lie on different hierarchies but are semantically close, horizontally connected anchors can be constructed to link all possible adjacent concepts. By assigning a weight to each anchor connection, the connection strength can be specified, distinguishing tighter connections from vague connections with relatively little adjacency.
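By way of illustration only, the structure described with reference to FIG. 6 (a hierarchy for inclusion relationships, synonym sets, and weighted horizontal anchors) can be sketched as a small in-memory data structure. This is a minimal sketch under stated assumptions: the class name `ConceptGraph`, its method names, the use of weights between 0 and 1, and the sample concepts are hypothetical and are not taken from the specification or from FIG. 6.

```python
from collections import defaultdict


class ConceptGraph:
    """Hierarchy (inclusion), synonyms, and weighted lateral anchors."""

    def __init__(self):
        self.parent = {}                     # child concept -> parent concept
        self.synonyms = defaultdict(set)     # concept -> equal-level synonyms
        self.anchors = defaultdict(dict)     # concept -> {neighbour: weight}

    def add_child(self, parent, child):
        self.parent[child] = parent

    def add_synonym(self, a, b):
        self.synonyms[a].add(b)
        self.synonyms[b].add(a)

    def add_anchor(self, a, b, weight):
        # Anchors are symmetric links between concepts on different branches.
        self.anchors[a][b] = weight
        self.anchors[b][a] = weight

    def ancestors(self, concept):
        """Walk up the hierarchy and return the enclosing concepts."""
        chain = []
        while concept in self.parent:
            concept = self.parent[concept]
            chain.append(concept)
        return chain

    def related(self, concept, min_weight):
        """Neighbours whose anchor weight is at least min_weight."""
        links = self.anchors.get(concept, {})
        return sorted(n for n, w in links.items() if w >= min_weight)


# Invented sample concepts; the weights are assumed to lie between 0 and 1.
graph = ConceptGraph()
graph.add_child("vehicle", "car")
graph.add_child("person", "driver")
graph.add_synonym("car", "automobile")
graph.add_anchor("car", "driver", 0.8)   # tight connection
graph.add_anchor("car", "road", 0.3)     # vague connection

strong = graph.related("car", min_weight=0.5)
loose = graph.related("car", min_weight=0.2)
```

Raising or lowering `min_weight` is one way to separate tight connections from vague ones. The specification does not fix a particular weight scale or threshold, so both are left to the caller in this sketch.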

The initial anchor configuration may use a statistical method. Because the present approach provides visualized connectivity that people can understand, the anchors can then be refined through a simple human-supervised learning procedure. This makes it possible to build a concept classification machine quickly when little training data is available, or when appropriate adjacent concepts must be connected in a special situation rather than for general use.

As shown in FIG. 6, words or concepts having a hierarchical relationship are arranged in a tree structure, and horizontal anchors are created for lower-level concepts that a higher-category concept cannot fully cover, thereby providing organic connections. The system includes a display that provides an intuitive visual view of the synonym concepts of each concept.

Meanwhile, in the content clustering method based on synonyms according to the present embodiment, as shown in FIG. 7, tags related to the same content are first extracted from among content included in a predetermined population (701). Next, in the tag extraction process, the tags are arranged in a hierarchical structure having a semantic hierarchy (703). Next, in the process of arranging the tags, anchors connecting semantically close tags are constructed (705). Finally, weights are assigned to the connection adjacency of tag concepts for the connections made through the anchors, and the concepts are classified according to connection strength to cluster the content (707).
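By way of illustration only, steps 701 to 707 can be sketched as four small functions. This is a minimal sketch under stated assumptions and not the claimed method itself: the function names, the acyclic `parent_of` mapping, the candidate links with weights, the `min_strength` threshold, and the use of a union-find merge to group concepts joined by strong anchors are all choices made for this example.

```python
from collections import defaultdict


def extract_tags(contents):
    """Step 701: gather the tags attached to each content item."""
    return {cid: sorted(set(tags)) for cid, tags in contents.items()}


def place_tags(tags_by_content, parent_of):
    """Step 703: place every tag in the hierarchy as a root-to-tag path.

    Assumes parent_of (child -> parent) contains no cycles.
    """
    placed = {}
    for tags in tags_by_content.values():
        for tag in tags:
            path = [tag]
            while path[-1] in parent_of:
                path.append(parent_of[path[-1]])
            placed[tag] = list(reversed(path))
    return placed


def build_anchors(candidate_links, placed):
    """Step 705: keep lateral links whose two ends are both placed tags."""
    return [(a, b, w) for a, b, w in candidate_links if a in placed and b in placed]


def classify(tags_by_content, anchors, min_strength):
    """Step 707: merge concepts joined by strong anchors, then group content."""
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            x = root[x]
        return x

    for a, b, weight in anchors:
        if weight >= min_strength:
            ra, rb = find(a), find(b)
            if ra != rb:
                root[rb] = ra          # merge the two concept groups

    clusters = defaultdict(set)
    for cid, tags in tags_by_content.items():
        for tag in tags:
            clusters[find(tag)].add(cid)
    return sorted(sorted(members) for members in clusters.values())


# Invented sample data for the four steps.
contents = {"clip-1": ["car"], "clip-2": ["driver"],
            "clip-3": ["tree"], "clip-4": ["road"]}
parent_of = {"car": "vehicle", "driver": "person",
             "tree": "plant", "road": "place"}
links = [("car", "driver", 0.8), ("car", "road", 0.3)]

tags = extract_tags(contents)
placed = place_tags(tags, parent_of)
anchors = build_anchors(links, placed)
clusters = classify(tags, anchors, min_strength=0.5)
```

In this sample, "car" and "driver" sit on different branches of the hierarchy but are joined by a strong anchor, so their content items fall into one cluster, while the weak "car" to "road" anchor is not enough to merge them. A content item with several tags may appear in more than one cluster in this sketch.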

The expected effects of the synonym-based content clustering system according to an embodiment of the present invention are as follows. A synonym database based on human collection and revision can reduce the storage and computing resources that big data and machine learning require, can be modified flexibly according to the form of the collected content, and, above all, allows the database administrator to check intuitively the basis for related-data recommendations and to add improvements. For statistical analysis based on big data to work meaningfully, at least tens of gigabytes of text must be held and continuously updated. In most projects, however, that amount of text can exceed the content information in the videos that actually need to be analyzed, and it cannot be proven whether the collected text is a suitable sample for analyzing the video database. For these reasons, an extended search query method based on a synonym database is of great use in relatively small and medium-sized projects.

The synonym-based content clustering system according to an embodiment of the present invention can create a database for expressions of an equal level that have similar meanings, and can create a separately classified database for expressions that belong to different categories and can be arranged hierarchically.

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, and it will be apparent to those skilled in the art that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

100: tag extraction unit
200: arrangement unit
300: anchor unit
400: classification unit

Claims

A content clustering system based on synonyms, comprising:
a tag extraction unit for extracting tags related to the same content from among content included in a predetermined population;
an arrangement unit for arranging the tags, in the tag extraction process, in a hierarchical structure having a semantic hierarchy;
an anchor unit for constructing, in the process of arranging the tags, anchors connecting semantically close tags; and
a classification unit for clustering content by assigning weights to the connection adjacency of tag concepts for the connections made through the anchor unit and classifying the concepts according to connection strength,
wherein the classification unit, for adjacent concepts that exist on different hierarchies but are semantically close, constructs horizontally connected anchors, clusters the content of all possible adjacent concepts, and defines the connection strength of the anchor connections according to the connection adjacency.

The content clustering system based on synonyms according to claim 1,
wherein the tag extraction unit extracts tags related to the same content from among content that includes predefined tagging information in text form, in order to show search results based on a user's text query.

The content clustering system based on synonyms according to claim 1,
wherein the classification unit causes metadata, including tags or descriptions for linguistically explaining the content of the data, to be stored in association with the original data of the content.

The content clustering system based on synonyms according to claim 1,
wherein the arrangement unit arranges each concept, in the tag extraction process, in a hierarchical structure having a semantic hierarchy and stores inclusion relationships and synonym relationships in a database.

The content clustering system based on synonyms according to claim 1,
wherein the anchor unit, in the process of arranging the tags, constructs anchors that horizontally connect semantically close tags, thereby connecting adjacent concepts.

A content clustering method based on synonyms, comprising:
(a) extracting tags related to the same content from among content included in a predetermined population;
(b) arranging the tags, in the tag extraction process, in a hierarchical structure having a semantic hierarchy;
(c) constructing, in the process of arranging the tags, anchors connecting semantically close tags; and
(d) a classification step of clustering content by assigning weights to the connection adjacency of tag concepts for the connections made through the anchors and classifying the concepts according to connection strength,
wherein, in step (d), for adjacent concepts that exist on different hierarchies but are semantically close, horizontally connected anchors are constructed, the content of all possible adjacent concepts is clustered, and the connection strength of the anchor connections is defined according to the connection adjacency.