KR20190097496A

KR20190097496A - System and method for determining topic similarity of content

Info

Publication number: KR20190097496A
Application number: KR1020180016987A
Authority: KR
Inventors: 오효정; 건 김; 양동민
Original assignee: 전북대학교산학협력단
Priority date: 2018-02-12
Filing date: 2018-02-12
Publication date: 2019-08-21
Also published as: KR102080910B1

Abstract

Provided is a system for determining topic similarity, which comprises: a topic determination unit performing topic modeling on multiple types of content to determine one or more topics for each content; a classification system setting unit setting an integrated classification system of the content using the determined topics; and a topic histogram generation unit determining association values between topics included in the integrated classification system and each content, and visualizing the association values to generate a topic histogram of each content.

Description

SYSTEM AND METHOD FOR DETERMINING TOPIC SIMILARITY OF CONTENT

The present invention relates to a technique for determining topic similarity between pieces of content.

Content such as documents and photographs has recently proliferated, and each piece of content has one or more topics. Existing classification systems classify the content within their scope of control by topic and let users search, through the established classification system, for content covering the topics they want. However, when unclassified content exists, users' searches run into problems.

For example, the National Science & Technology Information Service (NSIS) provides a "research field classification" for research project data, based on the "National Science and Technology Standard Classification System". The "research field classification" is a standard classification system for managing national R&D projects, established to identify the topics of research project data.

However, less than half of all research project data has a correctly registered research field classification, and some data has no research field classification registered at all. Such an incomplete classification system cannot help users identify the topic of each piece of research project data or search for research project data related to the academic fields they are interested in.

In addition, because the research field classification classifies documents according to a preset classification system, it cannot determine topic similarity with content classified under a different classification system.

An object of the present invention is to provide a system and method for generating a topic histogram for each piece of content by using the topics extracted from that content.

Another object of the present invention is to provide a system and method for generating topic histograms that make it possible to determine topic similarity between pieces of content of different types.

Another object of the present invention is to provide a system and method for generating topic histograms that make it possible to determine topic similarity between pieces of content written in different languages.

A topic similarity determination system according to an embodiment of the present invention includes: a topic determination unit that performs topic modeling on a plurality of pieces of content to determine one or more topics for each piece of content; a classification system setting unit that sets an integrated classification system for the content using the determined topics; and a topic histogram generation unit that determines association values between the topics included in the integrated classification system and each piece of content, and visualizes the association values to generate a topic histogram of each piece of content.

When the plurality of pieces of content are documents written in different languages, the topic determination unit translates the documents into a specific language using a linked translation engine and performs topic modeling on the translated documents.

The type of the content is one of document, photo, or media.

When the content type is document, the topic determination unit first applies a morphological analyzer to the content and performs topic modeling on the result. When the content type is photo, it performs topic modeling on the content to generate a caption for the content. When the content type is media, it first applies a speech-to-text (STT) algorithm to the content to transcribe the speech portion into a document and performs topic modeling on the transcribed content.

The classification system setting unit classifies the determined topics hierarchically and sets the structure of the classified levels as the integrated classification system.

When the content has preset category information, the classification system setting unit additionally takes the category information into account when setting the integrated classification system.

The topic histogram generation unit determines the association values using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each piece of content.

The topic histogram generation unit compares each association value with preset ranges and replaces the association value with the visual information corresponding to the range that contains it.

The topic similarity determination system according to an embodiment of the present invention further includes a topic similarity determination unit that compares the topic histograms of the pieces of content to determine the topic similarity between them.

A method of generating a topic histogram according to an embodiment of the present invention includes: performing topic modeling on a plurality of pieces of content to determine one or more topics for each piece of content; classifying the determined topics hierarchically and setting an integrated classification system for the content using the structure of the classified levels; determining association values between the topics included in the integrated classification system and each piece of content; and visualizing the association values to generate a topic histogram of each piece of content.

In performing topic modeling on the plurality of pieces of content, when the pieces of content are documents written in different languages, the documents are translated into a specific language using a linked translation engine, and topic modeling is performed on the translated documents.

In performing topic modeling on the plurality of pieces of content, when the content type is document, a morphological analyzer is first applied to the content and topic modeling is performed on the result. When the content type is photo, topic modeling is performed on the content to generate a caption for the content. When the content type is media, a speech-to-text (STT) algorithm is first applied to the content to transcribe the speech portion into a document, and topic modeling is performed on the transcribed content.

In setting the integrated classification system, when the content has preset category information, the category information is additionally taken into account.

In determining the association values, the association values are determined using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each piece of content.

In generating the topic histogram, each association value is compared with preset ranges and replaced with the visual information corresponding to the range that contains it.

The method of generating a topic histogram according to an embodiment of the present invention further includes comparing the topic histograms of the pieces of content to determine the topic similarity between them.

According to the present invention, an integrated classification system is set using the topics extracted from a plurality of pieces of content, a topic histogram is generated for each piece of content using that integrated classification system, and the similarity between pieces of content is determined from the histograms. Topic similarity can therefore be determined even between pieces of content that were originally classified by different criteria.

In addition, according to the present invention, topic similarity can be determined between pieces of content of different types and written in different languages.

FIG. 1 is a diagram illustrating an environment in which a topic similarity determination system according to an embodiment is implemented.
FIG. 2 is a structural diagram of a topic similarity determination system according to an embodiment.
FIG. 3 is a diagram illustrating how a classification system setting unit according to an embodiment hierarchically classifies the determined topics.
FIG. 4 is a diagram illustrating how the structure of the classified levels is set as the integrated classification system.
FIG. 5 is a diagram illustrating how the topic histogram generation unit visually represents association values.
FIG. 6 is a diagram illustrating how the topic similarity determination unit determines topic similarity between pieces of content using topic histograms.
FIG. 7 is a diagram illustrating how a topic similarity determination system according to an embodiment generates topic histograms and determines topic similarity between pieces of content.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

FIG. 1 is a diagram illustrating an environment in which a topic similarity determination system according to an embodiment is implemented.

Referring to FIG. 1, the topic similarity determination system 100 determines the topics of each of a plurality of input pieces of content and uses them to set a classification system for the content.

There is no restriction on the form of the content input to the topic similarity determination system 100. That is, as long as the topic similarity determination system 100 can extract topics from it, the form of the input content (for example, document, photo, or media) and the language in which the content is written (for example, Korean, Japanese, Chinese, or English) do not matter.

The topic similarity determination system 100 visualizes the association values between the topics included in the classification system and each piece of content to generate and provide a topic histogram for each piece of content.

Because the topic histograms generated by the topic similarity determination system 100 are built on an integrated classification system set using the topics extracted from each piece of content, the topic similarity between pieces of content can be determined by comparing their topic histograms.

FIG. 2 is a structural diagram of a topic similarity determination system according to an embodiment; FIG. 3 illustrates how a classification system setting unit according to an embodiment hierarchically classifies the determined topics; FIG. 4 illustrates how the structure of the classified levels is set as the integrated classification system; FIG. 5 illustrates how the topic histogram generation unit visually represents association values; and FIG. 6 illustrates how the topic similarity determination unit determines topic similarity between pieces of content using topic histograms.

Referring to FIG. 2, the topic similarity determination system 100 includes a topic determination unit 110, a classification system setting unit 120, a topic histogram generation unit 130, and a topic similarity determination unit 140.

The topic determination unit 110 performs topic modeling on a plurality of pieces of content to determine one or more topics for each piece of content.

Specifically, the topic determination unit 110 may perform topic modeling on the plurality of pieces of content using a topic modeling technique such as pLSA (Probabilistic Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), DMR (Dirichlet Multinomial Regression), or HDP (Hierarchical Dirichlet Process). Because extracting topics from content with a topic modeling technique is a known technique, a detailed description is omitted in this specification.

When performing topic modeling, the topic determination unit 110 may additionally perform different processes depending on the type of content.

First, the topic determination unit 110 determines the type of the input content.

In one embodiment, the topic determination unit 110 may determine the content type based on the file extension of the input content data. For example, the topic determination unit 110 may determine the content type to be document when the extension is hwp or doc, photo when the extension is jpg or bmp, and media when the extension is mpeg or avi.
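The extension-based type decision described above can be sketched as a simple lookup. This is only an illustrative sketch: the function name `content_type`, the table name `EXTENSION_TYPES`, and the `"unknown"` fallback are assumptions, and only the six extensions named in the text are listed.

```python
from pathlib import Path

# Extension-to-type table built from the examples in the text.
# A real deployment would presumably need a longer list (assumption).
EXTENSION_TYPES = {
    "hwp": "document", "doc": "document",
    "jpg": "photo", "bmp": "photo",
    "mpeg": "media", "avi": "media",
}


def content_type(filename: str) -> str:
    """Return 'document', 'photo', 'media', or 'unknown' for a file name."""
    # Path.suffix returns the last extension including the dot, e.g. ".JPG".
    ext = Path(filename).suffix.lower().lstrip(".")
    return EXTENSION_TYPES.get(ext, "unknown")
```

The comparison is case-insensitive, and files without a recognized extension fall through to `"unknown"` rather than raising, so the caller can decide how to handle them.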

In another embodiment, the topic determination unit 110 may determine the type of the input content part by part. For example, when the input content is a document with a photo attached, the topic determination unit 110 may determine the type of the photo portion to be photo and the type of the text portion to be document.

The topic determination unit 110 then performs topic modeling in a different manner depending on the determined type.

If the determined type is document, the topic determination unit 110 may first apply a morphological analyzer to extract nouns and remove stopwords, and then apply a topic modeling technique to the result to determine one or more topics.

If the determined type is photo, the topic determination unit 110 may apply a topic modeling technique to the content to generate a caption for the content.

If the determined type is media, the topic determination unit 110 may first apply a speech-to-text (STT) algorithm to transcribe the speech portion of the media into a document, and then apply a topic modeling technique to the transcribed content to determine one or more topics.

In addition, when the pieces of content are documents written in different languages, the topic determination unit 110 may translate the documents into a specific language using the linked translation engine 200 and perform topic modeling on the translated documents.
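One way to picture the type-dependent processing is a table of preprocessing steps that run before a shared topic model. The sketch below is hypothetical throughout: `build_topic_determiner`, `toy_noun_filter`, `toy_stt`, and `toy_topic_model` are stand-ins invented for illustration, not a morphological analyzer, STT engine, or topic model from the source. The photo branch is left out because the text describes it as producing a caption directly; a translation step could be added to a step list in the same way.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]


def build_topic_determiner(steps_by_type: Dict[str, List[Step]],
                           topic_model: Callable[[Any], List[str]]):
    """Return a function that runs the per-type steps, then the topic model."""
    def determine(content_type: str, payload: Any) -> List[str]:
        if content_type not in steps_by_type:
            raise ValueError(f"unsupported content type: {content_type}")
        for step in steps_by_type[content_type]:
            payload = step(payload)
        return topic_model(payload)
    return determine


# --- Toy stand-ins (hypothetical; real components would replace these) ---
def toy_noun_filter(text: str) -> str:
    """Crude stand-in for noun extraction plus stopword removal."""
    stopwords = {"the", "of", "a"}
    return " ".join(w for w in text.lower().split() if w not in stopwords)


def toy_stt(media: dict) -> str:
    """Stand-in for speech-to-text: reads a ready-made transcript."""
    return media["transcript"]


def toy_topic_model(text: str) -> List[str]:
    """Stand-in for pLSA/LDA/etc.: returns up to two distinct words."""
    return sorted(set(text.split()))[:2]


determine = build_topic_determiner(
    {"document": [toy_noun_filter], "media": [toy_stt]},
    toy_topic_model,
)
```

Keeping the steps in a table means a new content type or an extra step only changes the table, not the dispatch logic.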

The classification system setting unit 120 sets an integrated classification system for the plurality of pieces of content using the determined topics.

Specifically, the classification system setting unit 120 classifies the determined topics hierarchically.

For example, when the topics determined for the content are "social science", "engineering", "politics", "electrical engineering", and "electronic engineering", then, referring to FIG. 3, the classification system setting unit 120 may place "social science" and "engineering" in a first classification, and "politics", "electrical engineering", and "electronic engineering" in a second classification (a subclassification of the first). To do so, the classification system setting unit 120 may use a preset topic classification system, or may use a hierarchization algorithm or a clustering algorithm.

Although in FIG. 3 every top-level topic is classified into a first classification and a second classification, each top-level topic may of course be classified differently. For example, when the determined topics are "social science", "engineering", "politics", "electrical engineering", "electronic engineering", "neural networks", and "signal processing", "social science" is classified down to the second classification, whereas "engineering" may be classified down to a third classification because "neural networks" and "signal processing" are subtopics of "electronic engineering".
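The example above can be reproduced with a preset taxonomy stored as a topic-to-parent map, where a topic's classification level is its depth in the tree. This is a minimal sketch of the "preset topic classification system" option only; the `PARENT` table and the function names are assumptions made for illustration, and the map is assumed to be acyclic.

```python
from typing import Dict, List, Optional

# Hypothetical preset taxonomy: topic -> parent topic (None = top level).
PARENT: Dict[str, Optional[str]] = {
    "social science": None,
    "engineering": None,
    "politics": "social science",
    "electrical engineering": "engineering",
    "electronic engineering": "engineering",
    "neural networks": "electronic engineering",
    "signal processing": "electronic engineering",
}


def level_of(topic: str, parent: Dict[str, Optional[str]] = PARENT) -> int:
    """Depth of a topic in the taxonomy; top-level topics are level 1."""
    level = 1
    while parent[topic] is not None:
        topic = parent[topic]
        level += 1
    return level


def classify_by_level(topics: List[str],
                      parent: Dict[str, Optional[str]] = PARENT
                      ) -> Dict[int, List[str]]:
    """Group topics into first, second, ... classifications by depth."""
    levels: Dict[int, List[str]] = {}
    for topic in topics:
        levels.setdefault(level_of(topic, parent), []).append(topic)
    return levels
```

With the seven topics from the text, "social science" has descendants only at level 2, while "engineering" reaches level 3 through "electronic engineering", matching the uneven depth the paragraph describes.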

The classification system setting unit 120 sets the structure of the classified levels as the integrated classification system for the content.

For example, referring to FIG. 4, the classification system setting unit 120 may arrange the one or more topics of the first classification along a first axis and the one or more topics of the second classification along a second axis to generate the structure of the classified levels, and may set the generated structure as the integrated classification system.
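The two-axis arrangement can be pictured as a grid whose rows are first-classification topics and whose columns are second-classification topics. The sketch below only builds that shape and an empty per-content histogram over it. The names `build_classification_grid` and `empty_histogram` are hypothetical, and how each cell's value is later filled in is not something this sketch assumes.

```python
from typing import Dict, List, Sequence, Tuple


def build_classification_grid(first_level: Sequence[str],
                              second_level: Sequence[str]
                              ) -> Tuple[List[str], List[str],
                                         Dict[Tuple[str, str], Tuple[int, int]]]:
    """Lay first-level topics on axis 1 and second-level topics on axis 2."""
    rows = list(first_level)
    cols = list(second_level)
    # Map each (first-level, second-level) topic pair to its grid position.
    index = {(r, c): (i, j)
             for i, r in enumerate(rows)
             for j, c in enumerate(cols)}
    return rows, cols, index


def empty_histogram(rows: Sequence[str], cols: Sequence[str]) -> List[List[float]]:
    """One zero-initialized cell per grid position; each row is a new list."""
    return [[0.0] * len(cols) for _ in rows]
```

Because every piece of content shares the same grid, two histograms can be compared cell by cell regardless of how each piece of content was originally categorized.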

In addition, when the content has preset category information, the classification system setting unit 120 may additionally take that category information into account when setting the integrated classification system.

For example, when the content is a document of research project information, the classification system setting unit 120 may determine a topic from the research field code set in the document and additionally take that topic into account when setting the integrated classification system.

The topic histogram generation unit 130 determines the association values between the topics included in the integrated classification system and each piece of content.

Specifically, the topic histogram generation unit 130 determines the association values using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each piece of content.

For example, when the content is a document, the topic histogram generation unit 130 may determine the similarity between the word vector of a topic included in the integrated classification system and the word vector of a keyword extracted from the content, and use the determined vector similarity as the association value between the topic and the content. To this end, the topic histogram generation unit 130 may use a Word2Vec- or GloVe-based vectorization algorithm together with a cosine similarity algorithm or a probabilistic algorithm.
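The cosine-similarity option can be sketched with plain lists standing in for word vectors. The text does not say how similarities for several keywords are combined into one association value, so averaging them in `association_value` is an assumption. Both function names are illustrative, and real vectors would come from a trained embedding model rather than being written by hand.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def association_value(topic_vec: Sequence[float],
                      keyword_vecs: List[Sequence[float]]) -> float:
    """Mean cosine similarity between a topic and a content's keywords.

    Averaging is an assumption; the source leaves the aggregation open.
    """
    if not keyword_vecs:
        return 0.0
    total = sum(cosine_similarity(topic_vec, k) for k in keyword_vecs)
    return total / len(keyword_vecs)
```

Other aggregations, such as taking the maximum similarity over keywords, would fit the description equally well.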

If the content is a photo, the topic histogram generation unit 130 may perform the same process on the word vectors of the annotation generated by topic modeling of the content. If the content is media, the topic histogram generation unit 130 may obtain word vectors for the transcribed speech portion of the media and perform the same process.

The topic histogram generation unit 130 also visualizes the determined association values to generate a topic histogram of each piece of content.

Specifically, the topic histogram generation unit 130 compares each association value with preset ranges and replaces the association value with the visual information corresponding to the range that contains it.

For example, referring to FIG. 5, the topic histogram generation unit 130 may associate colors of different saturation with the association values and represent each association value visually by replacing it with the color corresponding to the determined value.
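The range comparison and saturation mapping can be sketched with `bisect` and `colorsys` from the Python standard library. The threshold values, the five saturation levels, and the fixed hue are all assumptions chosen for illustration; the source only states that preset ranges map to colors of different saturation.

```python
import bisect
import colorsys
from typing import Sequence

# Hypothetical range boundaries and one saturation level per range.
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
SATURATIONS = [0.0, 0.25, 0.5, 0.75, 1.0]


def bin_index(value: float, thresholds: Sequence[float] = THRESHOLDS) -> int:
    """Index of the preset range that contains the association value."""
    return bisect.bisect_right(thresholds, value)


def to_color(value: float, hue: float = 0.6) -> str:
    """Hex color whose saturation reflects the association value's range."""
    saturation = SATURATIONS[bin_index(value)]
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255),
                                        round(g * 255),
                                        round(b * 255))
```

Values in the lowest range render as white and values in the highest range as the fully saturated hue, so stronger associations appear as deeper colors in the histogram.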

The topic similarity determination unit 140 compares the topic histograms of the pieces of content to determine the topic similarity between them.

Specifically, the topic similarity determination unit 140 may determine topic similarity by comparing the topic histograms on the basis of the classification entries with high association.

Referring to FIG. 6, in an example where topic histograms consisting of a first classification of 33 topics denoted by letters and a second classification of 21 topics denoted by numbers are generated for two pieces of content, the topic similarity determination unit 140 may extract the three elements with the highest association values from each topic histogram and determine that pieces of content whose extracted elements lie at the same positions have high topic similarity.
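The top-three comparison can be sketched as a set comparison of cell positions. The text only says that content whose extracted positions coincide is judged highly similar; returning the fraction of shared positions in `topic_overlap` is an assumption that generalizes this to partial matches, and both function names are illustrative.

```python
from typing import List, Set, Tuple

Histogram = List[List[float]]


def top_cells(hist: Histogram, k: int = 3) -> Set[Tuple[int, int]]:
    """Positions of the k cells with the highest association values."""
    cells = [(value, i, j)
             for i, row in enumerate(hist)
             for j, value in enumerate(row)]
    # Highest value first; ties broken by position for determinism.
    cells.sort(key=lambda c: (-c[0], c[1], c[2]))
    return {(i, j) for _, i, j in cells[:k]}


def topic_overlap(h1: Histogram, h2: Histogram, k: int = 3) -> float:
    """Fraction of the top-k positions shared by two histograms (0.0 to 1.0)."""
    return len(top_cells(h1, k) & top_cells(h2, k)) / k
```

A value of 1.0 corresponds to the case in the text where all three extracted positions coincide. Only positions are compared, not the association values themselves, so two histograms can match even if their peak values differ.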

FIG. 7 is a diagram illustrating how a topic similarity determination system according to an embodiment generates topic histograms and determines topic similarity between pieces of content.

Referring to FIG. 7, the topic similarity determination system 100 receives a plurality of pieces of content (S100) and performs topic modeling on the received content to determine one or more topics for each piece of content (S110).

For example, the topic similarity determination system 100 may determine the topics of each piece of content using a topic modeling technique such as pLSA (Probabilistic Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), DMR (Dirichlet Multinomial Regression), or HDP (Hierarchical Dirichlet Process).

In this case, when the plurality of pieces of content are documents written in different languages, the topic similarity determination system 100 may translate the documents into a specific language using the linked translation engine 200 and perform topic modeling on the translated documents. For example, given documents written in Korean, Chinese, and English, the topic similarity determination system 100 may translate the Chinese and English documents into Korean and perform topic modeling on the translated documents.

The topic similarity determination system 100 classifies the determined topics hierarchically (S120).

In one embodiment, the topic similarity determination system 100 may store topics organized into classes in advance and classify the topics hierarchically by using the stored classes to determine the class corresponding to each determined topic.

In another embodiment, the topic similarity determination system 100 may layer or cluster the topics using a hierarchization algorithm or a clustering algorithm.

The topics classified hierarchically in step S120 are grouped into classes and divided into a first classification through an n-th classification, with each successive classification from the first to the n-th being a lower-level class.

토픽 유사도 결정 시스템(100)은 분류된 계층들의 구조를 이용하여 콘텐츠들의 통합 분류 체계를 설정한다(S130).The topic similarity determination system 100 sets up an integrated classification system of contents using the structure of the classified hierarchies (S130).

구체적으로, 토픽 유사도 결정 시스템(100)은 제1 분류 내지 제n 분류에 포함된 토픽들을 해당 분류를 통해 생성된 축에 배열하여 분류된 계층들의 구조를 생성하고, 생성된 구조를 콘텐츠들의 통합 분류 체계로 설정한다.Specifically, the topic similarity determination system 100 generates a structure of classified hierarchies by arranging the topics included in the first to n th classifications on an axis generated through the corresponding classification, and generates the structure of the classified hierarchies. Set up as a system.

In this case, when a content has preset category information, the topic similarity determination system 100 may set the integrated classification system by additionally considering that category information.
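One possible reading of arranging topics on an axis per classification is sketched below. The function `build_axes` and the sample paths are hypothetical, and preset category information is not modeled in this sketch.

```python
from typing import Dict, List


def build_axes(paths: List[List[str]]) -> Dict[int, List[str]]:
    """Collect the topics of each classification level onto one axis per level."""
    axes: Dict[int, List[str]] = {}
    for path in paths:
        # Level 1 is the first classification, level 2 the second, and so on.
        for level, topic in enumerate(path, start=1):
            axis = axes.setdefault(level, [])
            if topic not in axis:  # keep each topic once, in first-seen order
                axis.append(topic)
    return axes


# Hypothetical hierarchy paths for topics found across all contents.
topic_paths = [
    ["science", "biology", "genetics"],
    ["science", "physics"],
    ["arts", "music"],
]
axes = build_axes(topic_paths)
```

Because the axes are built once from the topics of all contents, every content is later measured against the same ordered set of topics.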

The topic similarity determination system 100 determines association values between the topics included in the integrated classification system and each content (S140).

Specifically, the topic similarity determination system 100 determines the association values using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each content. That is, the topic similarity determination system 100 may determine the vector association between the word vectors of the topics and the word vectors of the keywords extracted from a content, and set the determined association as the association value between the topic and the content.
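The description does not name a specific vector association measure. As a sketch under that caveat, the example below assumes cosine similarity between word vectors and takes the best-matching keyword as the association value of a topic; both choices, and the names `cosine` and `association_values`, are assumptions rather than the disclosed method.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def association_values(
    topic_vectors: Dict[str, Sequence[float]],
    keyword_vectors: List[Sequence[float]],
) -> Dict[str, float]:
    """Association of each topic with one content, given the content's keyword vectors."""
    return {
        topic: max((cosine(vec, kw) for kw in keyword_vectors), default=0.0)
        for topic, vec in topic_vectors.items()
    }


# Toy two-dimensional word vectors, for illustration only.
topic_vectors = {"biology": [1.0, 0.0], "physics": [0.0, 1.0]}
keyword_vectors = [[2.0, 0.0], [1.0, 1.0]]
values = association_values(topic_vectors, keyword_vectors)
```

Averaging over keywords instead of taking the maximum would be an equally plausible aggregation; the maximum simply rewards a single strongly related keyword.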

The topic similarity determination system 100 visualizes the association values to generate a topic histogram of each content (S150).

Specifically, the topic similarity determination system 100 compares each association value with preset ranges and converts the association value into the visual information corresponding to the range that contains it.
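A minimal sketch of this range comparison follows. The range boundaries and the shade names in `SHADES` are hypothetical values chosen for illustration, as is the function name `to_visual`.

```python
from typing import List, Tuple

# Hypothetical preset ranges as (inclusive lower bound, visual information),
# listed from the highest range to the lowest.
SHADES: List[Tuple[float, str]] = [
    (0.75, "dark"),
    (0.5, "medium"),
    (0.25, "light"),
    (0.0, "blank"),
]


def to_visual(value: float, shades: List[Tuple[float, str]] = SHADES) -> str:
    """Map an association value to the visual information of its range."""
    for lower, shade in shades:
        if value >= lower:
            return shade
    # Values below every lower bound fall back to the lowest range.
    return shades[-1][1]


histogram = [to_visual(v) for v in [0.9, 0.5, 0.1, 0.3]]
```

Applying `to_visual` to the association values of one content, in the topic order of the integrated classification system, yields one row of visual cells for that content.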

The topic similarity determination system 100 determines the topic similarity between the contents by comparing the similarity between the topic histograms of the respective contents (S160).

Because the topic histogram of each content is generated using the same integrated classification system for the plurality of contents, the same criteria apply to all contents. Accordingly, comparing the topic histograms of the respective contents makes it possible to determine the topic similarity between contents created in different forms and languages.
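The description leaves the histogram comparison measure open. The sketch below assumes a normalized histogram intersection (sum of per-topic minima divided by sum of per-topic maxima) over non-negative association values; the measure, the function name `histogram_similarity`, and the sample data are assumptions.

```python
from typing import Dict


def histogram_similarity(h1: Dict[str, float], h2: Dict[str, float]) -> float:
    """Normalized intersection of two topic histograms, in the range [0, 1].

    Assumes non-negative association values keyed by the same topic names.
    """
    topics = set(h1) | set(h2)
    overlap = sum(min(h1.get(t, 0.0), h2.get(t, 0.0)) for t in topics)
    total = sum(max(h1.get(t, 0.0), h2.get(t, 0.0)) for t in topics)
    return overlap / total if total > 0 else 0.0


# Hypothetical histograms of two contents of different type and language,
# both expressed over the same integrated classification system.
korean_document = {"biology": 0.8, "physics": 0.2, "music": 0.0}
english_photo = {"biology": 0.6, "physics": 0.2, "music": 0.2}
similarity = histogram_similarity(korean_document, english_photo)
```

The comparison is meaningful only because both histograms use the same topic keys, which is what the shared integrated classification system provides.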

According to the present invention, because topics extracted from each content are used, an existing classification system can be supplemented.

In addition, according to the present invention, an integrated classification system is set using the topics extracted from a plurality of contents, a topic histogram of each content is generated using that integrated classification system, and the similarity between contents is determined from the histograms, so that topic similarity can be determined even between contents that were initially classified under different criteria.

The embodiments of the present invention described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes functions corresponding to the configurations of the embodiments, or through a recording medium on which the program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A topic similarity determination system, comprising:
a topic determination unit configured to perform topic modeling on a plurality of contents to determine one or more topics for each content;
a classification system setting unit configured to set an integrated classification system of the contents using the determined topics; and
a topic histogram generation unit configured to determine association values between the topics included in the integrated classification system and each content, and to visualize the association values to generate a topic histogram of each content.

The topic similarity determination system of claim 1, wherein the topic determination unit, when the plurality of contents are documents written in different languages, translates the documents into a specific language using a linked translation engine and performs topic modeling on the translated documents.

The topic similarity determination system of claim 1, wherein the type of the content is any one of a document, a photograph, or media.

The topic similarity determination system of claim 3, wherein the topic determination unit:
when the type of the content is a document, applies a stemmer to the content in advance and performs topic modeling on the result;
when the type of the content is a photograph, generates a caption of the content and performs topic modeling on the caption; and
when the type of the content is media, applies a speech-to-text (STT) algorithm to the content in advance to convert its speech portions into text and performs topic modeling on the resulting text.

The topic similarity determination system of claim 1, wherein the classification system setting unit hierarchically classifies the determined topics and sets the structure of the classified hierarchies as the integrated classification system.

The topic similarity determination system of claim 1, wherein the classification system setting unit, when a content has preset category information, sets the integrated classification system by additionally considering the category information.

The topic similarity determination system of claim 1, wherein the topic histogram generation unit determines the association values using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each content.

The topic similarity determination system of claim 1, wherein the topic histogram generation unit compares the association value with preset ranges and converts the association value into visual information corresponding to the range that contains the association value.

The topic similarity determination system of claim 1, further comprising a topic similarity determination unit configured to compare the similarity between the topic histograms of the respective contents to determine the topic similarity between the contents.

A topic histogram generation method, comprising:
performing topic modeling on a plurality of contents to determine one or more topics for each content;
hierarchically classifying the determined topics and setting an integrated classification system of the contents using the structure of the classified hierarchies;
determining association values between the topics included in the integrated classification system and each content; and
visualizing the association values to generate a topic histogram of each content.

The topic histogram generation method of claim 10, wherein performing topic modeling on the plurality of contents comprises, when the plurality of contents are documents written in different languages, translating the documents into a specific language using a linked translation engine and performing topic modeling on the translated documents.

The topic histogram generation method of claim 10, wherein the type of the content is any one of a document, a photograph, or media.

The topic histogram generation method of claim 12, wherein performing topic modeling on the plurality of contents comprises:
when the type of the content is a document, applying a stemmer to the content in advance and performing topic modeling on the result;
when the type of the content is a photograph, generating a caption of the content and performing topic modeling on the caption; and
when the type of the content is media, applying a speech-to-text (STT) algorithm to the content in advance to convert its speech portions into text and performing topic modeling on the resulting text.

The topic histogram generation method of claim 10, wherein setting the integrated classification system comprises, when a content has preset category information, setting the integrated classification system by additionally considering the category information.

The topic histogram generation method of claim 10, wherein determining the association values comprises determining the association values using the degree of association between the topics included in the integrated classification system and one or more keywords extracted from each content.

The topic histogram generation method of claim 10, wherein generating the topic histogram comprises comparing the association value with preset ranges and converting the association value into visual information corresponding to the range that contains the association value.

The topic histogram generation method of claim 10, further comprising comparing the similarity between the topic histograms of the respective contents to determine the topic similarity between the contents.