KR20140114496A

KR20140114496A - Method and system for providing summary of text document using word cloud

Info

Publication number: KR20140114496A
Application number: KR1020130027410A
Authority: KR
Inventors: 강동엽; 이수빈; 박준영
Original assignee: 한국과학기술원
Priority date: 2013-03-14
Filing date: 2013-03-14
Publication date: 2014-09-29
Also published as: KR101481253B1

Abstract

Disclosed are a word cloud-based method for summarizing a text document as an image, and an information providing system using the same. The document summarizing method may include generating a word cloud from the words in a given document, and providing the word cloud as an image in which the summary information of the document is visualized.

Description

TECHNICAL FIELD

The present invention relates to a word cloud-based method for summarizing a text document as an image, and to an information providing system using the same.

Embodiments of the present invention relate to techniques for summarizing text-based documents using a word cloud.

When searching for information on the Internet, it is difficult for a user to decide quickly which of a vast number of text-based documents (e.g., lecture materials, papers) to read.

Meta information such as a document's keywords or title reveals its approximate subject, but producing that information also requires human effort, and it remains hard to tell what the body actually covers. At present there is no visual aid for perceiving text-based documents, so predicting the body text on a small screen, particularly in a mobile environment, is bound to be rather difficult.

Recent Internet information retrieval models add automatic document summarization to Internet search. For example, when a user searches for information, the system may present the category of a document, extract part of the document, present the portion specified in its description, present a combination of sentences that contain the search term, or offer a preview function that extracts and presents only the text of the document.

Korean Patent Laid-Open Publication No. 10-2000-0050225 (published August 5, 2000; "Internet information retrieval system and method using automatic document summarization") discloses a technique that adds automatic document summarization to Internet information retrieval and presents a summary of key words and topic sentences.

As shown in FIG. 1, in addition to the body text, a text-based document 100 may include a title 101, a snippet 102, an introduction 103 or abstract 104, keywords (or tags) 105, and the like as information for grasping the content of the body.

One automatic summarization method summarizes a document using meta information such as the title 101 and the keywords 105. The resulting summary is concise, but it makes the body content hard to grasp, and producing the meta information requires some human effort.

When a document is summarized using the introduction 103 or the abstract 104, the summary is somewhat longer, which makes the body content easier to grasp, but writing an abstract requires considerable human effort.

Finally, when a document is summarized using a snippet 102 or the like, the summary can be produced automatically because an algorithm extracts the important parts of the body, but the performance of such extraction algorithms is low.

Provided are a document summarizing method that can summarize a text-based document as an image well suited to human cognition, and an information providing system using the same.

Provided are a document summarizing method that can summarize a text-based document using a word cloud, and an information providing system using the same.

According to an embodiment of the present invention, a document summarizing method may include: generating, for a given document, a word cloud using words appearing in the document; and providing the word cloud as an image in which the summarized information of the document is visualized.

According to one aspect, generating the word cloud may include: calculating the frequency of each word appearing in the document; and determining the size of each word according to its frequency.

According to another aspect, generating the word cloud may include: calculating the frequency of each word appearing in the document; determining the size and placement of each word according to its frequency; and determining the font color of each word according to its size.

According to another aspect, generating the word cloud may include: calculating the frequency of each word appearing in the document; summing the word frequencies to obtain the total length of the document; and determining the image size of the word cloud using the total document length.

According to another aspect, the document summarizing method may further include: extracting N topics, each a set of similar words, from an entire collection made up of a plurality of documents; and generating, for each topic, a topic cloud, which is a word cloud for that topic, using its similar words.

According to another aspect, the topic cloud may be provided as a category for classifying documents.

According to another aspect, the font color of each topic cloud may be determined per topic.

According to another aspect, the document summarizing method may further include: extracting N topics, each a set of similar words, from an entire collection made up of a plurality of documents; and determining the font color of the topic cloud for each topic. In this case, generating the word cloud may apply, to each word of a single document, the font color of the corresponding topic among the N topics.

According to another aspect, extracting the N topics may use the Latent Dirichlet Allocation (LDA) algorithm to extract the N topics based on similarity between words.

According to another aspect, generating the word cloud may, for a single document, assign the same font color to words classified under the same topic, while determining the transparency of that font color according to the frequency of each word in the document.

According to an embodiment of the present invention, a document summarizing method may include: extracting N topics, each a set of similar words, from an entire collection made up of a plurality of individual documents; determining a font color for each topic; and generating, for each individual document, a word cloud using the words appearing in that document. In this case, generating the word cloud may, for each word of the individual document, determine the font size according to the word's frequency in that document and apply the font color of the corresponding topic among the N topics.

According to an embodiment of the present invention, an information providing system may include: a database that stores a plurality of documents and summary information for each document; and a providing unit that provides the summary information for at least one document, among the plurality of documents, that corresponds to a search term. Here, the summary information may be a word cloud composed of words appearing in the document, serving as an image in which the summarized information of the document is visualized.

According to an embodiment of the present invention, a text-based document is automatically summarized and visualized as a word cloud image, so that the body content of the document can be grasped more quickly and efficiently.

According to an embodiment of the present invention, summary information for text-based documents is arranged by clustering documents whose subjects are semantically related, which improves the quality of the search service and makes searching more convenient for the user.

According to an embodiment of the present invention, providing image-based document search results for text-based documents yields a cognitive and visual search environment and, particularly in a mobile environment, lets users understand the documents included in the search results more quickly.

FIG. 1 is an exemplary diagram illustrating the information used to summarize a text-based document.
FIG. 2 is a flowchart of a document summarizing method using a word cloud according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating the process of generating a word cloud as the summarized information of a document according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating word clouds for topics extracted from an entire document collection according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram of search results in which a word cloud is provided as the summary information of a document according to an embodiment of the present invention.
FIG. 6 is a block diagram illustrating the internal configuration of an information providing system that provides a word cloud as the summary information of a document according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

The present embodiments summarize a text-based document and provide the resulting summary information. They can be applied to a search system that finds information matching given conditions among the material on the Internet.

In general, humans and machines differ in how well they process text and images. A machine processes text far faster than pixel-level images and extracts meaning from it more efficiently, whereas a human recognizes information more quickly and efficiently through visual images than through text read for its dictionary meaning.

Accordingly, the present embodiment proposes a technique that summarizes the body of a text-based document by turning it into an image using a word cloud.

In this specification, a "word cloud" may mean a kind of image in which the words appearing in a document are rendered graphically in a cloud shape.

FIG. 2 is a flowchart of a document summarizing method using a word cloud according to an embodiment of the present invention. Each step of the method may be performed by the information providing system described below.

In step S210, the information providing system may generate a word cloud for a given document using the words appearing in it. Here, a word cloud is a graphic image in which the size of each word is determined from its frequency in the document and the words are rearranged into a cloud shape.

In step S220, the information providing system may provide the word cloud generated for the document as the summary information of that document. In other words, the system may use the word cloud as an image in which the summarized information of the document is visualized. For example, when providing search results for a search term, it may provide the word cloud of each document included in the results as that document's summary information.

The process of generating a word cloud is described in detail below.

As one example, when a text-based document is input, the information providing system may convert its words into vector form based on the frequency with which each word appears in the document. For example, the document <apple banana apple fruit apple banana> may be converted into the vector {apple: 3, banana: 2, fruit: 1}, which gives the frequency of each word. Then, as shown in FIG. 3, the system may generate a word cloud 302 by displaying the words appearing in a document 301 at sizes scaled logarithmically. For example, if the frequency vector is {apple: 17, banana: 9, fruit: 3} and the log scaling value is assumed to be 2 (= LOG(2)), the word sizes may be scaled to [apple: 4, banana: 3, fruit: 1], which matches the integer part of the base-2 logarithm of each frequency. The system may place a word closer to the center of the cloud the higher its scaled value is. The angle at which each word is placed may be chosen at random within a certain range (e.g., ±20 degrees). The system may also determine the font color of each word according to its scaled value. For example, the word with the largest scaled value may be red, followed by orange, brown, green, blue, and so on, so that the font color differs according to word size. As described above, in generating a word cloud the present embodiment determines the size of each word according to its frequency in the document, and may further determine at least one of the placement and the font color of each word.
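The frequency counting and logarithmic scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`word_frequencies`, `floor_log`, `scaled_sizes`, `colors_by_size`), the whitespace tokenization, and the use of the integer part of the logarithm are all assumptions, chosen so that the sketch reproduces the example values in the text (17, 9, 3 with base 2 giving 4, 3, 1). Assigning colors by rank of scaled size is likewise an assumption; only the color order comes from the text.

```python
from collections import Counter

# Color order named in the description, from the largest scaled size downward.
PALETTE = ["red", "orange", "brown", "green", "blue"]


def word_frequencies(text):
    """Count each whitespace-separated word (assumed tokenization)."""
    return Counter(text.split())


def floor_log(n, base):
    """Integer part of log_base(n) for n >= 1, computed without floating point."""
    if base < 2:
        raise ValueError("base must be at least 2")
    k = 0
    while n >= base:
        n //= base
        k += 1
    return k


def scaled_sizes(freqs, base=2):
    """Map each word's frequency to a log-scaled size."""
    return {word: floor_log(count, base) for word, count in freqs.items()}


def colors_by_size(sizes):
    """Give each distinct size a color, largest first; extra sizes reuse the last color."""
    distinct = sorted(set(sizes.values()), reverse=True)
    color_of = {s: PALETTE[min(i, len(PALETTE) - 1)] for i, s in enumerate(distinct)}
    return {word: color_of[size] for word, size in sizes.items()}


freqs = word_frequencies("apple banana apple fruit apple banana")
sizes = scaled_sizes({"apple": 17, "banana": 9, "fruit": 3}, base=2)
colors = colors_by_size(sizes)
```

Here `freqs` holds the frequency vector from the first example in the text, and `sizes` holds the scaled values from the second.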

As another example, the information providing system may determine the image size of the word cloud according to the overall length of the document. To this end, it may calculate the frequency of each word appearing in the document and sum all the frequencies to obtain the total document length. The system may then adjust the log scaling value of the word cloud in proportion to the total document length. For example, word sizes may be scaled with LOG(10) when the document is long and with LOG(2) when the document is short.
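A minimal sketch of this length-dependent choice follows. The text names only the two example values, LOG(10) for long documents and LOG(2) for short ones; the function names and the cutoff `long_threshold` are hypothetical, since no threshold is specified.

```python
def total_document_length(freqs):
    """Total document length as the sum of all word frequencies."""
    return sum(freqs.values())


def choose_log_base(total_length, long_threshold=1000):
    """Pick base 10 for long documents and base 2 for short ones.

    long_threshold is a hypothetical cutoff; the description does not give one.
    """
    return 10 if total_length >= long_threshold else 2


length = total_document_length({"apple": 3, "banana": 2, "fruit": 1})
base = choose_log_base(length)
```

A larger base compresses word sizes more strongly, which keeps the largest words of a long document from dominating the image.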

As another example, the information providing system may cluster and arrange semantically related documents across an entire collection made up of a plurality of documents. For instance, the system may use the Latent Dirichlet Allocation (LDA) algorithm to extract N topics, each a set of semantically similar words, based on similarity between words. The number N of topics to extract from the collection may be chosen arbitrarily. In the present embodiment, a topic means a set of words sharing a particular subject and, as shown in FIG. 4, a word cloud for each topic (hereinafter a "topic cloud") may be generated in the same manner as described above and expressed as an image. The system may determine the font color of each topic cloud per topic for the N topics. In other words, it may give words classified under the same topic the same color, so that topic clouds with different subjects consist of words in different font colors. The system may also use the topic clouds as categories for classifying documents and, by presenting the N topic clouds to the user, may let the user specify the subject of the documents to search for.
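The step from extracted topics to colored topic clouds can be sketched as below. Fitting the LDA model itself is left out. The `example_weights` table is a hypothetical stand-in for the per-topic word weights that an LDA implementation would produce, and the palette, the `top_k` cutoff, and the function name `build_topic_clouds` are assumptions made for illustration.

```python
# Hypothetical palette: one font color per topic, reused cyclically if N is larger.
TOPIC_COLORS = ["red", "blue", "green", "orange", "purple"]


def build_topic_clouds(topic_word_weights, top_k=3):
    """Return one topic cloud per topic: its color and its top_k highest-weight words."""
    clouds = []
    for idx, weights in enumerate(topic_word_weights):
        top_words = sorted(weights, key=weights.get, reverse=True)[:top_k]
        clouds.append({
            "topic": idx,
            "color": TOPIC_COLORS[idx % len(TOPIC_COLORS)],
            "words": top_words,
        })
    return clouds


# Stand-in for the topic-word weights of N = 2 topics (not real LDA output).
example_weights = [
    {"apple": 0.5, "banana": 0.3, "fruit": 0.15, "engine": 0.05},
    {"engine": 0.6, "wheel": 0.25, "road": 0.1, "apple": 0.05},
]
clouds = build_topic_clouds(example_weights, top_k=3)
```

Each entry of `clouds` could then be rendered as an image and offered to the user as a selectable category.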

As yet another example, the information providing system may extract N topics from an entire collection of documents using the LDA algorithm and then determine the font color of the topic cloud for each topic. For each word appearing in an individual document, the system may select the matching topic, or the most similar topic, among the N topics and set the word's font color to the color of the selected topic. That is, the system may generate the word cloud of a document by giving each of its words the same font color as the other words classified under the same topic. Furthermore, for an individual document, the system may apply the same font color to words classified under the same topic while determining the transparency of that color according to each word's frequency in the document. For example, among the words belonging to a topic assigned the color red, a word that appears 10 times may be given "dark red" and a word that appears twice may be given "light red".
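The per-word coloring can be sketched as follows. The topic-word weight table is a hypothetical stand-in for LDA output. Choosing the topic with the highest weight for a word, falling back to gray for words in no topic, and computing opacity as frequency divided by the maximum frequency are all assumptions; the text states only that transparency depends on frequency.

```python
def assign_topic(word, topic_word_weights):
    """Index of the topic that weights the word highest, or None if no topic contains it."""
    best, best_weight = None, 0.0
    for idx, weights in enumerate(topic_word_weights):
        weight = weights.get(word, 0.0)
        if weight > best_weight:
            best, best_weight = idx, weight
    return best


def color_document_words(freqs, topic_word_weights, topic_colors):
    """Map each word to (color, opacity); opacity 1.0 is the most frequent word."""
    if not freqs:
        return {}
    max_count = max(freqs.values())
    styled = {}
    for word, count in freqs.items():
        topic = assign_topic(word, topic_word_weights)
        color = topic_colors[topic] if topic is not None else "gray"
        styled[word] = (color, count / max_count)
    return styled


weights = [
    {"apple": 0.5, "banana": 0.3},   # topic 0, assumed color red
    {"engine": 0.6, "apple": 0.05},  # topic 1, assumed color blue
]
styled = color_document_words(
    {"apple": 10, "banana": 2, "engine": 5}, weights, ["red", "blue"]
)
```

With these inputs, "apple" (10 occurrences) is fully opaque red and "banana" (2 occurrences) is a faint red, mirroring the dark red and light red example in the text.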

As described above, the present embodiment generates, for a text-based document, a word cloud in which key words that appear more frequently are displayed in larger type and closer to the center, and then uses the generated word cloud as the visualized image of the document, so that the text-based document is summarized as an image.

Accordingly, as shown in FIG. 5, the information providing system provides, as search results, the summary information of at least one document that corresponds to the search term entered by the user among a plurality of documents, and may provide the word cloud 502 of each document in the results as that document's summary information. Moreover, the system may provide a document search function through image search for the search term, in which case the documents rendered as word cloud images may be managed as image search targets alongside actual images. Referring to FIG. 5, the word cloud 502 is included in the search results as the summary information of a document corresponding to a particular search term, and actual images 503 may be included in the results as well.

FIG. 6 is a block diagram illustrating the internal configuration of an information providing system that provides a word cloud as the summary information of a document according to an embodiment of the present invention.

As shown in FIG. 6, the information providing system according to an embodiment may include a generating unit 610 that generates summary information for each document, a providing unit 620 that provides the summary information of at least one document corresponding to a search term, and a database (not shown) that stores a plurality of documents and the summary information of each document.

The generating unit 610 generates a word cloud for a given document using the words appearing in it. For example, when a text-based document is input, the generating unit 610 may determine the size of each word based on the frequency with which it appears in the document, and then generate the word cloud of the document by displaying the words at logarithmically scaled sizes. The generating unit 610 may place a word closer to the center of the cloud the higher its scaled value is, and may choose the angle at which each word is placed at random within a certain range (e.g., ±20 degrees). The generating unit 610 may also apply different font colors according to the size of each word, which is determined from its frequency in the document. In addition, the generating unit 610 may calculate the frequency of each word appearing in the document, sum all the frequencies to obtain the total document length, and then determine the image size of the word cloud using the total document length.
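The placement rule above (larger scaled values nearer the center, tilt chosen at random within ±20 degrees) can be sketched as follows. The text does not specify a layout algorithm, so the ring-by-rank scheme, the parameter names, and the absence of overlap handling are simplifying assumptions.

```python
import math
import random


def layout_words(sizes, max_tilt_deg=20.0, ring_step=1.0, seed=None):
    """Place words so that larger scaled sizes sit closer to the center.

    The word ranked r (0 = largest) is put at distance r * ring_step from the
    center in a random direction and tilted by a random angle within
    +/- max_tilt_deg. Overlap between words is not handled in this sketch.
    """
    rng = random.Random(seed)
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    placements = []
    for rank, (word, size) in enumerate(ordered):
        radius = rank * ring_step
        direction = rng.uniform(0.0, 2.0 * math.pi)
        placements.append({
            "word": word,
            "size": size,
            "x": radius * math.cos(direction),
            "y": radius * math.sin(direction),
            "tilt_deg": rng.uniform(-max_tilt_deg, max_tilt_deg),
        })
    return placements


placements = layout_words({"apple": 4, "banana": 3, "fruit": 1}, seed=0)
```

A practical renderer would also need collision detection between word bounding boxes, which this sketch omits.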

Furthermore, the generating unit 610 may apply the LDA algorithm to an entire collection made up of a plurality of documents and extract N topics, each a set of semantically similar words, based on similarity between words. The generating unit 610 may generate a topic cloud for each topic using the similar words classified under that topic, and may determine the font color of each topic cloud per topic. The topic clouds may be used as categories for classifying documents, and the subject of the documents to search for may be specified by the topic cloud that the user selects. In particular, the generating unit 610 may generate the word cloud of an individual document by applying, to each word of that document, the font color of the corresponding topic among the N topics. The generating unit 610 may also apply the same font color to words classified under the same topic within an individual document, while determining the transparency of that color according to each word's frequency in the document.

The generating unit 610 that performs the above operations may consist solely of a word cloud generator 612, which generates a word cloud for a single document, or may additionally include a topic cloud generator 611, which generates topic clouds based on LDA in order to cluster the entire collection into documents related by subject.

The providing unit 620 may provide, through a database search, the word cloud generated for each document as the summary information of that document. In other words, the providing unit 620 may use the word cloud as an image in which the summarized information of the document is visualized. For example, when providing search results for a search term, it may provide the word cloud of each document included in the results as that document's summary information. The providing unit 620 may also provide a document search function through image search for the search term, in which case the documents rendered as word cloud images may be managed as image search targets alongside actual images.
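The relationship between the database and the providing unit can be sketched as below. This is only an illustration under assumptions: the class names `StoredDocument` and `Provider`, the in-memory list standing in for the database, and the case-insensitive substring match standing in for search-term matching are all hypothetical, since the text does not describe the retrieval method.

```python
from dataclasses import dataclass


@dataclass
class StoredDocument:
    doc_id: str
    text: str
    word_cloud: dict  # summary information, e.g. word -> scaled size


class Provider:
    """Returns the stored word cloud of every document matching a search term."""

    def __init__(self, database):
        # A plain list of StoredDocument stands in for the database.
        self.database = database

    def search(self, query):
        """Return (doc_id, word_cloud) pairs for documents whose text contains the query."""
        needle = query.lower()
        return [
            (doc.doc_id, doc.word_cloud)
            for doc in self.database
            if needle in doc.text.lower()
        ]


database = [
    StoredDocument("d1", "apple banana apple", {"apple": 1, "banana": 0}),
    StoredDocument("d2", "engine wheel road", {"engine": 0, "wheel": 0, "road": 0}),
]
results = Provider(database).search("Apple")
```

Because the word clouds are generated ahead of time and stored with the documents, the provider only looks them up at query time.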

As described above, according to an embodiment of the present invention, a text-based document is automatically summarized and visualized as a word cloud image, so that the body content of the document can be grasped more quickly and efficiently. In addition, summary information for text-based documents is arranged by clustering documents whose subjects are semantically related, which improves the quality of the search service and makes searching more convenient for the user. Furthermore, providing image-based document search results for text-based documents yields a cognitive and visual search environment and, particularly in a mobile environment, lets users understand the documents included in the search results more quickly.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

610: Generation unit
620: Providing unit

Claims

A document summarization method comprising:
generating, for a given document, a word cloud using words appearing in the document; and
providing the word cloud as an image that visualizes summary information of the document.

The method according to claim 1,
wherein the step of generating the word cloud comprises:
calculating a frequency of each word appearing in the document; and
determining a size of each word according to the frequency of the word.
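
The frequency and size steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `word_sizes`, the ASCII-only tokenizer, the point-size range, and the linear mapping from frequency to size are all assumptions made for illustration, since the claim only states that size depends on frequency.

```python
import re
from collections import Counter


def word_sizes(text, min_pt=10, max_pt=48):
    """Map each word to a font size that grows with its frequency.

    Assumptions (not from the patent): lowercase ASCII tokenization,
    a 10-48 pt range, and linear scaling between the two.
    """
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(words)  # frequency of each word in the document
    if not freq:
        return {}
    lo, hi = min(freq.values()), max(freq.values())
    span = hi - lo
    sizes = {}
    for word, count in freq.items():
        # Linear interpolation; if every word is equally frequent,
        # all words receive the maximum size.
        scale = (count - lo) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


sizes = word_sizes("cloud cloud cloud word word text")
print(sizes)  # {'cloud': 48, 'word': 29, 'text': 10}
```

A Korean-language document would need a morphological analyzer in place of the regular expression used here.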

The method according to claim 1,
wherein the step of generating the word cloud comprises:
calculating a frequency of each word appearing in the document;
determining a size and an arrangement position of each word according to the frequency of the word; and
determining a character color of each word according to the size of the word.

The method according to claim 1,
wherein the step of generating the word cloud comprises:
calculating a frequency of each word appearing in the document;
summing the frequencies of the words to obtain a total document length of the document; and
determining an image size of the word cloud using the total document length of the document.
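
As a hedged illustration of the image-size step, the sketch below sums the word frequencies to obtain the total document length and derives a width and height from it. The claim does not say how length maps to image size, so the rule used here (area proportional to length, fixed aspect ratio, clamped sides) and every parameter name are assumptions.

```python
import math
from collections import Counter


def cloud_image_size(freq, px_per_word=400, aspect=(4, 3),
                     min_side=120, max_side=1200):
    """Return (width, height) in pixels for a word cloud image.

    Assumed rule (not from the patent): image area is proportional to
    the total document length, with a fixed aspect ratio and clamping.
    """
    total_length = sum(freq.values())  # sum of all word frequencies
    area = total_length * px_per_word
    aspect_w, aspect_h = aspect
    height = math.sqrt(area * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def clamp(value):
        return int(max(min_side, min(max_side, round(value))))

    return clamp(width), clamp(height)


print(cloud_image_size(Counter({"cloud": 200, "word": 100})))  # (400, 300)
```

Under this rule a longer document yields a larger image, which lets the image size itself hint at document length in a list of search results.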

The method according to claim 1, further comprising:
extracting N topics, each being a set of similar words, from an entire document set composed of a plurality of documents; and
generating a topic cloud, which is a word cloud for each topic, using the similarity for each topic.

The method according to claim 5,
wherein the topic cloud is provided as a category for classifying the documents.

The method according to claim 5,
wherein a font color of the topic cloud is determined for each topic.

The method according to claim 1, further comprising:
extracting N topics, each being a set of similar words, from an entire document set composed of a plurality of documents; and
determining a font color of the topic cloud for each topic,
wherein the step of generating the word cloud comprises:
generating the word cloud by applying, to each word of one document, the color of the corresponding topic among the N topics.

The method according to claim 8,
wherein the step of extracting the N topics comprises:
extracting the N topics based on similarity between words using a Latent Dirichlet Allocation (LDA) algorithm.

The method according to claim 8,
wherein the step of generating the word cloud comprises:
for one document, assigning the same character color to words classified into the same topic, while determining the transparency of the assigned character color according to the frequency of each word in the document.
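
The topic-color and transparency rule recited above can be sketched as follows. The word-to-topic assignment is assumed to have been computed already (for example by an LDA step that is not shown), and the palette, the function name `word_rgba`, and the choice that more frequent words are more opaque are illustrative assumptions; the claim only states that transparency depends on frequency.

```python
from collections import Counter

# Hypothetical palette: one RGB font color per topic index.
TOPIC_COLORS = {0: (214, 39, 40), 1: (31, 119, 180)}


def word_rgba(freq, word_topic, topic_colors, min_alpha=0.3):
    """Give each word its topic's color and a frequency-based alpha.

    Assumption (not from the patent): alpha rises linearly from
    min_alpha to 1.0 as the word's frequency approaches the maximum.
    """
    if not freq:
        return {}
    max_count = max(freq.values())
    styled = {}
    for word, count in freq.items():
        r, g, b = topic_colors[word_topic[word]]  # same topic, same color
        alpha = min_alpha + (1.0 - min_alpha) * count / max_count
        styled[word] = (r, g, b, round(alpha, 2))
    return styled


freq = Counter({"cloud": 4, "word": 2, "search": 4})
word_topic = {"cloud": 0, "word": 0, "search": 1}  # assumed LDA output
for word, rgba in word_rgba(freq, word_topic, TOPIC_COLORS).items():
    print(word, rgba)
```

In this sketch "cloud" and "word" share a color because they belong to the same topic, and "word" is drawn more transparently because it occurs less often.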

A document summarization method comprising:
extracting N topics, each being a set of similar words, from an entire document set composed of a plurality of individual documents;
determining a font color for each topic; and
generating, for each of the individual documents, a word cloud using words appearing in the individual document,
wherein the step of generating the word cloud comprises:
determining, for each word of the individual document, a character size according to the frequency of the word in the individual document, and applying the color of the corresponding topic among the N topics.

An information providing system comprising:
a database storing a plurality of documents and summary information of each document; and
a providing unit for providing at least one document corresponding to a search word among the plurality of documents,
wherein the summary information is a word cloud composed of words appearing in the document, provided as an image that visualizes summarized information of the document.
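
The relationship between the database and the providing unit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StoredDocument` and `InformationProvider`, the in-memory list standing in for the database, the substring match, and the idea of storing the word cloud as an image path are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class StoredDocument:
    title: str
    text: str
    cloud_image_path: str  # summary information: pre-rendered word cloud


class InformationProvider:
    """Hypothetical providing unit backed by an in-memory 'database'."""

    def __init__(self, documents):
        self._documents = list(documents)

    def search(self, term):
        """Return (title, word cloud image) for documents matching term."""
        needle = term.lower()
        return [
            (doc.title, doc.cloud_image_path)
            for doc in self._documents
            if needle in doc.text.lower()  # naive substring match (assumed)
        ]


provider = InformationProvider([
    StoredDocument("Topic models", "LDA groups similar terms.", "clouds/lda.png"),
    StoredDocument("Word clouds", "A word cloud sizes each word.", "clouds/wc.png"),
])
print(provider.search("cloud"))  # [('Word clouds', 'clouds/wc.png')]
```

Returning the image path alongside each hit reflects the idea that the search result is presented through the word cloud rather than through a text snippet.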

The information providing system according to claim 12, further comprising:
a generation unit for generating the word cloud based on the frequency of words appearing in the document.

The information providing system according to claim 12, further comprising:
a generation unit for generating the word cloud based on the frequency of words appearing in the document,
wherein the generation unit determines the size of each word according to the frequency of the word.

The information providing system according to claim 12, further comprising:
a generation unit for generating the word cloud based on the frequency of words appearing in the document,
wherein the generation unit obtains a total document length of the document by summing the frequencies of the words appearing in the document, and determines an image size of the word cloud using the total document length of the document.

The information providing system according to claim 12, further comprising:
a generation unit for generating the word cloud based on the frequency of words appearing in the document,
wherein the generation unit extracts N topics, each being a set of similar words, from an entire document set composed of the plurality of documents, determines a font color of the topic cloud for each topic, and then generates the word cloud by applying, to each word of one document, the font color of the corresponding topic among the N topics.

The information providing system according to claim 16,
wherein the generation unit extracts the N topics based on similarity between words using a Latent Dirichlet Allocation (LDA) algorithm.

The information providing system according to claim 16,
wherein the generation unit, for one document, assigns the same character color to words classified into the same topic, while determining the transparency of the assigned character color according to the frequency of each word in the document.