KR20190110435A

KR20190110435A - Text mining method, text mining program and text mining apparatus

Info

Publication number: KR20190110435A
Application number: KR1020190023397A
Authority: KR
Inventors: 징롱 저우
Original assignee: 가부시키가이샤 스크린 홀딩스
Priority date: 2018-03-20
Filing date: 2019-02-27
Publication date: 2019-09-30
Also published as: TWI736860B; JP2019164592A; CN110309260A; KR102175658B1; TW201941083A; CN110309260B; JP7078429B2

Abstract

Provided is a text mining method comprising: extracting words from text data consisting of dated sentences; performing hierarchical cluster analysis on the extracted words for each analysis period; and displaying a screen including the result of the hierarchical cluster analysis. When an instruction designating a key word is input in a first screen including the analysis result, a second screen is displayed that shows the change over time of the cluster containing the key word by indicating cluster names based on the words included in that cluster. The change over time of the hierarchical cluster analysis result can thus be easily recognized.

Description

Text Mining Methods, Text Mining Programs, and Text Mining Devices {TEXT MINING METHOD, TEXT MINING PROGRAM AND TEXT MINING APPARATUS}

The present invention relates to text mining, and more particularly to a text mining method, a text mining program, and a text mining apparatus that display a screen including a result of hierarchical cluster analysis.

Recently, text mining, which analyzes free-form text data and derives useful information from the analysis results, has been attracting attention. In text mining, information is obtained, for example, by extracting words from the text data to be analyzed and analyzing the frequency of occurrence, the occurrence tendency, and the like of those words.

Consider a text mining apparatus that performs hierarchical cluster analysis on words extracted from text data and displays a screen including the result. In hierarchical cluster analysis, clusters containing words with high similarity are created hierarchically based on the similarity between words. In general, the result of hierarchical cluster analysis is presented to the analyst using a tree diagram (dendrogram) such as that shown in FIG. 10. Based on the result of the hierarchical cluster analysis, the analyst can grasp an overview of the text data.

Japanese Unexamined Patent Application Publication No. 2018-18118 describes a text mining apparatus that displays the result of hierarchical cluster analysis in the manner shown in FIG. 11. Given the number of clusters m and the maximum number of displayed data items n per cluster, the text mining apparatus described in that document obtains m clusters from the result of the hierarchical cluster analysis, displays the m clusters on the screen as cloud-shaped figures, and displays n or fewer words inside each cluster.

Some text data, such as maintenance work records and call center telephone response records, consists of dated sentences and accumulates over a long period of time. When hierarchical cluster analysis is performed on such text data, the text data is divided, for example, by month, and hierarchical cluster analysis is performed on the text data of each month. The result of the hierarchical cluster analysis is thereby obtained for each month.

In this case, the analyst selects a word of interest in the text data (hereinafter referred to as a key word) and wants to know, for example, which cluster contains the key word in each month, when the cluster containing the key word changes, and how the frequency of occurrence of the key word changes over time. With a conventional text mining apparatus, however, the user cannot easily recognize the change over time of the hierarchical cluster analysis result.

Accordingly, an object of the present invention is to provide a text mining method, a text mining program, and a text mining apparatus that allow a user to easily recognize the change over time of a hierarchical cluster analysis result.

A first aspect of the present invention is a text mining method for displaying a screen including an analysis result of text data, the method comprising:

a step of extracting words from text data consisting of dated sentences;

a step of performing hierarchical cluster analysis on the words for each analysis period; and

a step of displaying a screen including a result of the hierarchical cluster analysis,

wherein, when an instruction designating a key word is input in a first screen including the result, the step of displaying the screen displays a second screen showing a change over time of the cluster containing the key word.

A second aspect of the present invention is, in the first aspect of the present invention,

characterized in that the second screen shows, along a time axis, a cluster name based on the words included in the cluster.

A third aspect of the present invention is, in the second aspect of the present invention,

characterized in that the cluster name is formed by concatenating words included in the cluster, up to a predetermined number, in descending order of frequency of occurrence.

A fourth aspect of the present invention is, in the second aspect of the present invention,

characterized in that the second screen further includes, at a position corresponding to a time at which the cluster name changes, a mark whose appearance depends on the degree of change of the cluster name.

A fifth aspect of the present invention is, in the fourth aspect of the present invention,

characterized in that the mark is an arrow whose color depends on the degree of change of the cluster name.

A sixth aspect of the present invention is, in the second aspect of the present invention,

characterized in that, among the words constituting the cluster name, a word that has changed from the preceding cluster name is highlighted in the second screen.

A seventh aspect of the present invention is, in the second aspect of the present invention,

characterized in that the second screen further includes a graph showing, along the time axis, a change over time of the frequency of occurrence of the key word.

An eighth aspect of the present invention is, in the seventh aspect of the present invention,

characterized in that the second screen further includes a boundary line at a position corresponding to a time at which the cluster name changes, and the background of the graph has an appearance that differs at each boundary line.

A ninth aspect of the present invention is, in the second aspect of the present invention,

characterized in that, when the cluster name frequently changes to a large degree, the step of displaying the screen displays a screen including a warning message.

A tenth aspect of the present invention is a text mining program for displaying a screen including an analysis result of text data,

the program causing a computer, by a CPU using a memory, to execute a step of displaying a screen including a result of the hierarchical cluster analysis,

An eleventh aspect of the present invention is, in the tenth aspect of the present invention,

A twelfth aspect of the present invention is, in the eleventh aspect of the present invention,

A thirteenth aspect of the present invention is, in the eleventh aspect of the present invention,

A fourteenth aspect of the present invention is, in the thirteenth aspect of the present invention,

A fifteenth aspect of the present invention is, in the eleventh aspect of the present invention,

A sixteenth aspect of the present invention is, in the eleventh aspect of the present invention,

A seventeenth aspect of the present invention is, in the sixteenth aspect of the present invention,

An eighteenth aspect of the present invention is, in the eleventh aspect of the present invention,

A nineteenth aspect of the present invention is a text mining apparatus that displays a screen including an analysis result of text data, the apparatus comprising:

a word extraction unit that extracts words from text data consisting of dated sentences;

a clustering processing unit that performs hierarchical cluster analysis on the words for each analysis period; and

a screen display unit that displays a screen including a result of the hierarchical cluster analysis,

wherein, when an instruction designating a key word is input in a first screen including the result, the screen display unit displays a second screen showing a change over time of the cluster containing the key word.

A twentieth aspect of the present invention is, in the nineteenth aspect of the present invention,

According to the first, tenth, or nineteenth aspect, when an instruction designating a key word is input in the first screen including the result of the hierarchical cluster analysis, a second screen showing the change over time of the cluster containing the key word is displayed, so the user can easily recognize the change over time of the hierarchical cluster analysis result.

According to the second, eleventh, or twentieth aspect, a cluster name based on the words included in the cluster containing the key word is shown along a time axis, so the user can easily recognize the change over time of the cluster containing the key word.

According to the third or twelfth aspect, a cluster name formed by concatenating the frequently occurring words in the cluster containing the key word is shown along the time axis, so the user can easily recognize the change over time of the cluster containing the key word.

According to the fourth, fifth, thirteenth, or fourteenth aspect, a second screen is displayed that includes a mark whose appearance depends on the degree of change of the name of the cluster containing the key word (an arrow whose color depends on the degree of change), so the user can easily recognize the degree of change of the cluster containing the key word.

According to the sixth or fifteenth aspect, the changed words among the words constituting the name of the cluster containing the key word are highlighted, so the user can easily recognize how the frequently occurring words in the cluster containing the key word have changed.

According to the seventh or sixteenth aspect, a screen is displayed that includes, in addition to the change over time of the cluster containing the key word, a graph showing the change over time of the frequency of occurrence of the key word, so the user can easily recognize the change over time of the hierarchical cluster analysis result.

According to the eighth or seventeenth aspect, a boundary line is displayed at a position corresponding to a time at which the name of the cluster containing the key word changes, and the appearance of the graph background is changed at each boundary line, so the user can easily recognize when the cluster containing the key word changes.

According to the ninth or eighteenth aspect, a screen including a warning message is displayed when the name of the cluster containing the key word frequently changes to a large degree, so the user can recognize that the hierarchical cluster analysis is not working well.

FIG. 1 is a block diagram showing the configuration of a text mining apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a computer operating as the text mining apparatus shown in FIG. 1.
FIG. 3 is a flowchart showing the operation of the text mining apparatus shown in FIG. 1.
FIG. 4 is a diagram showing an example of a window, displayed by the text mining apparatus shown in FIG. 1, that shows the result of hierarchical cluster analysis.
FIG. 5 is a diagram showing an operation for designating a key word in the window shown in FIG. 4.
FIG. 6 is a diagram showing an example of a window, displayed by the text mining apparatus shown in FIG. 1, that shows the change over time of the analysis result.
FIG. 7 is a diagram showing an example of a display screen of the text mining apparatus shown in FIG. 1.
FIG. 8A is a diagram showing an example of the change over time of a hierarchical cluster analysis result.
FIG. 8B is a continuation of FIG. 8A.
FIG. 8C is a continuation of FIG. 8B.
FIG. 8D is a continuation of FIG. 8C.
FIG. 9 is a diagram showing a window displayed by the text mining apparatus shown in FIG. 1.
FIG. 10 is a diagram showing an example of a tree diagram.
FIG. 11 is a diagram showing how a hierarchical cluster analysis result is displayed in a conventional text mining apparatus.

Hereinafter, a text mining method, a text mining program, and a text mining apparatus according to an embodiment of the present invention will be described with reference to the drawings. The text mining method according to the present embodiment is typically executed using a computer. The text mining program according to the present embodiment is a program for carrying out the text mining method using a computer. The text mining apparatus according to the present embodiment is typically configured using a computer. A computer that executes the text mining program functions as the text mining apparatus.

FIG. 1 is a block diagram showing the configuration of the text mining apparatus according to the embodiment of the present invention. The text mining apparatus 10 shown in FIG. 1 includes an instruction input unit 11, a text data storage unit 12, a word extraction unit 13, a clustering processing unit 14, an analysis result storage unit 15, and a screen display unit 16. The text mining apparatus 10 performs hierarchical cluster analysis on the text data stored in the text data storage unit 12 and displays a screen including the analysis result.

The operation of the text mining apparatus 10 is outlined as follows. The instruction input unit 11 receives instructions from the user (the analyst of the text data). The text data storage unit 12 stores one or more pieces of free-form text data. The word extraction unit 13 extracts words from the text data stored in the text data storage unit 12 by performing morphological analysis on it. The clustering processing unit 14 performs hierarchical cluster analysis on the words extracted by the word extraction unit 13. The analysis result storage unit 15 stores the analysis result produced by the clustering processing unit 14. The screen display unit 16 displays a screen based on the analysis result stored in the analysis result storage unit 15.

The text data storage unit 12 stores text data that consists of dated sentences and has accumulated over a long period of time (for example, several years). Using the instruction input unit 11, the user inputs an instruction designating the text data to be analyzed, the analysis period, and the analysis interval, an instruction designating a key word, and so on. In accordance with the instructions from the user, the word extraction unit 13, the clustering processing unit 14, and the screen display unit 16 operate to display a screen including the result of performing hierarchical cluster analysis on the text data. In addition, in accordance with an instruction from the user, the screen display unit 16 displays a screen including the change over time of the hierarchical cluster analysis result.

FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining apparatus 10. The computer 20 shown in FIG. 2 includes a CPU 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27. For example, a DRAM is used for the main memory 22. For example, a hard disk or a solid state drive is used for the storage unit 23. The input unit 24 includes, for example, a keyboard 28 and a mouse 29. For example, a liquid crystal display is used for the display unit 25. The communication unit 26 is an interface circuit for wired or wireless communication. The recording medium reading unit 27 is an interface circuit for a recording medium 30 that stores programs and the like. For example, a non-transitory recording medium such as a CD-ROM, a DVD-ROM, or a USB memory is used as the recording medium 30.

When the computer 20 executes the text mining program 31, the storage unit 23 stores the text mining program 31 and the text data 32. The text mining program 31 and the text data 32 may be, for example, received from a server or another computer via the communication unit 26, or read from the recording medium 30 via the recording medium reading unit 27.

When the text mining program 31 is executed, the text mining program 31 and the text data 32 are copied to the main memory 22. Using the main memory 22 as working memory, the CPU 21 executes the text mining program 31 stored in the main memory 22, thereby performing processing to extract words from the text data 32, processing to perform hierarchical cluster analysis on the extracted words, processing to display a screen including the analysis result, and so on. At this time, the computer 20 functions as the text mining apparatus 10. The configuration of the computer 20 described above is merely an example, and the text mining apparatus 10 can be configured using any computer.

FIG. 3 is a flowchart showing the operation of the text mining apparatus 10. Before the operation shown in FIG. 3 is performed, the text data storage unit 12 stores one or more pieces of free-form text data that have accumulated over time. The text data consists of dated sentences (the date being, for example, a work date or a reception date), and is divided into a plurality of parts by date. The text mining apparatus 10 processes the text data designated by the user from among the text data stored in the text data storage unit 12.

In FIG. 3, the instruction input unit 11 first receives from the user an instruction designating the text data to be analyzed, the analysis period, and the analysis interval (step S101). The user enters this information into a dialog box (not shown) displayed on the screen using the input unit 24. The received instruction is output to each unit of the text mining apparatus 10.

Next, the word extraction unit 13 reads the designated text data from the text data storage unit 12 (step S102). The word extraction unit 13 then extracts words from the text data read in step S102 by performing morphological analysis on it (step S103). At this time, the word extraction unit 13 extracts only the words needed for the subsequent analysis.

Next, the clustering processing unit 14 performs hierarchical cluster analysis on the words extracted in step S103 (step S104). The clustering processing unit 14 then obtains the frequency of occurrence of the words extracted in step S103 (step S105). Next, the analysis result storage unit 15 stores the result of the hierarchical cluster analysis obtained in step S104 and the word frequencies obtained in step S105 (step S106).

The clustering processing unit 14 receives from the instruction input unit 11 the analysis period and the analysis interval designated by the user. The analysis period is the period, within the text data to be analyzed, over which hierarchical cluster analysis is actually performed. The analysis period is divided into a plurality of periods in units of the analysis interval. For example, when the analysis period runs from June 1, 2005 to May 31, 2015 and the analysis interval is one month, the 11-year analysis period is divided into 132 periods.
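The division of the analysis period into intervals can be illustrated with a minimal Python sketch. It assumes a one-month analysis interval aligned to calendar months; the function name `month_periods` is hypothetical and does not appear in the specification.

```python
from datetime import date


def month_periods(start: date, end: date) -> list[tuple[int, int]]:
    """Return every (year, month) from the month of `start` to the month of `end`."""
    periods = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        periods.append((year, month))
        # Advance to the next calendar month, rolling over at December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return periods


# Illustrative dates only (not taken from the specification).
periods = month_periods(date(2019, 11, 15), date(2020, 2, 3))
p = len(periods)  # number of periods after division
print(periods, p)
```

Each returned (year, month) pair identifies one period, and the length of the list is the number of periods over which the analysis is repeated.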

Let p be the number of periods after the division. In step S104, the clustering processing unit 14 performs hierarchical cluster analysis for each of the p periods. More specifically, for each of the p periods, the clustering processing unit 14 performs hierarchical cluster analysis on the words extracted in step S103, using those sentences in the text data read in step S102 whose dates fall within the period. The clustering processing unit 14 obtains the similarity between two words based on, for example, the distance between the two words in the text data 32 (how far apart the two words appear). Based on the obtained similarities between words, the clustering processing unit 14 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the centroid method, or Ward's method).
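As a rough illustration of this step for a single period, the following Python sketch derives a distance between words from how far apart they appear and then merges clusters with the shortest distance method (single linkage). It is a sketch under stated assumptions, not the method of the embodiment: taking the average token gap within a sentence as the distance, assigning a fixed `default` distance to words that never co-occur, and stopping once m clusters remain (instead of recording the full merge history of a dendrogram) are all assumptions, and the function names are hypothetical.

```python
from itertools import combinations


def word_distances(sentences, default=100.0):
    """Average token gap between two words over the sentences containing both.

    Pairs that never share a sentence get `default` (an assumed value).
    """
    gaps = {}
    for tokens in sentences:
        first_pos = {}
        for i, word in enumerate(tokens):
            first_pos.setdefault(word, i)  # position of first occurrence
        for a, b in combinations(sorted(first_pos), 2):
            gaps.setdefault((a, b), []).append(abs(first_pos[a] - first_pos[b]))
    vocab = sorted({w for tokens in sentences for w in tokens})
    dist = {}
    for a, b in combinations(vocab, 2):
        g = gaps.get((a, b))
        dist[(a, b)] = sum(g) / len(g) if g else default
    return vocab, dist


def single_linkage(vocab, dist, m):
    """Repeatedly merge the two closest clusters until m clusters remain."""
    clusters = [[w] for w in vocab]

    def cluster_distance(c1, c2):
        # Shortest distance method: distance of the closest pair of members.
        return min(dist[tuple(sorted((a, b)))] for a in c1 for b in c2)

    while len(clusters) > max(m, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Illustrative sentences from one period, already split into words.
sentences = [
    ["pump", "leak", "seal"],
    ["pump", "leak"],
    ["sensor", "error", "reset"],
    ["sensor", "error"],
]
vocab, dist = word_distances(sentences)
print(single_linkage(vocab, dist, m=2))
```

Replacing `min` with `max` in `cluster_distance` would give the longest distance method; the other listed methods need a different update rule.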

In step S105, the clustering processing unit 14 obtains the frequency of occurrence of the words for each of the p periods. Thus, p hierarchical cluster analysis results are obtained in step S104, and p frequencies of occurrence are obtained for each word in step S105. In step S106, the analysis result storage unit 15 stores, for each of the p periods, the result of the hierarchical cluster analysis and the word frequencies.
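The per-period frequency count of step S105 might be sketched as follows. The record layout (a date paired with the words extracted from one sentence), the monthly interval, and the name `frequencies_by_period` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def frequencies_by_period(records):
    """Count word occurrences per (year, month).

    `records` is an iterable of (date, list_of_words) pairs, one per sentence.
    """
    freq = defaultdict(Counter)
    for day, words in records:
        freq[(day.year, day.month)].update(words)
    return dict(freq)


# Illustrative records only.
records = [
    (date(2020, 1, 5), ["pump", "leak"]),
    (date(2020, 1, 20), ["pump"]),
    (date(2020, 2, 1), ["sensor"]),
]
freq = frequencies_by_period(records)
print(freq[(2020, 1)]["pump"])
```

Looking up one word across all periods then gives the series whose change over time the user wants to see for a key word.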

Next, the screen display unit 16 displays a screen including the result of the hierarchical cluster analysis stored in the analysis result storage unit 15 (step S107). FIG. 4 is a diagram showing an example of a window displayed in step S107. The window 41 shown in FIG. 4 contains the result of the hierarchical cluster analysis. Once the number of clusters is set for the result of the hierarchical cluster analysis, the words included in each cluster are determined. When displaying a screen including the result of the hierarchical cluster analysis, the text mining apparatus 10 displays the plurality of clusters in the manner shown in FIG. 4 instead of as a tree diagram.

The text mining apparatus 10 has, as operating parameters, the number of clusters and the maximum number of displayed data items per cluster. Hereinafter, the former is denoted m and the latter n. In the initial state, these values are set to predetermined initial values. The user may set them to arbitrary values using the instruction input unit 11. In the text mining apparatus 10, the words extracted in step S103 are classified into m clusters. Each cluster contains one or more words. In the window 41, the m clusters are displayed as cloud-shaped figures, and the words included in each cluster are displayed inside it. The number of words displayed inside each cluster is limited to n or fewer. For example, when n = 5 and a cluster contains 10 words, 5 words are displayed inside that cluster on the screen.
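The limit of n displayed words per cluster can be sketched in a few lines of Python. The passage above does not state which n words are shown, so choosing the most frequent ones (with ties broken alphabetically) is an assumption, as is the name `words_to_display`.

```python
def words_to_display(clusters, frequency, n):
    """For each cluster, keep at most n words, most frequent first (assumed rule)."""
    return [
        sorted(cluster, key=lambda w: (-frequency.get(w, 0), w))[:n]
        for cluster in clusters
    ]


# Illustrative data: one cluster of three words, n = 2.
frequency = {"pump": 9, "leak": 7, "seal": 3}
print(words_to_display([["seal", "pump", "leak"]], frequency, n=2))
```

With n = 5 and a cluster of 10 words, the same call returns 5 words for that cluster, matching the example in the text.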

Next, the instruction input unit 11 receives an instruction from the user (step S111). The text mining device 10 then determines whether the instruction received in step S111 is an instruction designating a key word (step S112). If yes, control of the text mining device 10 proceeds to step S121; if no, control proceeds to step S113.

In the latter case, the instruction received in step S111 is, for example, an instruction to move a window, an instruction to hide a window, or an instruction to close a window. The screen display unit 16 displays the updated screen in accordance with the instruction received in step S111 (step S113). Control of the text mining device 10 then returns to step S111.

When step S111 is executed, a screen including the result of the hierarchical cluster analysis is displayed. In the following, it is assumed that a screen including the window 41 shown in FIG. 4 is displayed when step S111 is executed. Clicking the button of the mouse 29 while the mouse cursor 43 is over an element in the display screen is referred to as "clicking the element", the cluster containing the key word is referred to as the "key word cluster", and the name given to the key word cluster is referred to as the "key word cluster name".

FIG. 5 shows the operation for designating a key word. The user clicks the word to be designated as the key word (here, "decomposition") in the window 41 (first click). A context menu 42 then appears in the display screen. The user clicks the item "To change of analysis results over time" in the context menu 42 (second click). By this operation, the word clicked first is designated as the key word.

If the determination in step S112 is yes, the screen display unit 16 reads the result of the hierarchical cluster analysis and the appearance frequency of the key word from the analysis result storage unit 15 (step S121). Next, based on the data read, the screen display unit 16 displays a screen showing the change over time of the hierarchical cluster analysis result (step S122).

FIG. 6 shows the window displayed in step S122, which presents the change over time of the analysis result. The window 51 shown in FIG. 6 is displayed when "decomposition" is designated as the key word in step S111. The window 51 is displayed overlapping the window 41 shown in FIG. 4, for example as shown in FIG. 7.

The window 51 includes a line graph 52 that shows, along a horizontally extending time axis, the change over time of the appearance frequency of the key word. As the appearance frequency of the key word, for example, the ratio of the number of occurrences of the key word to the total number of occurrences of all words included in the key word cluster is used. In response to an instruction from the user, the appearance frequency of the key word may be switched to the number of occurrences of the key word.
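The ratio described above can be written out as a short sketch. The function name `key_word_ratio` and the counts are hypothetical illustrations; returning 0.0 for an empty total is an assumption, since the description does not cover that case.

```python
def key_word_ratio(key_word, cluster_words, counts):
    """Share of the key word among all occurrences in its cluster.

    counts maps each word to its number of occurrences in one
    analysis period. Returning 0.0 when the cluster has no
    occurrences is an assumption made for this sketch.
    """
    total = sum(counts.get(word, 0) for word in cluster_words)
    if total == 0:
        return 0.0
    return counts.get(key_word, 0) / total


# Made-up counts for one analysis period.
counts = {"drive": 6, "decomposition": 4, "belt": 9}
ratio = key_word_ratio("decomposition", ["drive", "decomposition"], counts)
print(f"{ratio:.0%}")
```

Evaluating this once per analysis period would give the series of values plotted by the line graph 52; switching to the raw count simply means plotting `counts[key_word]` instead.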

The composition of a cluster obtained by the hierarchical cluster analysis (the elements included in the cluster) changes over time. To show this change over time, each cluster is named automatically. When a cluster contains only one word, that word is used as the cluster name as it is. When a cluster contains two words, the cluster name is the two words concatenated in descending order of appearance frequency. When a cluster contains three or more words, the cluster name is the three most frequent words in the cluster concatenated in descending order of appearance frequency. Cluster names composed of the same set of words are treated as the same cluster name even if the order of the words differs.
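The naming rule above can be sketched as follows. The function names, the "·" separator argument, and the occurrence counts are illustrative assumptions; the description does not say how ties in frequency are ordered, so this sketch simply keeps input order for ties.

```python
def cluster_name(words, frequency, max_words=3, sep="·"):
    """Join the most frequent words (at most max_words) of a cluster.

    sorted() is stable, so words with equal frequency keep their
    input order; tie handling is not specified in the description.
    """
    ranked = sorted(words, key=lambda w: -frequency[w])
    return sep.join(ranked[:max_words])


def same_cluster_name(name_a, name_b, sep="·"):
    """Names built from the same set of words count as the same name."""
    return set(name_a.split(sep)) == set(name_b.split(sep))


# Made-up occurrence counts for a five-word cluster.
frequency = {"belt": 9, "rotation": 7, "check": 5, "motor": 3, "tension": 2}
name = cluster_name(list(frequency), frequency)
print(name)  # belt·rotation·check
print(same_cluster_name("drive·decomposition", "decomposition·drive"))  # True
```

Comparing names as sets rather than strings is what keeps a mere reordering of the top words from being reported as a change of cluster name.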

FIGS. 8A to 8D show an example of the change over time of the hierarchical cluster analysis result. FIGS. 8A to 8D each show the result of the hierarchical cluster analysis for a different month. In FIGS. 8A to 8D, each cloud-shaped figure represents a cluster and each underlined character string represents a cluster name. The size of a circle indicates the appearance frequency of the word written inside it.

In the analysis result shown in FIG. 8A, the words extracted from the text data are classified into a cluster containing "drive" and "decomposition", a cluster containing "exhaust", "pressure", "flow", and "valve", and a cluster containing "belt", "rotation", "check", "motor", and "tension". These three clusters are named "drive·decomposition", "exhaust·pressure·flow", and "belt·rotation·check", respectively. In the analysis results shown in FIGS. 8B to 8D, the three clusters are named in the same manner.

When "decomposition" is designated as the key word, the key word cluster name is "decomposition·drive" in the analysis result shown in FIG. 8A, "drive·belt·rotation" in the analysis result shown in FIG. 8B, "exhaust·pressure·flow" in the analysis result shown in FIG. 8C, and "exhaust·pressure·decomposition" in the analysis result shown in FIG. 8D. The key word cluster name thus changes over time.

In addition to the line graph 52, the window 51 shown in FIG. 6 includes key word cluster names 53, boundary lines 54, and arrows 55. The key word cluster names 53 are displayed above the line graph 52 along the horizontally extending time axis. Each boundary line 54 is displayed within the line graph 52 at the position corresponding to a time at which the key word cluster name 53 changes. A key word cluster name 53 is displayed for each period delimited by the boundary lines 54. The background of the line graph 52 has a different appearance (for example, a different color or a different pattern) for each section delimited by the boundary lines 54. Among the words constituting a key word cluster name 53, any word changed from the preceding cluster name (a word that is not included in the old key word cluster name but is included in the new one) is highlighted. In the window 51, such words are shown in a Gothic typeface and in italics.
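The placement of boundary lines and the choice of highlighted words can be sketched as follows. The function name `name_segments`, the dictionary layout, and the month labels are hypothetical; only the sequence of cluster names comes from the example in the description.

```python
def name_segments(names_by_period, sep="·"):
    """Group consecutive periods that share a cluster name.

    Names are compared as sets of words. Each segment records the
    words that are new compared with the preceding segment, which
    are the words to highlight.
    """
    segments = []
    for period, name in names_by_period:
        words = name.split(sep)
        previous = set(segments[-1]["name"].split(sep)) if segments else None
        if previous is not None and previous == set(words):
            segments[-1]["end"] = period  # same name, extend the segment
            continue
        highlighted = [] if previous is None else [w for w in words if w not in previous]
        segments.append({"name": name, "start": period, "end": period,
                         "highlighted": highlighted})
    return segments


# Month labels are made up; the names follow the example in the text.
timeline = [
    ("2018-01", "drive·decomposition"),
    ("2018-02", "decomposition·drive"),
    ("2018-03", "drive·belt·rotation"),
    ("2018-04", "exhaust·pressure·flow"),
    ("2018-05", "exhaust·pressure·decomposition"),
]
segments = name_segments(timeline)
boundaries = [segment["start"] for segment in segments[1:]]
print(boundaries)
```

Each entry in `boundaries` is a period at which a boundary line would be drawn, and each segment's `highlighted` list holds the words to emphasize in the name shown for that segment.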

Each arrow 55 is displayed above a boundary line 54, at the position corresponding to a time at which the key word cluster name 53 changes. The arrow 55 is displayed in a form that depends on the degree of change of the key word cluster name 53. When all of the words constituting the key word cluster name 53 change, a red arrow 55r is displayed. When two of the words constituting the key word cluster name 53 change, a blue arrow 55b is displayed. When one of the words constituting the key word cluster name 53 changes, a black arrow 55n is displayed. The display form of the arrow 55 may be chosen freely as long as it differs according to the degree of change of the key word cluster name 53. For example, the display size of the arrow 55 may differ according to the degree of change of the key word cluster name 53.

In the example shown in FIG. 6, the key word cluster name 53 changes over time in the order "drive·decomposition", "drive·belt·rotation", "exhaust·pressure·flow", and "exhaust·pressure·decomposition". In the first change, two of the words constituting the key word cluster name 53 change, so a blue arrow 55b is displayed above the first boundary line 54. In the second change, all of the words constituting the key word cluster name 53 change, so a red arrow 55r is displayed above the second boundary line 54. In the third change, one of the words constituting the key word cluster name 53 changes, so a black arrow 55n is displayed above the third boundary line 54.
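The arrow rule and the FIG. 6 example can be checked with a minimal sketch. The function name `arrow_for_change` and the color strings are illustrative. A changed word is taken to be a word in the new name that is absent from the old name, as defined for highlighting; using the black arrow when words were only dropped is an assumption, since the description does not cover that case.

```python
def arrow_for_change(old_name, new_name, sep="·"):
    """Pick an arrow color from the number of changed words."""
    old_words = set(old_name.split(sep))
    new_words = new_name.split(sep)
    if set(new_words) == old_words:
        return None  # same name (word order ignored), so no arrow
    changed = [w for w in new_words if w not in old_words]
    if len(changed) == len(new_words):
        return "red"    # every word of the new name is new
    if len(changed) == 2:
        return "blue"   # two words changed
    # One word changed. Assumption: also used when words were only
    # dropped and no new word appeared.
    return "black"


names = [
    "drive·decomposition",
    "drive·belt·rotation",
    "exhaust·pressure·flow",
    "exhaust·pressure·decomposition",
]
arrows = [arrow_for_change(a, b) for a, b in zip(names, names[1:])]
print(arrows)  # ['blue', 'red', 'black']
```

The "all words changed" test runs first so that a one-word or two-word name that is replaced entirely is still classified as a complete change.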

Next, the screen display unit 16 counts the arrows 55 included in the screen displayed in step S122 by type (step S123). Based on the number of arrows 55 of each type, the screen display unit 16 then determines whether the change in the key word cluster name 53 is large (step S124). For example, the screen display unit 16 may determine yes when the number of red arrows 55r exceeds 30% of the total number of arrows 55, or when the sum of the number of red arrows 55r and the number of blue arrows 55b exceeds 60% of the total number of arrows 55. If yes, control of the text mining device 10 proceeds to step S125; if no, control proceeds to step S111.
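The two example criteria for step S124 can be sketched as follows. The function name `change_is_large`, the `rule` argument, and the color strings are hypothetical; the 30% and 60% thresholds are the example values given in the description, which presents the two criteria as alternatives.

```python
from collections import Counter


def change_is_large(arrows, rule="red"):
    """Decide whether the cluster name changes a lot (step S124).

    rule="red":      red arrows exceed 30% of all arrows.
    rule="red_blue": red plus blue arrows exceed 60% of all arrows.
    Returning False when there are no arrows is an assumption.
    """
    total = len(arrows)
    if total == 0:
        return False
    counts = Counter(arrows)
    if rule == "red":
        return counts["red"] / total > 0.30
    if rule == "red_blue":
        return (counts["red"] + counts["blue"]) / total > 0.60
    raise ValueError(f"unknown rule: {rule}")


# Arrows from the FIG. 6 example: one blue, one red, one black.
example = ["blue", "red", "black"]
print(change_is_large(example, rule="red"))       # 1/3 is above 30%
print(change_is_large(example, rule="red_blue"))  # 2/3 is above 60%
```

Under either criterion the FIG. 6 example would be judged as a large change, so control would proceed to the warning in step S125.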

In the former case, the screen display unit 16 displays a screen including a warning message (step S125). FIG. 9 shows the window displayed in step S125. The window 61 shown in FIG. 9 contains a warning message stating that, because the composition of the key word cluster often changes greatly, readjusting the settings of the hierarchical cluster analysis (for example, the number of clusters or the number of target words) is recommended. Control of the text mining device 10 then proceeds to step S111.

As described above, the text mining method according to the present embodiment includes a step of extracting words from text data consisting of sentences having dates (steps S102 and S103), a step of performing hierarchical cluster analysis on the extracted words for each analysis period (step S104), and a step of displaying a screen including the result of the hierarchical cluster analysis (steps S107, S113, and S121 to S125). When an instruction designating a key word is input in a first screen including the analysis result (a screen including the window 41) (FIG. 5), the step of displaying the screen (step S122) displays a second screen (a screen including the window 51) showing the change over time of the cluster containing the key word. Because the second screen is displayed when the key word is designated in the first screen, the user can easily recognize the change over time of the hierarchical cluster analysis result.

The second screen also shows, along the time axis, a cluster name (key word cluster name 53) based on the words included in the cluster containing the key word. This cluster name is formed by concatenating words included in the cluster containing the key word, up to a predetermined number (three or fewer), in descending order of appearance frequency. The user can therefore easily recognize the change over time of the cluster containing the key word.

The second screen also includes, at the position corresponding to a time at which the name of the cluster containing the key word changes, a mark whose appearance depends on the degree of change of the cluster name. This mark may be an arrow 55 whose color depends on the degree of change of the cluster name. By displaying a second screen including such a mark (arrow 55), the user can easily recognize the degree of change of the name of the cluster containing the key word. In addition, among the words constituting the cluster name, words changed from the preceding cluster name ("belt", "rotation", and so on in FIG. 6) are highlighted in the second screen. The user can therefore easily recognize how the frequently appearing words in the cluster containing the key word have changed.

The second screen also includes a graph (line graph 52) showing, along the time axis, the change over time of the appearance frequency of the key word. By displaying a screen that includes this graph in addition to the change over time of the cluster containing the key word, the user can easily recognize the change over time of the hierarchical cluster analysis result. The second screen further includes a boundary line 54 at the position corresponding to a time at which the name of the cluster containing the key word changes, and the background of the graph has a different appearance for each boundary line. The user can therefore easily recognize when the cluster containing the key word changes. Furthermore, when the name of the cluster containing the key word often changes greatly, the step of displaying the screen displays a screen including a warning message (a screen including the window 61). The user can therefore recognize that the hierarchical cluster analysis is not working well.

The text mining device 10 and the text mining program 31 according to the present embodiment have the same features as the text mining method described above and provide the same effects. With the text mining method, the text mining device 10, and the text mining program 31 according to the present embodiment, the user can easily recognize the change over time of the hierarchical cluster analysis result.

Although the present invention has been described in detail above, the description is illustrative in all respects and not restrictive. It is understood that numerous other modifications and variations can be devised without departing from the scope of the present invention.

10: text mining device
11: instruction input unit
12: text data storage unit
13: word extraction unit
14: clustering processing unit
15: analysis result storage unit
16: screen display unit
20: computer
21: CPU
22: main memory
29: mouse
30: recording medium
31: text mining program
32: text data
41, 51, 61: windows
42: context menu
43: mouse cursor
52: line graph
53: key word cluster name
54: boundary line
55: arrow

Claims

A text mining method for displaying a screen including an analysis result of text data, the method comprising:
extracting words from text data consisting of sentences having dates;
performing a hierarchical cluster analysis on the words for each analysis period; and
displaying a screen including a result of the hierarchical cluster analysis,
wherein, when an instruction designating a key word is input in a first screen including the result, the displaying step displays a second screen showing a change over time of a cluster including the key word.

The method of claim 1,
wherein the second screen shows, along a time axis, a cluster name based on words included in the cluster.

The method of claim 2,
wherein the cluster name is formed by concatenating words included in the cluster, up to a predetermined number, in descending order of appearance frequency.

The method of claim 2,
wherein the second screen further includes, at a position corresponding to a time at which the cluster name changes, a mark whose appearance depends on the degree of change of the cluster name.

The method of claim 4,
wherein the mark is an arrow having a color that depends on the degree of change of the cluster name.

The method of claim 2,
wherein, among the words constituting the cluster name, a word changed from the preceding cluster name is highlighted in the second screen.

The method of claim 2,
wherein the second screen further includes a graph showing, along the time axis, a change over time of the appearance frequency of the key word.

The method of claim 7,
wherein the second screen further includes a boundary line at a position corresponding to a time at which the cluster name changes, and the background of the graph has a different appearance for each boundary line.

The method of claim 2,
wherein, when the cluster name changes greatly, the displaying step displays a screen including a warning message.

A text mining program stored in a medium for displaying a screen including an analysis result of text data, the program causing a computer, using a memory, to execute:
extracting words from text data consisting of sentences having dates;
performing a hierarchical cluster analysis on the words for each analysis period; and
displaying a screen including a result of the hierarchical cluster analysis,
wherein, when an instruction designating a key word is input in a first screen including the result, the displaying step displays a second screen showing a change over time of a cluster including the key word.

The text mining program of claim 10,
wherein the second screen shows, along a time axis, a cluster name based on words included in the cluster.

The text mining program of claim 11,
wherein the cluster name is formed by concatenating words included in the cluster, up to a predetermined number, in descending order of appearance frequency.

The text mining program of claim 11,
wherein the second screen further includes, at a position corresponding to a time at which the cluster name changes, a mark whose appearance depends on the degree of change of the cluster name.

The text mining program of claim 13,
wherein the mark is an arrow having a color that depends on the degree of change of the cluster name.

The text mining program of claim 11,
wherein, among the words constituting the cluster name, a word changed from the preceding cluster name is highlighted in the second screen.

The text mining program of claim 11,
wherein the second screen further includes a graph showing, along the time axis, a change over time of the appearance frequency of the key word.

The text mining program of claim 16,
wherein the second screen further includes a boundary line at a position corresponding to a time at which the cluster name changes, and the background of the graph has a different appearance for each boundary line.

The text mining program of claim 11,
wherein, when the cluster name changes greatly, the displaying step displays a screen including a warning message.

A text mining device for displaying a screen including an analysis result of text data, the device comprising:
a word extraction unit that extracts words from text data consisting of sentences having dates;
a clustering processing unit that performs a hierarchical cluster analysis on the words for each analysis period; and
a screen display unit that displays a screen including a result of the hierarchical cluster analysis,
wherein, when an instruction designating a key word is input in a first screen including the result, the screen display unit displays a second screen showing a change over time of a cluster including the key word.

The text mining device of claim 19,
wherein the second screen shows, along a time axis, a cluster name based on words included in the cluster.