KR102180487B1 - Text mining method, text mining program, and text mining device

Info

Publication number: KR102180487B1
Application number: KR1020197000933A
Authority: KR
Inventors: 마사시 아키타; 야스노리 나카무라; 징롱 저우
Original assignee: 가부시키가이샤 스크린 홀딩스
Priority date: 2016-07-25
Filing date: 2017-06-06
Publication date: 2020-11-18
Also published as: TWI686716B; KR20190018480A; TW201807597A; WO2018020842A1; JP6794162B2; CN109478191A; CN109478191B; JP2018018118A

Abstract

In text analysis steps (S109 to S110), hierarchical cluster analysis is performed on words extracted from input text data. In a screen generation step (S111), m clusters are obtained from the analysis result of the text analysis steps based on a number of groups (m) and a maximum number of data items per group (n), and screen data is generated for displaying, on a screen, groups that each contain at most n of the words included in a cluster. In an analysis result display step (S112), a screen is displayed based on the generated screen data. As a result, the result of the hierarchical cluster analysis is displayed on the screen in a way that the user can understand intuitively.

Description

Text mining method, text mining program, and text mining device

The present invention relates to text mining, and more particularly to a text mining method, a text mining program, and a text mining device that display an analysis result of text data on a screen.

In recent years, text mining, which analyzes a large amount of freely written text data and derives useful information from the analysis result, has attracted attention. In text mining, information is obtained by, for example, extracting words from the text data to be analyzed and analyzing the frequency of appearance, the appearance tendency, and the like of the words.

In the following, a text mining device that performs hierarchical cluster analysis on words extracted from text data and displays the analysis result on a screen is considered. In hierarchical cluster analysis, clusters containing words with high similarity are created hierarchically based on the similarity between words. In general, the result of hierarchical cluster analysis is presented to a user (analyst) using a tree diagram (dendrogram) such as the one shown in FIG. 15.

In relation to the present invention, Patent Document 1 describes a clustering device having hierarchical clustering means that constructs a tree diagram, traverses the tree diagram to generate an index with which an upper layer can be identified from a lower layer, and stores the index in storage means. Patent Document 2 describes a query providing device having distance matrix calculation means and clustering means. The distance matrix calculation means calculates distances between keywords, generates distance matrix data with which the distance between keywords can be looked up from the keywords, and stores the data in storage means. The clustering means hierarchically clusters the keywords using the distance matrix and stores the constructed tree diagram in the storage means as a bottom-up index that can be traversed from a lower layer to an upper layer.

Japanese Unexamined Patent Publication No. 2011-216021
Japanese Unexamined Patent Publication No. 2012-150539

A conventional text mining device displays the result of hierarchical cluster analysis on a screen using a tree diagram. However, such a text mining device has a problem in that the user cannot intuitively understand the analysis result. For example, to set the number of clusters to 4 for the analysis result shown in FIG. 15, the user sets a cut line on the tree diagram as shown in FIG. 16. However, the user cannot intuitively recognize the words included in each cluster merely by looking at such a tree diagram. In addition, when the number of words is large and the number of clusters is changed, the user cannot intuitively grasp how the words included in each cluster will change.

Furthermore, since the tree diagram does not show the frequency of appearance of words, the user cannot tell which words are important. When the text data to be analyzed is time-series data having information such as a date or a time, the user may wish to know how the analysis result changes over time. A conventional text mining device cannot meet such a request.

Accordingly, an object of the present invention is to provide a text mining method, a text mining program, and a text mining device that display the result of hierarchical cluster analysis on a screen in a way that the user can understand intuitively.

A first aspect of the present invention is a text mining method for displaying an analysis result of text data on a screen, the method including:

a text analysis step of performing hierarchical cluster analysis on words extracted from input text data;

a screen generation step of generating screen data based on an analysis result of the text analysis step; and

an analysis result display step of displaying a screen based on the screen data,

wherein the screen generation step, based on a number of groups and a maximum number of data items per group, obtains as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups each containing words included in a cluster, up to the maximum number of data items.
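The screen generation step of the first aspect can be illustrated with a minimal sketch. The text does not specify a linkage criterion, nor how the displayed words are chosen at this level, so the single-linkage merge, the function names `cut_into_clusters` and `build_groups`, and the toy word positions below are all assumptions made for illustration. With single linkage, stopping the agglomerative merging when m clusters remain gives the same partition as cutting the finished tree diagram at the level that yields m clusters.

```python
from itertools import combinations


def cut_into_clusters(words, distance, m):
    """Merge the two closest clusters repeatedly until only m clusters remain."""
    clusters = [[w] for w in words]

    def linkage(pair):
        # Single linkage (an assumption): distance between the closest members.
        a, b = pair
        return min(distance(x, y) for x in clusters[a] for y in clusters[b])

    while len(clusters) > m:
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def build_groups(clusters, n):
    """Keep at most n words of each cluster for display (here: the first n)."""
    return [cluster[:n] for cluster in clusters]


# Hypothetical toy data: each word sits on a line; distance is the gap.
positions = {"pump": 0, "valve": 3, "leak": 4, "sensor": 20, "alarm": 22, "noise": 60}
clusters = cut_into_clusters(
    list(positions), lambda x, y: abs(positions[x] - positions[y]), m=3
)
groups = build_groups(clusters, n=2)
print(clusters)
print(groups)
```

Each displayed group corresponds to exactly one cluster, and no group holds more than n words, which are the two constraints the first aspect places on the screen data.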

In a second aspect of the present invention, in the first aspect of the present invention,

the words included in the group are selected, in descending order of frequency of appearance, from among the words included in the cluster corresponding to the group.
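The selection rule of the second aspect amounts to a sort followed by a truncation. The following is a minimal sketch; the function name `select_display_words`, the sample counts, and the alphabetical tie-break between equally frequent words are assumptions, since the text does not say how ties are resolved.

```python
def select_display_words(cluster_words, frequency, n):
    """Return up to n words of a cluster, most frequent first."""
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ranked = sorted(cluster_words, key=lambda w: (-frequency[w], w))
    return ranked[:n]


# Hypothetical appearance counts for the words of one cluster.
frequency = {"pump": 12, "valve": 30, "leak": 7, "seal": 30}
shown = select_display_words(["pump", "valve", "leak", "seal"], frequency, n=3)
print(shown)
```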

In a third aspect of the present invention, in the second aspect of the present invention,

the group has, within the screen, a size corresponding to the sum of the frequencies of appearance of the words included in the cluster corresponding to the group.
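One way to realize the sizing rule of the third aspect is to divide the available screen area in proportion to each cluster's total frequency. This is only a sketch: the text says the size corresponds to the sum of frequencies but does not fix the mapping, so the proportional rule, the name `group_areas`, and the sample numbers are assumptions. The sum is taken over all words of the cluster, not only the displayed ones, and at least one word is assumed to have a nonzero count.

```python
def group_areas(clusters, frequency, total_area):
    """Split total_area among groups in proportion to summed word frequency."""
    totals = [sum(frequency[w] for w in cluster) for cluster in clusters]
    grand_total = sum(totals)  # assumed to be nonzero
    return [total_area * t / grand_total for t in totals]


# Hypothetical counts: the first cluster totals 40, the second 60.
frequency = {"pump": 30, "valve": 10, "sensor": 60}
areas = group_areas([["pump", "valve"], ["sensor"]], frequency, total_area=1000)
print(areas)
```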

In a fourth aspect of the present invention, in the third aspect of the present invention,

each word included in the group has, within the screen, a size corresponding to the frequency of appearance of the word.
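The word sizing of the fourth aspect can be sketched as a mapping from frequency to font size. The linear interpolation, the function name `font_size`, and the point-size limits are assumptions; the text only requires that the size depend on the frequency of appearance.

```python
def font_size(count, max_count, min_pt=10.0, max_pt=36.0):
    """Map a word's frequency linearly onto a font size in points."""
    # max_count is the highest frequency among the displayed words (> 0).
    return min_pt + (max_pt - min_pt) * count / max_count


print(font_size(25, 50))
```

A logarithmic mapping would be an equally valid choice when a few words dominate the counts.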

In a fifth aspect of the present invention, in the first aspect of the present invention,

the method further includes an instruction input step for inputting an instruction from the user, and

one of the text analysis step and the screen generation step is executed based on the instruction input in the instruction input step.

In a sixth aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives an instruction to set the number of groups, and

the screen generation step generates the screen data based on the number of groups set in the instruction input step.

In a seventh aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives an instruction to set the maximum number of data items, and

the screen generation step generates the screen data based on the maximum number of data items set in the instruction input step.

In an eighth aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives an instruction to set an analysis target period, and

the text analysis step performs the hierarchical cluster analysis on words included in the portion of the text data that falls within the analysis target period set in the instruction input step.
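The period restriction of the eighth aspect is a filter applied before word extraction. A minimal sketch follows; the record layout (a date paired with a text), the inclusive end points, and the name `records_in_period` are assumptions, since the text does not describe the data format at this point.

```python
from datetime import date


def records_in_period(records, start, end):
    """Keep records dated within the inclusive range [start, end]."""
    return [(d, text) for d, text in records if start <= d <= end]


# Hypothetical time-series text data.
records = [
    (date(2016, 5, 30), "pump noise reported"),
    (date(2016, 6, 15), "valve leak found"),
    (date(2016, 7, 2), "sensor alarm raised"),
]
june = records_in_period(records, date(2016, 6, 1), date(2016, 6, 30))
print(june)
```

Running the analysis once per period and comparing the resulting screens is what lets the user see how the clusters change over time.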

In a ninth aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives an instruction to set an analysis purpose, and

the text analysis step extracts, from the text data, words of a type corresponding to the analysis purpose set in the instruction input step, and performs the hierarchical cluster analysis.

In a tenth aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives a word exclusion instruction, and

the text analysis step performs the hierarchical cluster analysis with the words specified in the instruction input step excluded.

In an eleventh aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives a synonym registration instruction, and

the text analysis step performs the hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word.

In a twelfth aspect of the present invention, in the fifth aspect of the present invention,

the instruction input step receives a compound word registration instruction, and

the text analysis step performs the hierarchical cluster analysis after merging the plurality of words specified in the instruction input step into one word.
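The word exclusion, synonym registration, and compound word registration of the tenth to twelfth aspects can be pictured as one preprocessing pass over the extracted words before clustering. The sketch below is illustrative only: the function name `preprocess`, its parameters, the order of the three operations, and the sample tokens are assumptions not taken from the text.

```python
def preprocess(tokens, excluded=(), synonyms=None, compounds=(), joiner=""):
    """Apply compound merging, synonym unification, then word exclusion."""
    synonyms = synonyms or {}

    # 1. Compound words: merge adjacent tokens matching a registered sequence.
    merged, i = [], 0
    while i < len(tokens):
        for parts in compounds:
            if parts and tuple(tokens[i:i + len(parts)]) == tuple(parts):
                merged.append(joiner.join(parts))
                i += len(parts)
                break
        else:
            merged.append(tokens[i])
            i += 1

    # 2. Synonyms: replace each registered variant with its representative word.
    unified = [synonyms.get(t, t) for t in merged]

    # 3. Exclusion: drop the words the user asked to ignore.
    return [t for t in unified if t not in excluded]


tokens = ["the", "liquid", "crystal", "screen", "and", "the", "monitor"]
result = preprocess(
    tokens,
    excluded={"the", "and"},
    synonyms={"screen": "display", "monitor": "display"},
    compounds=[("liquid", "crystal")],
    joiner=" ",
)
print(result)
```

After this pass, "screen" and "monitor" are counted as a single word and "liquid crystal" is clustered as one unit, so both the frequencies and the cluster memberships can change.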

In a thirteenth aspect of the present invention, in the first aspect of the present invention,

the screen generation step generates screen data for displaying an analysis result screen including the groups and an analysis setting screen for setting the display mode of the analysis result screen.

A fourteenth aspect of the present invention is a text mining program for displaying an analysis result of text data on a screen,

the program causing a computer, by means of a CPU using a memory, to execute an analysis result display step of displaying a screen based on the screen data,

In a fifteenth aspect of the present invention, in the fourteenth aspect of the present invention,

In a sixteenth aspect of the present invention, in the fifteenth aspect of the present invention,

In a seventeenth aspect of the present invention, in the sixteenth aspect of the present invention,

In an eighteenth aspect of the present invention, in the fourteenth aspect of the present invention,

the program further causes the computer to execute an instruction input step for inputting an instruction from the user,

In a nineteenth aspect of the present invention, in the fourteenth aspect of the present invention,

A twentieth aspect of the present invention is a text mining device for displaying an analysis result of text data on a screen, the device including:

a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data;

a screen generation unit that generates screen data based on an analysis result of the text analysis unit; and

an analysis result display unit that displays a screen based on the screen data,

wherein the screen generation unit, based on a number of groups and a maximum number of data items per group, obtains as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups each containing words included in a cluster, up to the maximum number of data items.

In a twenty-first aspect of the present invention, in the twentieth aspect of the present invention,

In a twenty-second aspect of the present invention, in the twenty-first aspect of the present invention,

In a twenty-third aspect of the present invention, in the twenty-second aspect of the present invention,

In a twenty-fourth aspect of the present invention, in the twentieth aspect of the present invention,

the device further includes an instruction input unit for inputting an instruction from the user, and

one of the text analysis unit and the screen generation unit operates based on the instruction input through the instruction input unit.

In a twenty-fifth aspect of the present invention, in the twentieth aspect of the present invention,

the screen generation unit generates screen data for displaying an analysis result screen including the groups and an analysis setting screen for setting the display mode of the analysis result screen.

According to the first, fourteenth, or twentieth aspect of the present invention, groups containing the words included in clusters are displayed on the screen based on the result of hierarchical cluster analysis performed on the words included in the text data. In addition, the number of words included in each group is limited to the maximum number of data items or fewer. The user can therefore intuitively understand the result of the hierarchical cluster analysis when viewing the screen.

According to the second, fifteenth, or twenty-first aspect of the present invention, the words with a high frequency of appearance among the words included in a cluster are displayed inside the group. The user can therefore easily recognize the frequently appearing words included in each cluster.

According to the third, sixteenth, or twenty-second aspect of the present invention, each group has, within the screen, a size corresponding to the sum of the frequencies of appearance of the words included in the cluster. The user can therefore easily recognize a cluster whose words have a large total frequency of appearance.

According to the fourth, seventeenth, or twenty-third aspect of the present invention, each word has, within the screen, a size corresponding to the frequency of appearance of the word. The user can therefore easily recognize words with a high frequency of appearance.

According to the fifth, eighteenth, or twenty-fourth aspect of the present invention, the display mode of the hierarchical cluster analysis result can be switched according to an instruction from the user.

According to the sixth aspect of the present invention, the number of groups displayed on the screen (the number of clusters) can be switched according to an instruction from the user.

According to the seventh aspect of the present invention, the upper limit on the number of words included in a group can be switched according to an instruction from the user.

According to the eighth aspect of the present invention, the result of hierarchical cluster analysis performed on the words included in the text data within the analysis target period specified by the user is displayed on the screen. The user can therefore easily recognize how the hierarchical cluster analysis result changes over time.

According to the ninth aspect of the present invention, the type of words to be analyzed can be switched according to the analysis purpose specified by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the tenth aspect of the present invention, the result of hierarchical cluster analysis performed with the words specified by the user excluded can be displayed on the screen.

According to the eleventh aspect of the present invention, the result of hierarchical cluster analysis performed while treating a plurality of words specified by the user as the same word can be displayed on the screen.

According to the twelfth aspect of the present invention, the result of hierarchical cluster analysis performed after merging a plurality of words specified by the user into one word can be displayed on the screen.

According to the thirteenth, nineteenth, or twenty-fifth aspect of the present invention, the analysis result screen and the analysis setting screen are displayed. The user can therefore easily switch the display mode of the hierarchical cluster analysis result using the analysis setting screen.

FIG. 1 is a block diagram showing the configuration of a text mining device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining device shown in FIG. 1.
FIG. 3 is a diagram showing a display screen of the text mining device shown in FIG. 1.
FIG. 4 is a flowchart showing the operation of the text mining device shown in FIG. 1.
FIG. 5 is a flowchart of the screen data generation processing of the text mining device shown in FIG. 1.
FIG. 6 is a diagram showing a data designation screen of the text mining device shown in FIG. 1.
FIG. 7 is a diagram showing an example of text data input to the text mining device shown in FIG. 1.
FIG. 8 is a diagram showing a purpose designation screen of the text mining device shown in FIG. 1.
FIG. 9 is a diagram showing a synonym list selection screen of the text mining device shown in FIG. 1.
FIG. 10 is a diagram showing a compound word list selection screen of the text mining device shown in FIG. 1.
FIG. 11A is a diagram showing an analysis result screen before an analysis target period is set in the text mining device shown in FIG. 1.
FIG. 11B is a diagram showing an analysis result screen after an analysis target period is set in the text mining device shown in FIG. 1.
FIG. 12A is a diagram showing an analysis result screen before word exclusion is performed in the text mining device shown in FIG. 1.
FIG. 12B is a diagram showing an analysis result screen after word exclusion is performed in the text mining device shown in FIG. 1.
FIG. 13A is a diagram showing an analysis result screen before synonym registration is performed in the text mining device shown in FIG. 1.
FIG. 13B is a diagram showing an analysis result screen after synonym registration is performed in the text mining device shown in FIG. 1.
FIG. 14A is a diagram showing an analysis result screen before compound word registration is performed in the text mining device shown in FIG. 1.
FIG. 14B is a diagram showing an analysis result screen after compound word registration is performed in the text mining device shown in FIG. 1.
FIG. 15 is a diagram showing an example of a tree diagram.
FIG. 16 is a diagram showing the tree diagram of FIG. 15 with the number of clusters set.
FIG. 17 is a diagram showing words that appear in the drawings and their description.

Hereinafter, a text mining method, a text mining program, and a text mining device according to an embodiment of the present invention will be described with reference to the drawings. The text mining method according to the present embodiment is typically executed using a computer. The text mining program according to the present embodiment is a program for carrying out the text mining method using a computer. The text mining device according to the present embodiment is typically configured using a computer. A computer that executes the text mining program functions as the text mining device.

FIG. 1 is a block diagram showing the configuration of the text mining device according to the embodiment of the present invention. The text mining device 10 shown in FIG. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14. Text data 5 to be analyzed is input to the text mining device 10. The text mining device 10 performs hierarchical cluster analysis on words extracted from the input text data 5 and displays the analysis result on a screen.

The operation of the text mining device 10 is outlined as follows. Instructions from the user are input to the instruction input unit 11. The text analysis unit 12 extracts words from the input text data 5 and performs hierarchical cluster analysis on the extracted words. The screen generation unit 13 generates screen data based on the analysis result of the text analysis unit 12. The analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.

The instructions from the user that are input to the instruction input unit 11 include setting the number of groups, setting the maximum number of data items per group, setting the analysis target period, word exclusion, synonym registration, compound word registration, and the like. When the text data 5 is time-series data having information such as a date or a time, the text analysis unit 12 performs hierarchical cluster analysis on the words included in the portion of the input text data 5 that falls within the analysis target period set through the instruction input unit 11.

When generating screen data, the screen generation unit 13 follows the number of groups and the maximum number of data items per group (details are described later). When the user inputs a new instruction, the instructed processing is performed, after which the screen generation unit 13 generates new screen data and the analysis result display unit 14 displays a new screen. In this way, the text mining device 10 switches the analysis mode of the text data 5 and the display mode of the analysis result according to instructions from the user.
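The way a new instruction triggers reprocessing can be sketched as follows. Per the sixth to twelfth aspects, the number of groups and the maximum number of data items are consumed by screen generation, while the period, word exclusion, synonym, and compound word instructions affect the text analysis. The class `AnalysisSettings`, its field names and defaults, and the step labels are hypothetical, and whether the device actually skips reanalysis when only display settings change is an assumption of this sketch.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class AnalysisSettings:
    """User instructions accepted by the instruction input unit (names assumed)."""
    group_count: int = 4
    max_words_per_group: int = 10
    period: Optional[tuple] = None          # (start, end), or None for all data
    excluded_words: frozenset = frozenset()
    synonyms: tuple = ()                    # e.g. (("monitor", "display"),)
    compounds: tuple = ()                   # e.g. (("liquid", "crystal"),)


# Settings that only the screen generation unit depends on.
SCREEN_ONLY = {"group_count", "max_words_per_group"}


def steps_to_rerun(old, new):
    """List the processing steps needed after the settings change."""
    changed = {
        f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)
    }
    if not changed:
        return []
    if changed <= SCREEN_ONLY:
        return ["generate_screen", "display"]
    return ["analyze_text", "generate_screen", "display"]


base = AnalysisSettings()
print(steps_to_rerun(base, replace(base, group_count=6)))
print(steps_to_rerun(base, replace(base, excluded_words=frozenset({"the"}))))
```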

FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining device 10. The computer 20 shown in FIG. 2 includes a CPU 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27. The main memory 22 is, for example, a DRAM. The storage unit 23 is, for example, a hard disk or a solid state drive. The input unit 24 includes, for example, a keyboard 28 and a mouse 29. The display unit 25 is, for example, a liquid crystal display. The communication unit 26 is an interface circuit for wired or wireless communication. The recording medium reading unit 27 is an interface circuit for a recording medium 30 that stores programs and the like. The recording medium 30 is, for example, a non-transitory recording medium such as a CD-ROM, a DVD-ROM, or a USB memory.

When the computer 20 executes the text mining program 31, the storage unit 23 stores the text mining program 31 and the text data 5. The text mining program 31 and the text data 5 may be, for example, received from a server or another computer using the communication unit 26, or read from the recording medium 30 using the recording medium reading unit 27.

When the text mining program 31 is executed, the text mining program 31 and the text data 5 are copied to the main memory 22. The CPU 21 uses the main memory 22 as a working memory and executes the text mining program 31 stored in the main memory 22, thereby processing the text data 5 stored in the main memory 22. At this time, the computer 20 functions as the text mining device 10. The configuration of the computer 20 described above is merely an example, and the text mining device 10 can be configured using any computer.

In the following, the text data 5 is assumed to be Japanese-language data containing Japanese words. FIG. 17 lists the words that appear in the drawings and their description; each row of FIG. 17 gives a Japanese word and its meaning. When a Japanese word is mentioned in the following description, its meaning may be given in parentheses after the word. The text data 5 may, however, be data in any language.

FIG. 3 is a diagram showing a display screen of the text mining device 10. The display screen 40 shown in FIG. 3 includes an analysis result screen 41 and an analysis setting screen 42. The analysis result screen 41 displays the result of analysis by the text analysis unit 12. The analysis setting screen 42 displays graphical user interface components for setting the analysis mode of the text analysis unit 12 and the characteristics of the screen data generated by the screen generation unit 13.

Once the number of clusters is set for the result of hierarchical cluster analysis, the words belonging to each cluster are determined. When displaying on the screen the result of hierarchical cluster analysis of the words extracted from the text data 5, the text mining device 10 displays groups corresponding to the clusters in the manner shown in FIG. 3 instead of a tree diagram.

In the following description, a cluster displayed on the screen is also called a group. Using the instruction input unit 11, the user specifies the number of groups (the number of clusters) and the maximum number of data in a group (the upper limit on the number of words included in a group). Hereinafter, the former is denoted by m and the latter by n.

In the text mining device 10, the words contained in the text data 5 are classified into m clusters, and each cluster contains one or more words. The analysis result screen 41 displays m groups, with words displayed inside each group. Each group is drawn as a cloud-shaped figure, and each word in a group is displayed inside an elliptical area. The number of words in each group is limited to n or fewer. For example, when n = 5 and a cluster contains 10 words, five words are displayed inside the corresponding group on the analysis result screen 41.

The analysis setting screen 42 displays a first slider and two first buttons (labeled "+" and "-") for setting the number of groups m, a second slider and two second buttons for setting the maximum number of data in a group n, and four boxes and two third buttons (labeled with a left arrow and a right arrow) for setting the analysis target period.

The user specifies the number of groups m by operating the mouse 29 to move the tab of the first slider left or right, or to press a first button. The number of groups m increases when the first button labeled "+" is pressed and decreases when the first button labeled "-" is pressed. The initial value of m is set, for example, to the square root of the number of distinct words contained in the analysis result of the text analysis unit 12, or to an integer close to it. For example, when the analysis result contains 16 distinct words, the initial value of m is set to 4.
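The initial-value rule above can be sketched as follows. This is only an illustration: the function name `initial_group_count`, the choice of rounding to the nearest integer, and the clamping to at least one group are assumptions, since the description only requires "the square root, or an integer close to it".

```python
import math


def initial_group_count(distinct_word_count: int) -> int:
    """Return an initial number of groups m near sqrt(number of distinct words).

    Hypothetical helper: rounding and clamping are assumptions, not taken
    from the description.
    """
    if distinct_word_count <= 0:
        return 1  # assumption: always show at least one group
    m = round(math.sqrt(distinct_word_count))
    # Keep m between 1 and the number of distinct words.
    return max(1, min(m, distinct_word_count))
```

With 16 distinct words this yields m = 4, matching the example in the text.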

The user specifies the maximum number of data in a group n by operating the mouse 29 to move the tab of the second slider left or right, or to press a second button. The value of n increases or decreases when a second button is pressed. The initial value of n is set, for example, to 5.

When the text data 5 is time-series data, the user specifies the analysis target period by operating the keyboard 28 or the mouse 29 to enter dates and times in the four boxes, or by pressing a third button. The analysis target period moves into the past by a predetermined amount (for example, one month) when the third button with the left arrow is pressed, and moves by the predetermined amount in the opposite direction when the third button with the right arrow is pressed. The initial value of the analysis target period is set, for example, to the period from the oldest time to the newest time in the text data 5. When the text data 5 is not time-series data, the user cannot specify an analysis target period.
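A minimal sketch of the arrow-button behavior, assuming the predetermined amount is a whole number of months and that the period is held as a pair of `datetime` values. The names `shift_months` and `shift_period`, and the rule of clamping the day to the last day of a shorter month, are hypothetical choices made for this example.

```python
import calendar
from datetime import datetime


def shift_months(t: datetime, months: int) -> datetime:
    """Move t by a whole number of months (negative = into the past)."""
    index = t.year * 12 + (t.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    # Assumption: clamp the day when the target month is shorter.
    day = min(t.day, calendar.monthrange(year, month)[1])
    return t.replace(year=year, month=month, day=day)


def shift_period(start: datetime, end: datetime, months: int):
    """Shift both ends of the analysis target period by the same amount."""
    return shift_months(start, months), shift_months(end, months)
```

Pressing the left-arrow button would then correspond to `shift_period(start, end, -1)` and the right-arrow button to `shift_period(start, end, +1)`.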

The analysis result screen 41 displays at least one and at most m groups, and at least one and at most n words are displayed inside each group. The larger the total appearance frequency of the words in the corresponding cluster, the larger a group is displayed on the screen. When a cluster contains more than n words, the n words with the highest appearance frequencies are displayed inside the group. A word in a group, and the elliptical area containing it, are displayed larger the higher the appearance frequency of the word. Each group is given a name, which is the word with the highest appearance frequency among the words in the cluster. The name of a group is displayed underlined inside the group. When a word cannot be displayed inside its elliptical area, the symbol "…" is displayed instead of the word.

The analysis result screen 41 displays a third slider and two fourth buttons (labeled "+" and "-") for specifying the zoom magnification. The user sets the zoom magnification by operating the mouse 29 to move the tab of the third slider left or right, or to press a fourth button. On the analysis result screen 41, the groups containing words are enlarged or reduced according to the set zoom magnification. The initial value of the zoom magnification is 100%. In the initial state, the analysis result screen 41 displays all groups.

When the user changes the number of groups m, the maximum number of data in a group n, or the analysis target period on the analysis setting screen 42, the content of the analysis result screen 41 changes accordingly. The content of the analysis result screen 41 also changes accordingly when the user instructs word exclusion, synonym registration, or compound word registration on the analysis result screen 41.

When performing hierarchical cluster analysis on the words extracted from the text data 5, the text mining device 10 refers to an excluded word list storing words to be excluded, a synonym list storing words to be treated as synonyms, and a compound word list storing words to be treated as compound words. The synonym list stores a plurality of words having the same (or nearly the same) meaning in association with one word that represents them. The compound word list stores a plurality of words that form one compound word when concatenated, in association with the compound word formed by concatenating them. For example, the synonym list stores "daigakusei (university student)" and "gakusei (student)" in association with "daigakusei", which represents both. The compound word list stores, for example, "nintai (patience)" and "tsuyoi (strong)" in association with "nintaizuyoi (very patient)", which concatenates the two. The text mining device 10 may have a plurality of synonym lists and a plurality of compound word lists.

FIG. 4 is a flowchart showing the operation of the text mining device 10. FIG. 5 is a flowchart showing the details of the screen data generation process (step S111 in FIG. 4) of the text mining device 10. The input unit 24 and the CPU 21 executing step S113 function as the instruction input unit 11. The CPU 21 executing steps S109 to S110 functions as the text analysis unit 12. The CPU 21 executing step S111 functions as the screen generation unit 13. The display unit 25 and the CPU 21 executing step S112 function as the analysis result display unit 14. The operation of the text mining device 10 is described below with reference to FIGS. 4 and 5.

First, the CPU 21 causes the display unit 25 to display the data designation screen 51 shown in FIG. 6 (step S101). The data designation screen 51 displays a box for specifying a file name and a box for specifying a folder name. The user designates the text data 5 to be analyzed by specifying a file name or a folder name on the data designation screen 51. The text data 5 may be stored in the storage unit 23, such as a hard disk, or in a server or another computer connected via the communication unit 26.

Next, the CPU 21 transfers the text data 5 designated on the data designation screen 51 to the main memory 22. The text data 5 is thereby input to the text mining device 10 (step S102). FIG. 7 is a diagram showing an example of the text data 5. The text data shown in FIG. 7 is data of reports written by university students, and is time-series data carrying date information. In order from the top, the text data shown in FIG. 7 reads "Regarding the relationship between university students and society in this lecture …", "In general, before graduating and entering society, university students take part-time jobs and …", "We students should be aware that we are studying while paying high tuition fees …", and "Student life is a valuable time for one's own growth. Also …". The type of text data 5 analyzed by the text mining device 10 is arbitrary.

Next, the CPU 21 causes the display unit 25 to display the purpose designation screen 52 shown in FIG. 8 (step S103). The purpose designation screen 52 displays three radio buttons corresponding to content, feature, and reputation. By operating the mouse 29 to press one of the radio buttons, the user selects the analysis purpose from among content, feature, and reputation. The CPU 21 then receives the analysis purpose designated on the purpose designation screen 52. The analysis purpose is thereby input to the text mining device 10 (step S104).

Next, the CPU 21 causes the display unit 25 to display the synonym list selection screen 53 shown in FIG. 9 (step S105). The synonym list selection screen 53 displays the names of the synonym lists held by the text mining device 10 and the synonyms registered in each list. The user designates the synonym list to be used by operating the mouse 29 to select one of the synonym lists on the synonym list selection screen 53. A synonym list is thereby selected in the text mining device 10 (step S106).

Next, the CPU 21 causes the display unit 25 to display the compound word list selection screen 54 shown in FIG. 10 (step S107). The compound word list selection screen 54 displays the names of the compound word lists held by the text mining device 10 and the compound words registered in each list. The user designates the compound word list to be used by operating the mouse 29 to select one of the compound word lists on the compound word list selection screen 54. A compound word list is thereby selected in the text mining device 10 (step S108).

Next, taking the excluded word list, the synonym list, and the compound word list into account, the CPU 21 extracts words of the types corresponding to the analysis purpose designated in step S104 from the portion of the text data 5 input in step S102 that falls within the analysis target period (step S109). When the analysis purpose is "content", the CPU 21 extracts nouns, proper nouns, place names, and personal names from the text data 5. When the analysis purpose is "feature", the CPU 21 extracts nouns, proper nouns, sa-row irregular conjugation nouns (nouns that form verbs with "suru"), and verbs. When the analysis purpose is "reputation", the CPU 21 extracts adjectives, adjectival verbs, and interjections. The text mining device 10 may support analysis purposes other than these three, and the CPU 21 may extract word types different from the above for each analysis purpose.
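The correspondence between analysis purpose and extracted word types can be sketched as a lookup table. The part-of-speech labels, the table name `POS_BY_PURPOSE`, and the function `select_words` are hypothetical; the sketch assumes the text has already been split into (word, part of speech) pairs by some morphological analyzer, which the description does not name.

```python
# Hypothetical part-of-speech labels; a real morphological analyzer
# would use its own tag set.
POS_BY_PURPOSE = {
    "content": {"noun", "proper_noun", "place_name", "person_name"},
    "feature": {"noun", "proper_noun", "suru_noun", "verb"},
    "reputation": {"adjective", "adjectival_verb", "interjection"},
}


def select_words(tagged_tokens, purpose):
    """Keep only the words whose part of speech matches the analysis purpose.

    tagged_tokens: iterable of (word, part_of_speech) pairs.
    """
    allowed = POS_BY_PURPOSE[purpose]
    return [word for word, pos in tagged_tokens if pos in allowed]
```

Keeping the mapping in a table makes it straightforward to add further analysis purposes or to change the word types for an existing one, both of which the description allows.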

When the text data 5 is time-series data, the CPU 21, in executing step S109, extracts words only from the portion of the text data 5 that falls within the analysis target period specified by the user. When a word W1 is stored in the excluded word list, the CPU 21 ignores every occurrence of the word W1 in the text data 5. When the selected synonym list stores a word W2 and a word W3 in association with the word W2 representing both, the CPU 21 treats every occurrence of the word W3 in the text data 5 as the word W2. When the selected compound word list stores a word W4 and a word W5 in association with a word W6 that concatenates the two, the CPU 21 treats every occurrence of the word W4 immediately followed by the word W5 in the text data 5 as the word W6.
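The three list-based rules can be sketched on a plain list of tokens. The function name `normalize_tokens`, the data shapes of the lists, and the order of application (compound words first, then synonyms, then exclusion) are assumptions made for this illustration; the description does not fix an order.

```python
def normalize_tokens(tokens, excluded, synonyms, compounds):
    """Apply compound-word, synonym, and excluded-word rules to a token list.

    tokens:    list of words in order of appearance
    excluded:  set of words to ignore (W1)
    synonyms:  dict mapping a word to its representative (W3 -> W2)
    compounds: dict mapping a pair of consecutive words to the
               concatenated word ((W4, W5) -> W6)
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            merged.append(compounds[pair])  # W4 followed by W5 becomes W6
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    replaced = [synonyms.get(word, word) for word in merged]  # W3 becomes W2
    return [word for word in replaced if word not in excluded]  # drop W1
```

Because synonyms are replaced before counting, the appearance frequency of the representative word becomes the sum of the frequencies of all its synonyms.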

Next, the CPU 21 performs hierarchical cluster analysis on the words extracted in step S109 (step S110). In step S110, the CPU 21 obtains the similarity between two words based on, for example, the distance between the two words in the text data 5 (how far apart the two words appear). Based on the obtained similarities, the CPU 21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the centroid method, or Ward's method). In step S110, the CPU 21 also obtains the appearance frequency of each word.
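A minimal pure-Python sketch of step S110 followed by a cut into m clusters, using the shortest distance method (single linkage). The concrete distance measure (mean absolute gap between token positions over all occurrence pairs), the direct use of distance as dissimilarity, and the names `word_distances` and `single_linkage` are assumptions; the description only says that similarity is derived from how far apart two words appear.

```python
from itertools import combinations


def word_distances(tokens):
    """Mean absolute position gap between every pair of distinct words."""
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)
    dist = {}
    for a, b in combinations(sorted(positions), 2):
        gaps = [abs(i - j) for i in positions[a] for j in positions[b]]
        dist[(a, b)] = sum(gaps) / len(gaps)
    return dist


def single_linkage(words, dist, m):
    """Merge the closest clusters repeatedly until m clusters remain."""
    clusters = [{word} for word in sorted(words)]

    def cluster_distance(c1, c2):
        # Shortest distance method: distance of the closest pair of members.
        return min(dist[tuple(sorted((a, b)))] for a in c1 for b in c2)

    while len(clusters) > m:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

Merging bottom-up and stopping at m clusters gives the same partition as building the full tree and cutting it at m clusters, so changing m later does not require recomputing the distances.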

Next, based on the result of the hierarchical cluster analysis obtained in step S110, the CPU 21 generates screen data for displaying the analysis result (step S111). In step S111, the CPU 21 performs the process shown in FIG. 5.

The CPU 21 sets the number of groups to m and the maximum number of data in a group to n (step S201). Next, the CPU 21 sets the number of clusters to m for the result of the hierarchical cluster analysis and obtains m clusters (step S202). Next, for each cluster, the CPU 21 obtains the total appearance frequency of the words in the cluster (step S203). Next, based on the totals obtained in step S203, the CPU 21 determines the display size of each group (step S204). In step S204, the larger the total appearance frequency of the words in a cluster, the larger the display size of the group.

Next, for each cluster, the CPU 21 selects the words to be displayed from among the words in the cluster (step S205). In step S205, n or fewer words are selected from each cluster in descending order of appearance frequency. Next, for each word selected in step S205, the CPU 21 determines the display size of the word based on its appearance frequency (step S206). In step S206, the higher the appearance frequency of a word, the larger its display size.

Next, the CPU 21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207). The screen data generated in step S207 contains m groups (represented by cloud-shaped figures) having the sizes determined in step S204. Each group contains n or fewer words having the sizes determined in step S206. On the screen, the words are displayed inside their group. After executing step S207, the CPU 21 ends the screen data generation process.
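Steps S203 to S207 can be sketched as a single function that turns m clusters and a frequency table into a list of group records. The function name `build_screen_data`, the record layout, and the linear size rule `10 + 2 * value` are assumptions; the description only requires that sizes grow with frequency. The group name is taken as the most frequent word in the cluster, as stated for the analysis result screen 41.

```python
def build_screen_data(clusters, freq, n, size_of=lambda value: 10 + 2 * value):
    """Build one display record per cluster.

    clusters: list of non-empty sets of words (the m clusters from S202)
    freq:     dict mapping each word to its appearance frequency
    n:        maximum number of words shown per group
    size_of:  hypothetical monotonic mapping from frequency to display size
    """
    groups = []
    for cluster in clusters:
        total = sum(freq[word] for word in cluster)            # S203
        # Highest frequency first; ties broken alphabetically.
        ranked = sorted(cluster, key=lambda w: (-freq[w], w))
        shown = ranked[:n]                                     # S205
        groups.append({
            "name": ranked[0],                                 # most frequent word
            "size": size_of(total),                            # S204
            "words": [(w, size_of(freq[w])) for w in shown],   # S206
        })
    return groups                                              # S207
```

Because this function depends only on the clusters, the frequencies, and n, a change of m or n needs only a new cut of the existing tree and a new call, with no repeat of word extraction or cluster analysis.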

Next, the CPU 21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112). Next, the CPU 21 receives an instruction from the user (step S113). Next, the CPU 21 proceeds to one of steps S115 to S120 according to the type of instruction received in step S113 (step S114).

When the instruction received in step S113 is "set the number of groups", the CPU 21 proceeds to step S115. In this case, the CPU 21 sets the number of groups m to the value specified by the user (step S115) and proceeds to step S111. Screen data is then generated based on the newly set number of groups m, and a new screen is displayed. As a result, an analysis result screen containing the specified number of groups is displayed.

When the instruction received in step S113 is "set the maximum number of data in a group", the CPU 21 proceeds to step S116. In this case, the CPU 21 sets the maximum number of data in a group n to the value specified by the user (step S116) and proceeds to step S111. Screen data is then generated based on the newly set value of n, and a new screen is displayed. As a result, an analysis result screen is displayed in which the number of words in each group is limited to the specified value or fewer.

When the instruction received in step S113 is "set the analysis target period", the CPU 21 proceeds to step S117. In this case, the CPU 21 sets the analysis target period to the period specified by the user (step S117) and proceeds to step S109. Hierarchical cluster analysis is then performed with reference to the newly set analysis target period, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis of the words contained in the text data within the specified analysis target period.

FIG. 11A is a diagram showing an analysis result screen before the analysis target period is set. FIG. 11B is a diagram showing an analysis result screen after the analysis target period is set. The analysis result screen 61 before setting, shown in FIG. 11A, displays the result of hierarchical cluster analysis of the words contained in the portion of the input text data 5 dated from 0:00 on January 1, 2014 to 24:00 on December 31, 2015. The analysis result screen 62 after setting, shown in FIG. 11B, displays the result of hierarchical cluster analysis of the words contained in the portion of the input text data 5 dated from 0:00 on March 1, 2014 to 24:00 on September 30, 2014. The content displayed on the analysis result screen 61 differs from that displayed on the analysis result screen 62. By viewing the analysis result screens before and after setting the analysis target period, the user can easily recognize how the result of the hierarchical cluster analysis changes over time.

When the instruction received in step S113 is "word exclusion", the CPU 21 proceeds to step S118. In this case, the CPU 21 adds the specified word to the excluded word list (step S118) and proceeds to step S109. Hierarchical cluster analysis is then performed with the specified word excluded, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis performed with the specified word excluded.

FIG. 12A is a diagram showing an analysis result screen before word exclusion is performed. FIG. 12B is a diagram showing an analysis result screen after word exclusion is performed. The user operates the mouse 29 to select the word to be excluded and then instructs word exclusion. On the analysis result screen 63 before word exclusion, shown in FIG. 12A, "shakai (society)" is selected and "word exclusion" is selected from the menu. The result of hierarchical cluster analysis performed with "shakai" excluded is then displayed on the screen. On the analysis result screen 64 after word exclusion, shown in FIG. 12B, "shingaku (going on to higher education)" is displayed in place of "shakai". Among the words in the same cluster as "shakai", "shingaku" is the one with the next highest appearance frequency after the five words displayed on the analysis result screen 63.

When the instruction received in step S113 is "synonym registration", the CPU 21 proceeds to step S119. In this case, the CPU 21 adds the specified words to the synonym list in use (step S119) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the specified synonyms into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis performed with the specified words treated as synonyms.

FIG. 13A is a diagram showing an analysis result screen before synonym registration is performed. FIG. 13B is a diagram showing an analysis result screen after synonym registration is performed. The user operates the mouse 29 to select the plurality of words to be registered as synonyms and then instructs synonym registration. On the analysis result screen 65 before synonym registration, shown in FIG. 13A, "daigakusei (university student)" and "gakusei (student)" are selected and "synonym registration" is selected from the menu. The result of hierarchical cluster analysis performed with "daigakusei" and "gakusei" treated as synonyms is then displayed on the screen. On the analysis result screen 66 after synonym registration, shown in FIG. 13B, "daigakusei" is displayed larger than on the analysis result screen 65, and "shingaku (going on to higher education)" is displayed in place of "gakusei". The size of "daigakusei" now reflects the sum of the appearance frequencies of "daigakusei" and "gakusei", which is why it is larger than on the analysis result screen 65.

When the instruction received in step S113 is "compound word registration", the CPU 21 proceeds to step S120. In this case, the CPU 21 adds the indicated words to the compound word list in use (step S120) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the indicated compound word into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis in which the designated words are treated as a compound word.

FIG. 14A is a diagram showing an analysis result screen before compound word registration is performed. FIG. 14B is a diagram showing an analysis result screen after compound word registration is performed. The user operates the mouse 29 to select a plurality of words to be registered as a compound word, and then instructs compound word registration. On the analysis result screen 67 before compound word registration shown in FIG. 14A, "nintai (patience)" and "tsuyoi (strong)" are selected, and "compound word registration" is selected from the menu. The result of hierarchical cluster analysis in which "nintai" and "tsuyoi" are treated as a compound word is then displayed on the screen. On the analysis result screen 68 after compound word registration shown in FIG. 14B, "nintaizuyoi (very patient)" is displayed in place of "nintai" and "tsuyoi", in a size no larger than that of "nintai" and "tsuyoi".
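One way to picture compound word registration is as a pass over each tokenized document that joins registered component words when they appear next to each other. The sketch below is an assumption-laden illustration: the adjacency rule, the function name `merge_compounds`, and the sample token "kare" are hypothetical and are not taken from the description.

```python
def merge_compounds(tokens, compounds):
    """Replace adjacent runs of tokens that match a registered compound.

    tokens: ordered list of words from one document.
    compounds: mapping from a tuple of component words to the compound word.
    """
    out = []
    i = 0
    while i < len(tokens):
        for parts, joined in compounds.items():
            # Compare the next len(parts) tokens with the registered parts.
            if tuple(tokens[i:i + len(parts)]) == parts:
                out.append(joined)
                i += len(parts)
                break
        else:
            # No compound starts here, so keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out

compounds = {("nintai", "tsuyoi"): "nintaizuyoi"}
tokens = ["kare", "nintai", "tsuyoi", "nintai"]  # hypothetical sentence
merged_tokens = merge_compounds(tokens, compounds)
print(merged_tokens)  # ['kare', 'nintaizuyoi', 'nintai']
```

Under this assumption, every occurrence of the compound consumes one occurrence of each component, so the compound can never be more frequent than either component. That matches the statement that "nintaizuyoi" is shown no larger than "nintai" and "tsuyoi".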

As described above, the text mining method according to the present embodiment includes a text analysis step of performing hierarchical cluster analysis on words extracted from input text data, a screen generation step of generating screen data based on the analysis result of the text analysis step, and an analysis result display step of displaying a screen based on the screen data. Based on the number of groups (m) and the maximum number of data in a group (n), the screen generation step obtains m clusters from the analysis result and generates screen data for displaying, on the screen, groups each containing no more than n of the words included in the corresponding cluster. According to the text mining method of the present embodiment, groups containing the words included in the clusters are displayed on the screen based on the result of hierarchical cluster analysis of the words included in the text data. In addition, the number of words included in each group is limited to n or fewer. The user can therefore intuitively understand the result of the hierarchical cluster analysis when viewing the screen.
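The screen generation step summarized above can be sketched as cutting a merge history into m clusters and then keeping at most n words per cluster. This is only an illustration: the merge-list representation of the analysis result, the function names, and the sample data are assumptions. Cluster ids follow the convention also used by SciPy linkage matrices, where leaves are numbered 0 to N-1 and the k-th merge creates id N+k.

```python
def cut_into_clusters(words, merges, m):
    """Replay agglomerative merges until only m clusters remain.

    words: the leaf words, indexed 0..len(words)-1.
    merges: list of (a, b) cluster-id pairs in merge order; the k-th merge
    creates the cluster with id len(words) + k.
    """
    clusters = {i: [w] for i, w in enumerate(words)}
    next_id = len(words)
    for a, b in merges:
        if len(clusters) <= m:
            break  # stop as soon as m clusters are left
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return list(clusters.values())

def build_groups(words, merges, freq, m, n):
    """Return m groups, each holding the n most frequent words of a cluster."""
    groups = []
    for cluster in cut_into_clusters(words, merges, m):
        ranked = sorted(cluster, key=lambda w: -freq[w])
        groups.append(ranked[:n])
    return groups

# Hypothetical words, merge history, and counts.
words = ["daigakusei", "gakusei", "shingaku", "nintai", "tsuyoi"]
merges = [(0, 1), (3, 4), (5, 2), (7, 6)]
freq = {"daigakusei": 12, "gakusei": 8, "shingaku": 5, "nintai": 7, "tsuyoi": 9}

groups = build_groups(words, merges, freq, m=2, n=2)
print(groups)  # [['tsuyoi', 'nintai'], ['daigakusei', 'gakusei']]
```

Changing m or n only re-runs this step on the stored merge history, so the clustering itself would not need to be repeated under this representation.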

The words included in a group are selected, in descending order of occurrence frequency, from among the words included in the cluster corresponding to the group. The words of the cluster that occur most frequently are therefore displayed inside the group, and the user can easily recognize the high-frequency words of each cluster. Within the screen, each group has a size corresponding to the sum of the occurrence frequencies of the words included in the corresponding cluster, so the user can easily recognize clusters whose total occurrence frequency is large. Within the screen, each word included in a group has a size corresponding to its own occurrence frequency, so the user can easily recognize words that occur frequently.
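The sizing rules in this paragraph can be sketched as follows. The linear scaling, the point-size range, and the names `layout_sizes`, `area_share`, and `font_pt` are assumptions chosen for illustration; the description says only that sizes correspond to the frequencies. The group name uses the most frequent word of the cluster, as recited in the claims.

```python
def layout_sizes(clusters, freq, n, min_pt=10.0, max_pt=40.0):
    """Compute a relative area per group and a font size per displayed word.

    clusters: list of word lists, one per cluster.
    freq: mapping of word -> occurrence count.
    n: maximum number of words displayed per group.
    """
    total = sum(freq[w] for cluster in clusters for w in cluster)
    top = max(freq[w] for cluster in clusters for w in cluster)
    result = []
    for cluster in clusters:
        shown = sorted(cluster, key=lambda w: -freq[w])[:n]
        result.append({
            # The most frequent word names the group.
            "name": shown[0],
            # Group area follows the summed frequency of ALL cluster words,
            # including words that are not displayed.
            "area_share": sum(freq[w] for w in cluster) / total,
            # Word size follows the word's own frequency (linear scale assumed).
            "font_pt": {w: min_pt + (max_pt - min_pt) * freq[w] / top
                        for w in shown},
        })
    return result

# Hypothetical clusters and counts.
clusters = [["daigakusei", "gakusei", "shingaku"], ["nintai", "tsuyoi"]]
freq = {"daigakusei": 12, "gakusei": 8, "shingaku": 5, "nintai": 7, "tsuyoi": 9}
layout = layout_sizes(clusters, freq, n=2)
for group in layout:
    print(group["name"], round(group["area_share"], 3), group["font_pt"])
```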

The text mining method further includes an instruction input step for inputting an instruction from the user, and either the text analysis step or the screen generation step is executed based on the instruction input in the instruction input step. The display mode of the hierarchical cluster analysis result can therefore be switched in accordance with an instruction from the user. In particular, the instruction input step receives an instruction for setting the number of groups (m), and the screen generation step generates screen data based on the number of groups (m) designated in the instruction input step. The number of regions displayed on the screen (the number of clusters) can thus be switched in accordance with an instruction from the user. The instruction input step also receives the maximum number of data in a group (n), and the screen generation step generates screen data based on the maximum number of data in a group (n) designated in the instruction input step. The number of words displayed in each region can thus be switched in accordance with an instruction from the user.

The instruction input step also receives an instruction designating an analysis target period, and the text analysis step performs hierarchical cluster analysis on the words included in that part of the text data which falls within the analysis target period designated in the instruction input step. The screen therefore shows the result of hierarchical cluster analysis of the words included in the text data within the period designated by the user, and the user can easily recognize how the result of the hierarchical cluster analysis changes over time. The instruction input step further receives an instruction for setting an analysis purpose, and the text analysis step extracts, from the text data 5, words of a kind corresponding to the analysis purpose set in the instruction input step and performs hierarchical cluster analysis on them. The kind of word to be analyzed can thus be switched according to the analysis purpose designated by the user, and the result of the corresponding hierarchical cluster analysis can be displayed on the screen.
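The period and purpose settings both act as filters applied before clustering, which the following sketch illustrates. It assumes that the "kind" of word is a part-of-speech tag and that each record carries a date. The purpose names, the `PURPOSE_TO_POS` table, and the sample records are hypothetical.

```python
from datetime import date

# Hypothetical mapping from an analysis purpose to the word kinds it keeps.
PURPOSE_TO_POS = {
    "topics": {"noun"},
    "impressions": {"adjective"},
}

def select_words(records, start, end, purpose):
    """Collect the words to be clustered for one period and one purpose.

    records: iterable of (day, tokens), where tokens is a list of
    (word, part_of_speech) pairs produced by an earlier tokenizing step.
    start, end: inclusive bounds of the analysis target period.
    """
    wanted = PURPOSE_TO_POS[purpose]
    words = []
    for day, tokens in records:
        if start <= day <= end:  # keep only text inside the period
            words.extend(w for w, pos in tokens if pos in wanted)
    return words

records = [
    (date(2016, 7, 1), [("gakusei", "noun"), ("tsuyoi", "adjective")]),
    (date(2016, 8, 1), [("nintai", "noun")]),
    (date(2017, 1, 1), [("shingaku", "noun")]),
]
selected = select_words(records, date(2016, 7, 1), date(2016, 12, 31), "topics")
print(selected)  # ['gakusei', 'nintai']
```

Running the same selection for successive periods and comparing the resulting screens is one way the temporal change mentioned above could be observed.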

The instruction input step also receives a word exclusion instruction, and the text analysis step performs hierarchical cluster analysis with the words indicated in the instruction input step excluded. The result of hierarchical cluster analysis that omits the words indicated by the user can thus be displayed. The instruction input step also receives a synonym registration instruction, and the text analysis step performs hierarchical cluster analysis while regarding the plurality of words indicated in the instruction input step as the same word. The result of hierarchical cluster analysis in which the words indicated by the user are regarded as the same word can thus be displayed on the screen. The instruction input step further receives a compound word registration instruction, and the text analysis step performs hierarchical cluster analysis after merging the plurality of words designated in the instruction input step into one word. The result of hierarchical cluster analysis in which the words indicated by the user are merged into one word can thus be displayed on the screen.

The screen generation step generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen. The analysis result screen and the analysis setting screen are therefore both displayed, and the user can easily switch the display mode of the hierarchical cluster analysis result by using the analysis setting screen.

The text mining program 31 according to the present embodiment and the text mining device 10 according to the present embodiment have the same configuration as the text mining method according to the present embodiment and provide the same effects.

According to the text mining method, text mining program, and text mining device of the present embodiment, groups each containing no more than the maximum number of data of the words included in a cluster are displayed on the screen, based on the result of hierarchical cluster analysis of the words included in the text data. The user can therefore intuitively understand the result of the hierarchical cluster analysis when viewing the screen.

This application claims priority based on Japanese Patent Application No. 2016-145065, entitled "Text mining method, text mining program, and text mining device" and filed on July 25, 2016, the contents of which are incorporated herein by reference.

5: Text data
10: Text mining device
11: Instruction input unit
12: Text analysis unit
13: Screen generation unit
14: Analysis result display unit
20: Computer
21: CPU
22: Main memory
24: Input unit
25: Display unit
30: Recording medium
31: Text mining program
40: Display screen
41, 61 to 68: Analysis result screen
42: Analysis setting screen
51: Data designation screen
52: Purpose designation screen
53: Synonym list selection screen
54: Compound word list selection screen

Claims

A text mining method for displaying an analysis result of text data on a screen, the method comprising:
a text analysis step of performing hierarchical cluster analysis on words extracted from input text data;
a screen generation step of generating screen data based on the analysis result of the text analysis step; and
an analysis result display step of displaying a screen based on the screen data,
wherein the screen generation step obtains, based on a number of groups and a maximum number of data in a group, clusters equal in number to the number of groups from the analysis result, and generates screen data for displaying on the screen a group that includes no more than the maximum number of data of the words included in the cluster, and
wherein the word having the highest occurrence frequency among the words included in the cluster is assigned to the group as its name.

The method of claim 1,
wherein the words included in the group are selected, in descending order of occurrence frequency, from among the words included in the cluster corresponding to the group.

The method of claim 2,
wherein the group has, within the screen, a size corresponding to the sum of the occurrence frequencies of the words included in the cluster corresponding to the group.

The method of claim 3,
wherein each word included in the group has, within the screen, a size corresponding to the occurrence frequency of the word.

The method of claim 1,
further comprising an instruction input step for inputting an instruction from a user,
wherein either the text analysis step or the screen generation step is executed based on the instruction input in the instruction input step.

The method of claim 5,
wherein the instruction input step receives an instruction for setting the number of groups, and
the screen generation step generates the screen data based on the number of groups set in the instruction input step.

The method of claim 5,
wherein the instruction input step receives an instruction for setting the maximum number of data, and
the screen generation step generates the screen data based on the maximum number of data set in the instruction input step.

The method of claim 5,
wherein the instruction input step receives an instruction for setting an analysis target period, and
the text analysis step performs the hierarchical cluster analysis on words included in the text data that falls within the analysis target period set in the instruction input step.

The method of claim 5,
wherein the instruction input step receives an instruction for setting an analysis purpose, and
the text analysis step extracts, from the text data, words of a kind corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis.

The method of claim 5,
wherein the instruction input step receives a word exclusion instruction, and
the text analysis step performs the hierarchical cluster analysis with the words indicated in the instruction input step excluded.

The method of claim 5,
wherein the instruction input step receives a synonym registration instruction, and
the text analysis step performs the hierarchical cluster analysis while regarding the plurality of words indicated in the instruction input step as the same word.

The method of claim 5,
wherein the instruction input step receives a compound word registration instruction, and
the text analysis step performs the hierarchical cluster analysis after merging the plurality of words indicated in the instruction input step into one word.

The method of claim 1,
wherein the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.

A text mining program stored in a recording medium for displaying an analysis result of text data on a screen, the program causing a CPU of a computer to execute, using a memory:
a text analysis step of performing hierarchical cluster analysis on words extracted from input text data;
a screen generation step of generating screen data based on the analysis result of the text analysis step; and
an analysis result display step of displaying a screen based on the screen data,
wherein the screen generation step obtains, based on a number of groups and a maximum number of data in a group, clusters equal in number to the number of groups from the analysis result, and generates screen data for displaying on the screen a group that includes no more than the maximum number of data of the words included in the cluster, and
wherein the word having the highest occurrence frequency among the words included in the cluster is assigned to the group as its name.

The text mining program stored in a recording medium of claim 14,
wherein the words included in the group are selected, in descending order of occurrence frequency, from among the words included in the cluster corresponding to the group.

The text mining program stored in a recording medium of claim 15,
wherein the group has, within the screen, a size corresponding to the sum of the occurrence frequencies of the words included in the cluster corresponding to the group.

The text mining program stored in a recording medium of claim 16,
wherein each word included in the group has, within the screen, a size corresponding to the occurrence frequency of the word.

The text mining program stored in a recording medium of claim 14,
wherein the program further causes the computer to execute an instruction input step for inputting an instruction from a user, and
either the text analysis step or the screen generation step is executed based on the instruction input in the instruction input step.

The text mining program stored in a recording medium of claim 14,
wherein the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.

A text mining device for displaying an analysis result of text data on a screen, the device comprising:
a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data;
a screen generation unit that generates screen data based on the analysis result of the text analysis unit; and
an analysis result display unit that displays a screen based on the screen data,
wherein the screen generation unit obtains, based on a number of groups and a maximum number of data in a group, clusters equal in number to the number of groups from the analysis result, and generates screen data for displaying on the screen a group that includes no more than the maximum number of data of the words included in the cluster, and
wherein the word having the highest occurrence frequency among the words included in the cluster is assigned to the group as its name.

The text mining device of claim 20,
wherein the words included in the group are selected, in descending order of occurrence frequency, from among the words included in the cluster corresponding to the group.

The text mining device of claim 21,
wherein the group has, within the screen, a size corresponding to the sum of the occurrence frequencies of the words included in the cluster corresponding to the group.

The text mining device of claim 22,
wherein each word included in the group has, within the screen, a size corresponding to the occurrence frequency of the word.

The text mining device of claim 20,
further comprising an instruction input unit for inputting an instruction from a user,
wherein either the text analysis unit or the screen generation unit operates based on the instruction input from the instruction input unit.

The text mining device of claim 20,
wherein the screen generation unit generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.