TW201807597A

TW201807597A - Text mining method, text mining program, and text mining apparatus

Info

Publication number: TW201807597A
Application number: TW106122011A
Authority: TW
Inventors: 秋田正史; 中村康則; 周景龍
Original assignee: 斯庫林集團股份有限公司
Priority date: 2016-07-25
Filing date: 2017-06-30
Publication date: 2018-03-01
Also published as: TWI686716B; CN109478191B; KR20190018480A; JP6794162B2; JP2018018118A; WO2018020842A1; KR102180487B1; CN109478191A

Abstract

In text analysis steps (S109-S110), hierarchical cluster analysis is carried out on words extracted from input text data. In a screen generation step (S111), m clusters are obtained from the analysis result of the text analysis steps on the basis of the number of groups m and the maximum number n of data items per group, and screen data is generated for displaying, on a screen, groups each containing no more than n of the words in the corresponding cluster. In an analysis result display step (S112), the screen is displayed on the basis of the generated screen data. The result of the hierarchical cluster analysis is thus displayed on the screen in a form the user can understand intuitively.

Description

Text mining method, computer-readable recording medium on which a text mining program is recorded, and text mining apparatus

The present invention relates to text mining, and in particular to a text mining method, a text mining program, and a text mining apparatus that display the analysis results of text data on a screen.

In recent years, text mining, which analyzes large volumes of free-form text data and derives useful information from the analysis results, has attracted attention. In text mining, for example, words are extracted from the text data under analysis, and information is obtained by analyzing the words' frequency of occurrence, occurrence trends, and the like.

The following considers a text mining apparatus that performs hierarchical cluster analysis on words extracted from text data and displays the analysis results on a screen. In hierarchical cluster analysis, clusters containing highly similar words are built up hierarchically according to the similarity between words. In general, the result of hierarchical cluster analysis is presented to the user (analyst) as a dendrogram (tree diagram) such as the one shown in FIG. 15.
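As a rough illustration of the hierarchical cluster analysis described above, the following Python sketch builds a dendrogram bottom-up from word similarities. The similarity measure (Jaccard distance over the sets of documents in which each word occurs), the average-linkage rule, the sample words, and all names such as `agglomerate` are assumptions made for this example; the specification does not state which measure or linkage is used.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two words, each given as the set of document ids it occurs in."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def agglomerate(occurrences):
    """Bottom-up clustering that returns the merge history (the dendrogram).

    `occurrences` maps word -> set of document ids (an assumed input shape).
    Each history entry is (cluster_a, cluster_b, distance); clusters are sorted tuples.
    """
    clusters = [(w,) for w in sorted(occurrences)]
    history = []

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(occurrences[x], occurrences[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters and record the merge.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        history.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [tuple(sorted(c1 + c2))]
    return history


# Hypothetical sample data: each word with the ids of the documents containing it.
occ = {
    "price": {1, 2, 3},
    "cost": {1, 2, 3},
    "screen": {4, 5},
    "display": {4, 5, 6},
}
history = agglomerate(occ)
for left, right, dist in history:
    print(left, right, round(dist, 3))
```

Each entry of the merge history corresponds to one junction of the dendrogram; the most similar words ("cost" and "price" in this sample) are joined first.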

In connection with the present invention, Patent Document 1 describes a clustering device having a hierarchical clustering means that constructs a dendrogram, traverses the dendrogram to generate an index that allows identification from lower layers to upper layers, and stores the index in a storage means. Patent Document 2 describes a query providing device having: a distance matrix calculation means that calculates the distances between keywords, generates distance matrix data from which keywords and the distances between them can be looked up, and stores the data in a storage means; and a clustering means that uses the distance matrix to cluster the keywords hierarchically and stores the result in the storage means as a bottom-up index with which the constructed dendrogram can be traversed from lower layers to upper layers.

[Prior Art Documents] [Patent Documents]

[Patent Document 1] Japanese Patent Laid-Open No. 2011-216021

[Patent Document 2] Japanese Patent Laid-Open No. 2012-150539

A conventional text mining apparatus displays the result of hierarchical cluster analysis on the screen as a dendrogram. With such an apparatus, however, the user cannot understand the analysis result intuitively. For example, when the user sets the number of clusters to 4 for the analysis result shown in FIG. 15, a cutting line is set on the dendrogram as shown in FIG. 16. Merely looking at such a dendrogram, however, the user cannot intuitively recognize which words each cluster contains. Moreover, when the number of words is large and the user changes the number of clusters, the user cannot intuitively grasp how the words contained in each cluster change.

In addition, because the dendrogram does not show how often each word occurs, the user cannot tell which words are more important. Furthermore, when the text data under analysis is time-series data carrying information such as dates or times, the user may want to know how the analysis result changes over time. A conventional text mining apparatus cannot meet this need.

An object of the present invention is therefore to provide a text mining method, a text mining program, and a text mining apparatus that display the result of hierarchical cluster analysis on a screen in a form the user can understand intuitively.

A first aspect of the present invention is a text mining method for displaying the analysis results of text data on a screen, characterized by comprising: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data items per group, and generates screen data for displaying on the screen groups each containing no more than the maximum number of data items of the words included in the corresponding cluster.
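To make the step of obtaining as many clusters as the number of groups more concrete, here is a minimal Python sketch that replays a dendrogram's merges until m clusters remain, which has the same effect as drawing a cutting line across the tree. The function name `cut_dendrogram`, the merge order, and the use of W1 to W6 as placeholder words are assumptions for illustration and are not taken from FIG. 15.

```python
def cut_dendrogram(words, merges, m):
    """Replay merges, most similar first, until only m clusters remain."""
    clusters = {frozenset([w]) for w in words}
    for left, right in merges:
        if len(clusters) <= m:
            break
        left, right = frozenset(left), frozenset(right)
        # Replace the two merged clusters with their union.
        clusters -= {left, right}
        clusters.add(left | right)
    return sorted(sorted(c) for c in clusters)


# Hypothetical merge history for six placeholder words.
words = ["W1", "W2", "W3", "W4", "W5", "W6"]
merges = [
    (["W1"], ["W2"]),
    (["W4"], ["W5"]),
    (["W1", "W2"], ["W3"]),
    (["W4", "W5"], ["W6"]),
    (["W1", "W2", "W3"], ["W4", "W5", "W6"]),
]
print(cut_dendrogram(words, merges, m=4))
# [['W1', 'W2'], ['W3'], ['W4', 'W5'], ['W6']]
print(cut_dendrogram(words, merges, m=2))
# [['W1', 'W2', 'W3'], ['W4', 'W5', 'W6']]
```

Changing m from 4 to 2 shows directly how cluster membership changes, which is the information a bare dendrogram makes hard to read.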

A second aspect of the present invention is characterized in that, in the first aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.

A third aspect of the present invention is characterized in that, in the second aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.

A fourth aspect of the present invention is characterized in that, in the third aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.
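The second to fourth aspects can be sketched together as follows. This is only an illustration under stated assumptions: the text says sizes "correspond to" frequencies, and the sketch assumes simple proportionality; `layout_sizes`, `total_area`, and the sample words and counts are hypothetical.

```python
def layout_sizes(clusters, freq, n, total_area=1000.0):
    """Pick at most n words per cluster by frequency and derive display sizes.

    Group area is proportional to the summed frequency of all words in the
    cluster; each displayed word gets a relative size equal to its share of
    that sum. Proportional scaling is an assumption of this sketch.
    """
    grand_total = sum(freq[w] for c in clusters for w in c)
    layout = []
    for cluster in clusters:
        cluster_total = sum(freq[w] for w in cluster)
        # Most frequent first; ties are broken alphabetically for determinism.
        shown = sorted(cluster, key=lambda w: (-freq[w], w))[:n]
        layout.append({
            "area": total_area * cluster_total / grand_total,
            "word_sizes": {w: freq[w] / cluster_total for w in shown},
        })
    return layout


# Hypothetical frequencies and clusters.
freq = {"price": 40, "cost": 30, "fee": 10, "screen": 15, "display": 5}
clusters = [["cost", "fee", "price"], ["display", "screen"]]
layout = layout_sizes(clusters, freq, n=2)
for group in layout:
    print(group)
```

With n = 2 the first group shows only "price" and "cost", yet its area still reflects the whole cluster, including the hidden word "fee".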

A fifth aspect of the present invention is characterized in that, in the first aspect, the method further comprises an instruction input step of receiving an instruction from a user, and either the text analysis step or the screen generation step is executed in accordance with the instruction received in the instruction input step.

A sixth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the number of groups, and the screen generation step generates the screen data based on the number of groups set in the instruction input step.

A seventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the maximum number of data items, and the screen generation step generates the screen data based on the maximum number of data items set in the instruction input step.

An eighth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words included in that portion of the text data which falls within the analysis target period set in the instruction input step.
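Restricting the analysis to a target period, as in the eighth aspect, amounts to filtering the time-stamped text records before words are extracted. The sketch below assumes each record is a dictionary with `date` and `text` keys and that both ends of the period are inclusive; neither detail is specified in the text.

```python
from datetime import date


def within_period(records, start, end):
    """Keep records dated within [start, end], both ends inclusive (an assumption)."""
    return [r for r in records if start <= r["date"] <= end]


# Hypothetical time-series text data.
records = [
    {"date": date(2016, 5, 10), "text": "delivery was late"},
    {"date": date(2016, 6, 2), "text": "screen is bright"},
    {"date": date(2016, 7, 25), "text": "price is fair"},
    {"date": date(2016, 8, 1), "text": "support was helpful"},
]
selected = within_period(records, date(2016, 6, 1), date(2016, 7, 31))
print([r["text"] for r in selected])
# ['screen is bright', 'price is fair']
```

Running the analysis once per period and comparing the resulting screens is one way to observe how the clusters change over time.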

A ninth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis purpose, and the text analysis step extracts from the text data words of the type corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis on them.

A tenth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words designated in the instruction input step excluded.

An eleventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis treating the plurality of words designated in the instruction input step as the same word.

A twelfth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis after merging the plurality of words designated in the instruction input step into a single word.
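Word exclusion, synonym registration, and compound word registration can all be viewed as transformations of the extracted word sequence before clustering. The following Python sketch shows one possible ordering (compounds first, then synonyms, then exclusion); that ordering, the `preprocess` function, its parameters, and the `joiner` argument are assumptions made for this example only.

```python
def preprocess(tokens, excluded=(), synonyms=None, compounds=(), joiner=""):
    """Apply compound merging, synonym unification, and exclusion to a token list.

    compounds: sequences of adjacent tokens to merge into one word.
    synonyms:  mapping from a word to the representative word it is treated as.
    excluded:  words to drop from the analysis.
    """
    synonyms = synonyms or {}
    merged = []
    i = 0
    while i < len(tokens):
        for parts in compounds:
            k = len(parts)
            # Merge when the next k tokens match a registered compound.
            if k and tuple(tokens[i:i + k]) == tuple(parts):
                merged.append(joiner.join(parts))
                i += k
                break
        else:
            merged.append(tokens[i])
            i += 1
    unified = [synonyms.get(t, t) for t in merged]
    return [t for t in unified if t not in excluded]


# Hypothetical English tokens; a space joiner suits English, "" suits Japanese.
tokens = ["customer", "service", "was", "slow", "and", "support", "sluggish"]
result = preprocess(
    tokens,
    excluded={"was", "and"},
    synonyms={"sluggish": "slow"},
    compounds=[("customer", "service")],
    joiner=" ",
)
print(result)
# ['customer service', 'slow', 'support', 'slow']
```

After preprocessing, "slow" is counted twice, so a synonym registration raises the frequency of the representative word that later determines its display size.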

A thirteenth aspect of the present invention is characterized in that, in the first aspect, the screen generation step generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen.

A fourteenth aspect of the present invention is a computer-readable recording medium on which is recorded a text mining program for displaying the analysis results of text data on a screen, characterized in that the program causes a computer to execute, by a CPU (central processing unit) using a memory, the following steps: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data items per group, and generates screen data for displaying on the screen groups each containing no more than the maximum number of data items of the words included in the corresponding cluster.

A fifteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.

A sixteenth aspect of the present invention is characterized in that, in the fifteenth aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.

A seventeenth aspect of the present invention is characterized in that, in the sixteenth aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.

An eighteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the program further causes the computer to execute an instruction input step of receiving an instruction from a user, and either the text analysis step or the screen generation step is executed in accordance with the instruction received in the instruction input step.

A nineteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the screen generation step generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen.

A twentieth aspect of the present invention is a text mining apparatus for displaying the analysis results of text data on a screen, characterized by comprising: a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on the analysis result of the text analysis unit; and an analysis result display unit that displays a screen based on the screen data, wherein the screen generation unit obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data items per group, and generates screen data for displaying on the screen groups each containing no more than the maximum number of data items of the words included in the corresponding cluster.

A twenty-first aspect of the present invention is characterized in that, in the twentieth aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.

A twenty-second aspect of the present invention is characterized in that, in the twenty-first aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.

A twenty-third aspect of the present invention is characterized in that, in the twenty-second aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.

A twenty-fourth aspect of the present invention is characterized in that, in the twentieth aspect, the apparatus further comprises an instruction input unit that receives an instruction from a user, and either the text analysis unit or the screen generation unit operates in accordance with the instruction received by the instruction input unit.

A twenty-fifth aspect of the present invention is characterized in that, in the twentieth aspect, the screen generation unit generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen.

According to the first, fourteenth, or twentieth aspect of the present invention, groups containing the words included in the clusters are displayed on the screen, based on the result of hierarchical cluster analysis of the words included in the text data. In addition, the number of words in each group is limited to no more than the maximum number of data items. The user can therefore understand the result of the hierarchical cluster analysis intuitively on viewing the screen.

According to the second, fifteenth, or twenty-first aspect of the present invention, the most frequently occurring words among those included in a cluster are displayed inside the group. The user can therefore easily recognize the frequently occurring words contained in each cluster.

According to the third, sixteenth, or twenty-second aspect of the present invention, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in its cluster. The user can therefore easily recognize the clusters whose total word frequency is large.

According to the fourth, seventeenth, or twenty-third aspect of the present invention, each word has, within the screen, a size corresponding to its frequency of occurrence. The user can therefore easily recognize frequently occurring words.

According to the fifth, eighteenth, or twenty-fourth aspect of the present invention, the display mode of the result of the hierarchical cluster analysis can be switched in response to an instruction from the user.

According to the sixth aspect of the present invention, the number of groups (number of clusters) displayed on the screen can be switched in response to an instruction from the user.

According to the seventh aspect of the present invention, the upper limit on the number of words included in each group can be switched in response to an instruction from the user.

According to the eighth aspect of the present invention, the result of hierarchical cluster analysis of the words included in the text data within the analysis target period designated by the user is displayed on the screen. The user can therefore easily recognize how the result of the hierarchical cluster analysis changes over time.

According to the ninth aspect of the present invention, the type of word to be analyzed can be switched according to the analysis purpose designated by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the tenth aspect of the present invention, the words designated by the user can be excluded, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the eleventh aspect of the present invention, the plurality of words designated by the user can be treated as the same word, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the twelfth aspect of the present invention, the plurality of words designated by the user can be merged into a single word, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the thirteenth, nineteenth, or twenty-fifth aspect of the present invention, the analysis result screen and the analysis setting screen are displayed. The user can therefore use the analysis setting screen to switch the display mode of the result of the hierarchical cluster analysis easily.

5‧‧‧Text data

10‧‧‧Text mining apparatus

11‧‧‧Instruction input unit

12‧‧‧Text analysis unit

13‧‧‧Screen generation unit

14‧‧‧Analysis result display unit

20‧‧‧Computer

21‧‧‧CPU

22‧‧‧Main memory

23‧‧‧Storage unit

24‧‧‧Input unit

25‧‧‧Display unit

26‧‧‧Communication unit

27‧‧‧Recording medium reading unit

28‧‧‧Keyboard

29‧‧‧Mouse

30‧‧‧Recording medium

31‧‧‧Text mining program

40‧‧‧Display screen

41, 61~68‧‧‧Analysis result screen

42‧‧‧Analysis setting screen

51‧‧‧Data designation screen

52‧‧‧Purpose designation screen

53‧‧‧Synonym list selection screen

54‧‧‧Compound word list selection screen

m‧‧‧Number of groups (number of clusters)

n‧‧‧Maximum number of data items per group

W1~W6‧‧‧Word

FIG. 1 is a block diagram showing the configuration of a text mining apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining apparatus shown in FIG. 1.

FIG. 3 is a diagram showing a display screen of the text mining apparatus shown in FIG. 1.

FIG. 4 is a flowchart showing the operation of the text mining apparatus shown in FIG. 1.

FIG. 5 is a flowchart of the screen data generation process of the text mining apparatus shown in FIG. 1.

FIG. 6 is a diagram showing a data designation screen of the text mining apparatus shown in FIG. 1.

FIG. 7 is a diagram showing an example of text data input to the text mining apparatus shown in FIG. 1.

FIG. 8 is a diagram showing a purpose designation screen of the text mining apparatus shown in FIG. 1.

FIG. 9 is a diagram showing a synonym list selection screen of the text mining apparatus shown in FIG. 1.

FIG. 10 is a diagram showing a compound word list selection screen of the text mining apparatus shown in FIG. 1.

FIG. 11A is a diagram showing an analysis result screen before an analysis target period is set in the text mining apparatus shown in FIG. 1.

FIG. 11B is a diagram showing an analysis result screen after an analysis target period is set in the text mining apparatus shown in FIG. 1.

FIG. 12A is a diagram showing an analysis result screen before word exclusion is performed in the text mining apparatus shown in FIG. 1.

FIG. 12B is a diagram showing an analysis result screen after word exclusion is performed in the text mining apparatus shown in FIG. 1.

FIG. 13A is a diagram showing an analysis result screen before synonym registration is performed in the text mining apparatus shown in FIG. 1.

FIG. 13B is a diagram showing an analysis result screen after synonym registration is performed in the text mining apparatus shown in FIG. 1.

FIG. 14A is a diagram showing an analysis result screen before compound word registration is performed in the text mining apparatus shown in FIG. 1.

FIG. 14B is a diagram showing an analysis result screen after compound word registration is performed in the text mining apparatus shown in FIG. 1.

FIG. 15 is a diagram showing an example of a dendrogram.

FIG. 16 is a diagram showing the number of clusters being set on the dendrogram shown in FIG. 15.

FIG. 17 is a diagram showing the words that appear in the drawings and their description.

A text mining method, a text mining program, and a text mining apparatus according to an embodiment of the present invention are described below with reference to the drawings. The text mining method of this embodiment is typically executed using a computer. The text mining program of this embodiment is a program for carrying out the text mining method using a computer. The text mining apparatus of this embodiment is typically built using a computer. A computer that executes the text mining program functions as the text mining apparatus.

FIG. 1 is a block diagram showing the configuration of the text mining apparatus according to the embodiment of the present invention. The text mining apparatus 10 shown in FIG. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14. Text data 5 to be analyzed is input to the text mining apparatus 10. The text mining apparatus 10 performs hierarchical cluster analysis on words extracted from the input text data 5 and displays the analysis result on a screen.

The operation of the text mining apparatus 10 is outlined as follows. Instructions from the user are input to the instruction input unit 11. The text analysis unit 12 extracts words from the input text data 5 and performs hierarchical cluster analysis on the extracted words. The screen generation unit 13 generates screen data based on the analysis result of the text analysis unit 12. The analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.

The instructions from the user that are input to the instruction input unit 11 include setting the number of groups, setting the maximum number of data items per group, setting the analysis target period, word exclusion, synonym registration, compound word registration, and the like. When the text data 5 is time-series data carrying information such as dates or times, the text analysis unit 12 performs hierarchical cluster analysis on the words included in that portion of the input text data 5 which falls within the analysis target period set via the instruction input unit 11.

When generating screen data, the screen generation unit 13 follows the number of groups and the maximum number of data items per group (details are given later). When the user inputs a new instruction, the instructed processing is carried out, after which the screen generation unit 13 generates new screen data and the analysis result display unit 14 displays a new screen. In this way, the text mining apparatus 10 switches the analysis mode of the text data 5 and the display mode of the analysis result in response to instructions from the user.
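One way to read the behavior described above is that each kind of user instruction determines which processing must be redone before the screen is refreshed: instructions about the number of groups or the maximum number of data items only affect screen generation, whereas instructions about the period, analysis purpose, word exclusion, synonyms, or compound words affect the text analysis itself. The Python sketch below encodes that reading; the instruction names, stage names, and the `stages_to_run` function are hypothetical and not taken from the specification.

```python
# Hypothetical instruction kinds, grouped by the stage they affect.
ANALYSIS_INSTRUCTIONS = {"period", "purpose", "exclude_word", "synonym", "compound_word"}
SCREEN_INSTRUCTIONS = {"group_count", "max_items"}


def stages_to_run(kind):
    """Return the stages to rerun, in order, for one user instruction."""
    if kind in ANALYSIS_INSTRUCTIONS:
        # The analysis input changed: re-analyze, then regenerate and redisplay.
        return ["text_analysis", "screen_generation", "display"]
    if kind in SCREEN_INSTRUCTIONS:
        # Only the presentation changed: the existing analysis result is reused.
        return ["screen_generation", "display"]
    raise ValueError(f"unknown instruction kind: {kind}")


print(stages_to_run("group_count"))
# ['screen_generation', 'display']
print(stages_to_run("synonym"))
# ['text_analysis', 'screen_generation', 'display']
```

Reusing the stored analysis result when only m or n changes is a design choice of this sketch; an implementation could equally rerun every stage on each instruction.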

FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining apparatus 10. The computer 20 shown in FIG. 2 includes a CPU (Central Processing Unit) 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27. The main memory 22 is, for example, DRAM (Dynamic Random Access Memory). The storage unit 23 is, for example, a hard disk or a solid state drive. The input unit 24 includes, for example, a keyboard 28 and a mouse 29. The display unit 25 is, for example, a liquid crystal display. The communication unit 26 is an interface circuit for wired or wireless communication. The recording medium reading unit 27 is an interface circuit for a recording medium 30 on which programs and the like are stored. The recording medium 30 is a non-transitory recording medium such as a CD-ROM (Compact Disc Read-Only Memory), a DVD-ROM (Digital Versatile Disc Read-Only Memory), or a USB (Universal Serial Bus) memory.

When the computer 20 executes the text mining program 31, the storage unit 23 stores the text mining program 31 and the text data 5. The text mining program 31 and the text data 5 may, for example, be received from a server or another computer via the communication unit 26, or be read from the recording medium 30 via the recording medium reading unit 27.

When the text exploration program 31 is executed, the text exploration program 31 and the text data 5 are copied to the main memory 22. The CPU 21 uses the main memory 22 as working memory and processes the text data 5 stored in the main memory 22 by executing the text exploration program 31 stored in the main memory 22. At this time, the computer 20 functions as the text exploration device 10. The configuration of the computer 20 described above is merely an example, and any computer may be used to configure the text exploration device 10.

In the following, the text data 5 is assumed to be Japanese data containing Japanese words. FIG. 17 is a diagram showing the words that appear in the drawings and their description. Each row of FIG. 17 lists a word (a Japanese word) and its meaning. In the following description, when a Japanese word is mentioned, its meaning may be given in parentheses after the word. The text data 5 may also be data in any language.

FIG. 3 is a diagram showing a display screen of the text exploration device 10. The display screen 40 shown in FIG. 3 includes an analysis result screen 41 and an analysis setting screen 42. The analysis result screen 41 displays the analysis result of the text analysis unit 12. The analysis setting screen 42 displays GUI (Graphical User Interface) elements for setting the analysis mode of the text analysis unit 12 and the characteristics of the screen data generated by the screen generation unit 13.

Setting the number of clusters for the result of a hierarchical cluster analysis determines the words included in each cluster. When displaying on the screen the result of hierarchical cluster analysis of the words extracted from the text data 5, the text exploration device 10 displays groups corresponding to the clusters in the manner shown in FIG. 3, instead of a tree diagram (dendrogram).

In the following description, a cluster displayed on the screen is also called a group. The user uses the instruction input unit 11 to specify the number of groups (the number of clusters) and the maximum number of data in a group (the upper limit on the number of words included in a group). Hereinafter, the former is denoted m and the latter n.

In the text exploration device 10, the words included in the text data 5 are classified into m clusters, and each cluster includes one or more words. The analysis result screen 41 displays m groups, with words displayed inside each group. Each group is displayed as a cloud-shaped graphic, and each word included in a group is displayed inside an elliptical region. The number of words displayed in each group is limited to n or fewer. For example, when n = 5 and a cluster includes 10 words, 5 words are displayed inside the group on the analysis result screen 41.

The analysis setting screen 42 displays a first slider and two first buttons (marked "+" or "-") for setting the number of groups m, a second slider and two second buttons for setting the maximum number of data n in a group, and four boxes and two third buttons (marked with a left arrow or a right arrow) for setting the analysis target period.

By operating the mouse 29, the user indicates the number of groups m by moving the thumb of the first slider left or right or by pressing a first button. The number of groups m increases when the first button marked "+" is pressed and decreases when the first button marked "-" is pressed. The initial value of the number of groups m is set, for example, to the square root of the number of distinct words included in the analysis result of the text analysis unit 12, or to an integer close to that square root. For example, when the analysis result of the text analysis unit 12 includes 16 distinct words, the initial value of the number of groups m is set to 4.
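The initial value described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `initial_group_count` is hypothetical, and rounding to the nearest integer is one assumed way of picking "an integer close to" the square root, which the embodiment does not specify.

```python
import math


def initial_group_count(distinct_word_count: int) -> int:
    """Return an integer close to the square root of the number of distinct words."""
    if distinct_word_count < 1:
        raise ValueError("at least one distinct word is required")
    # Assumption: round to the nearest integer and never go below one group.
    return max(1, round(math.sqrt(distinct_word_count)))
```

With 16 distinct words this yields 4, matching the example in the text.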

By operating the mouse 29, the user indicates the maximum number of data n in a group by moving the thumb of the second slider left or right or by pressing a second button. The maximum number of data n in a group increases or decreases when a second button is pressed. The initial value of the maximum number of data n in a group is set, for example, to 5.

When the text data 5 is time-series data, the user indicates the analysis target period by operating the keyboard 28 or the mouse 29 to specify the date and time in the four boxes or by pressing a third button. When the third button marked with a left arrow is pressed, the analysis target period moves toward the past by a predetermined amount (for example, one month); when the third button marked with a right arrow is pressed, it moves by the predetermined amount in the opposite direction. The initial value of the analysis target period is set, for example, to the period from the oldest time to the newest time in the text data 5. When the text data 5 is not time-series data, the user cannot specify the analysis target period.
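The behavior of the third buttons and the initial period can be illustrated with the following Python sketch. The function names (`shift_months`, `shift_period`, `default_period`) are hypothetical, the one-month step is the example amount from the text, and clamping the day to the last day of a shorter month is an assumption that the embodiment does not address.

```python
import calendar
from datetime import datetime


def shift_months(moment: datetime, months: int) -> datetime:
    """Move a timestamp by a whole number of months (negative means toward the past)."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    # Assumption: clamp the day when the target month is shorter (e.g. March 31 -> February 28).
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def shift_period(start: datetime, end: datetime, months: int):
    """Shift both ends of the analysis target period by the same amount."""
    return shift_months(start, months), shift_months(end, months)


def default_period(timestamps):
    """Initial period: from the oldest to the newest timestamp in the text data."""
    return min(timestamps), max(timestamps)
```

Pressing the left-arrow button would correspond to `shift_period(start, end, -1)` and the right-arrow button to `shift_period(start, end, 1)`.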

The analysis result screen 41 displays at least one and at most m groups, and at least one and at most n words inside each group. On the screen, a group is displayed larger as the total appearance frequency of the words included in the corresponding cluster increases. When a cluster includes more than n words, the n words with the highest appearance frequencies are displayed inside the group. The words included in a group, and the elliptical regions containing them, are displayed larger on the screen as the appearance frequency of the word increases. Each group is labeled with a name. The name of a group is the word with the highest appearance frequency among the words included in the cluster. The group name is displayed underlined inside the group. When a word cannot be displayed inside its elliptical region, the symbol "..." is displayed in place of the word.

The analysis result screen 41 displays a third slider and two fourth buttons (marked "+" or "-") for specifying the zoom ratio. By operating the mouse 29, the user sets the zoom ratio by moving the thumb of the third slider left or right or by pressing a fourth button. On the analysis result screen 41, the groups containing words are displayed enlarged or reduced according to the set zoom ratio. The initial value of the zoom ratio is set to 100%. In its initial state, the analysis result screen 41 displays all the groups.

When the user changes the number of groups m, the maximum number of data n in a group, or the analysis target period on the analysis setting screen 42, the content of the analysis result screen 41 changes accordingly. When the user instructs word exclusion, synonym registration, or compound word registration on the analysis result screen 41, the content of the analysis result screen 41 likewise changes accordingly.

When performing hierarchical cluster analysis on the words extracted from the text data 5, the text exploration device 10 refers to an excluded word list storing words to be excluded, a synonym list storing words to be treated as synonyms, and a compound word list storing words to be treated as compound words. In the synonym list, a plurality of words having the same (or nearly the same) meaning are stored in association with one word that represents them. In the compound word list, a plurality of words that form one compound word when joined are stored in association with the compound word that joins them. For example, "daigakusei (university student)" and "gakusei (student)" are stored in the synonym list in association with "daigakusei", which represents both. Likewise, "nintai (endurance)" and "tsuyoi (strong)" are stored in the compound word list in association with "nintaizuyoi (high endurance)", which joins the two. The text exploration device 10 may have a plurality of synonym lists and a plurality of compound word lists.

FIG. 4 is a flowchart showing the operation of the text exploration device 10. FIG. 5 is a flowchart showing the details of the screen data generation processing (step S111 shown in FIG. 4) of the text exploration device 10. The input unit 24 and the CPU 21 executing step S113 function as the instruction input unit 11. The CPU 21 executing steps S109 to S110 functions as the text analysis unit 12. The CPU 21 executing step S111 functions as the screen generation unit 13. The display unit 25 and the CPU 21 executing step S112 function as the analysis result display unit 14. The operation of the text exploration device 10 is described below with reference to FIGS. 4 and 5.

First, the CPU 21 causes the display unit 25 to display the data specifying screen 51 shown in FIG. 6 (step S101). The data specifying screen 51 displays a box for specifying a file name and a box for specifying a folder name. The user specifies the text data 5 to be analyzed by specifying a file name or a folder name on the data specifying screen 51. The text data 5 may be stored in the storage unit 23, such as a hard disk, or on a server or another computer connected via the communication unit 26.

Next, the CPU 21 transfers the text data 5 specified on the data specifying screen 51 to the main memory 22. The text data 5 is thereby input to the text exploration device 10 (step S102). FIG. 7 is a diagram showing an example of the text data 5. The text data shown in FIG. 7 is data of reports written by university students, and is time-series data carrying date information. From the top, the text data shown in FIG. 7 reads "Regarding the relationship between university students and society in this course ...", "Before entering society after graduation, university students generally work part-time or ...", "We students need to be aware that we are studying after paying expensive tuition ...", and "Student life is a precious time for building self-confidence. Moreover ...". The type of the text data 5 analyzed by the text exploration device 10 is arbitrary.

Next, the CPU 21 causes the display unit 25 to display the purpose designation screen 52 shown in FIG. 8 (step S103). The purpose designation screen 52 displays three radio buttons corresponding to content, feature, and evaluation. The user selects an analysis purpose from among content, feature, and evaluation by operating the mouse 29 to press one of the radio buttons. The CPU 21 then receives the analysis purpose specified on the purpose designation screen 52. The analysis purpose is thereby input to the text exploration device 10 (step S104).

Next, the CPU 21 causes the display unit 25 to display the synonym list selection screen 53 shown in FIG. 9 (step S105). The synonym list selection screen 53 displays the names of the synonym lists held by the text exploration device 10 and the synonyms registered in each synonym list. The user specifies the synonym list to be used by operating the mouse 29 to select one of the synonym lists on the synonym list selection screen 53. A synonym list is thereby selected in the text exploration device 10 (step S106).

Next, the CPU 21 causes the display unit 25 to display the compound word list selection screen 54 shown in FIG. 10 (step S107). The compound word list selection screen 54 displays the names of the compound word lists held by the text exploration device 10 and the compound words registered in each compound word list. The user specifies the compound word list to be used by operating the mouse 29 to select one of the compound word lists on the compound word list selection screen 54. A compound word list is thereby selected in the text exploration device 10 (step S108).

Next, taking into account the excluded word list, the synonym list, and the compound word list, the CPU 21 extracts words of the types corresponding to the analysis purpose specified in step S104 from the portion of the text data 5 input in step S102 that falls within the analysis target period (step S109). When the analysis purpose is "content", the CPU 21 extracts nouns, proper nouns, place names, and person names from the text data 5. When the analysis purpose is "feature", the CPU 21 extracts nouns, proper nouns, sa-row irregular conjugation nouns, and verbs from the text data 5. When the analysis purpose is "evaluation", the CPU 21 extracts adjectives, adjectival verbs, and interjections from the text data 5. The text exploration device 10 may also support analysis purposes other than these three, and the CPU 21 may extract types of words different from those listed above for each analysis purpose.
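The correspondence between analysis purpose and word type described above can be written as a small lookup table. The following Python sketch assumes that the text has already been split into (word, part-of-speech) pairs by some morphological analyzer; the embodiment does not name one, and the names `POS_BY_PURPOSE` and `select_words` as well as the part-of-speech labels are hypothetical.

```python
# Word types extracted for each analysis purpose, as listed in the description.
POS_BY_PURPOSE = {
    "content": {"noun", "proper noun", "place name", "person name"},
    "feature": {"noun", "proper noun", "sa-row irregular conjugation noun", "verb"},
    "evaluation": {"adjective", "adjectival verb", "interjection"},
}


def select_words(tagged_words, purpose):
    """Keep only the words whose part of speech matches the analysis purpose.

    tagged_words: iterable of (word, part_of_speech) pairs (assumed input format).
    """
    wanted = POS_BY_PURPOSE[purpose]
    return [word for word, pos in tagged_words if pos in wanted]
```

For instance, with the pairs `("daigakusei", "noun")`, `("tsuyoi", "adjective")`, and `("manabu", "verb")`, the purpose "evaluation" keeps only "tsuyoi".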

When the text data 5 is time-series data, the CPU 21, in executing step S109, extracts words only from the portion of the text data 5 that falls within the analysis target period indicated by the user. When a word W1 is stored in the excluded word list, the CPU 21 completely ignores the word W1 in the text data 5 when executing step S109. When words W2 and W3 are stored in the selected synonym list in association with the word W2 representing both, the CPU 21 treats every occurrence of the word W3 in the text data 5 as the word W2 when executing step S109. When words W4 and W5 are stored in the selected compound word list in association with a word W6 that joins the two, the CPU 21 treats every occurrence of the word W4 immediately followed by the word W5 in the text data 5 as the word W6 when executing step S109.
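The handling of the three lists in step S109 can be sketched as a single pass over a word sequence. This Python sketch is illustrative only: the function name `normalize_words` and the dictionary and set layouts are hypothetical, and the order of application (compound words first, then synonyms, then exclusion) is an assumption, since the description does not state an order.

```python
def normalize_words(words, excluded, synonyms, compounds):
    """Apply the compound word list, the synonym list, and the excluded word list.

    words:     list of words in order of appearance
    excluded:  set of words to ignore (W1)
    synonyms:  dict mapping a word to its representative (W3 -> W2)
    compounds: dict mapping an adjacent pair to the joined word ((W4, W5) -> W6)
    """
    result = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            # Two adjacent words that form a registered compound are treated as one word.
            word = compounds[pair]
            i += 2
        else:
            word = words[i]
            i += 1
        # Replace a synonym with its representative word.
        word = synonyms.get(word, word)
        # Ignore excluded words entirely.
        if word not in excluded:
            result.append(word)
    return result
```

Using the examples from the description, the sequence gakusei, nintai, tsuyoi, shakai, daigakusei with "shakai" excluded becomes daigakusei, nintaizuyoi, daigakusei, so the appearance frequency of "daigakusei" is the sum of the two original frequencies.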

Next, the CPU 21 performs hierarchical cluster analysis on the words extracted in step S109 (step S110). In step S110, the CPU 21 obtains the similarity between two words based on, for example, the distance between the two words in the text data 5 (how far apart the two words appear). Based on the similarities obtained between words, the CPU 21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the decimal method, or Ward's method). In step S110, the CPU 21 also obtains the appearance frequency of each word.
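As an illustration of the hierarchical cluster analysis in step S110, the following minimal Python sketch applies the shortest distance method (single linkage), one of the methods named above. The function names `single_linkage` and `cut`, and the idea of passing the word-to-word distance as a callable, are assumptions made for this example; how the distance is actually derived from the text data 5 is left open here.

```python
from itertools import combinations


def single_linkage(words, distance):
    """Agglomerative clustering with the shortest distance method.

    Returns the merge history: a list of partitions, starting with one
    cluster per word and ending with a single cluster of all words.
    """
    clusters = [frozenset([w]) for w in words]
    history = [list(clusters)]
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest to each other.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(distance(x, y) for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history


def cut(history, m):
    """Return the partition that has exactly m clusters (1 <= m <= number of words)."""
    word_count = len(history[0])
    if not 1 <= m <= word_count:
        raise ValueError("m must be between 1 and the number of words")
    # Each merge reduces the cluster count by one.
    return history[word_count - m]
```

Keeping the whole merge history means that changing the number of groups m later only requires another call to `cut`, without repeating the analysis.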

Next, based on the result of the hierarchical cluster analysis obtained in step S110, the CPU 21 generates screen data for displaying the analysis result (step S111). In step S111, the CPU 21 performs the processing shown in FIG. 5.

The CPU 21 sets the number of groups to m and the maximum number of data in a group to n (step S201). Next, the CPU 21 sets the number of clusters to m for the result of the hierarchical cluster analysis and obtains m clusters (step S202). Next, for each cluster, the CPU 21 obtains the total appearance frequency of the words included in the cluster (step S203). Next, the CPU 21 determines the display size of each group based on the totals obtained in step S203 (step S204). In step S204, the larger the total appearance frequency of the words included in a cluster, the larger the display size determined for the group.

Next, for each cluster, the CPU 21 selects the words to be displayed from among the words included in the cluster (step S205). In step S205, n or fewer words are selected from the words included in each cluster in descending order of appearance frequency. Next, for each word selected in step S205, the CPU 21 determines the display size of the word based on its appearance frequency (step S206). In step S206, the higher the appearance frequency of a word, the larger the display size determined for the word.

Next, the CPU 21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207). The screen data generated in step S207 includes m groups (represented by cloud-shaped graphics) having the sizes determined in step S204. Each group contains n or fewer words having the sizes determined in step S206. On the screen, the words are displayed inside the groups. After executing step S207, the CPU 21 ends the screen data generation processing.
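Steps S203 to S207 above can be summarized in the following Python sketch. It is a sketch under stated assumptions rather than the embodiment's implementation: the function name `build_screen_data`, the dictionary layout of the result, the `base_size` parameter, the linear scaling of sizes with frequency, and the alphabetical tie-break are all hypothetical. The description only requires that a higher frequency gives a larger size.

```python
def build_screen_data(clusters, frequency, n, base_size=10.0):
    """Turn m clusters into display groups holding at most n words each.

    clusters:  iterable of sets of words (the m clusters from step S202)
    frequency: dict mapping each word to its appearance frequency
    """
    groups = []
    for cluster in clusters:
        # S203: total appearance frequency of all words in the cluster.
        total = sum(frequency[w] for w in cluster)
        # S205: pick at most n words in descending order of frequency
        # (assumption: ties are broken alphabetically).
        shown = sorted(cluster, key=lambda w: (-frequency[w], w))[:n]
        groups.append({
            # The group is named after its most frequent word.
            "name": shown[0],
            # S204: group size grows with the total (assumed linear).
            "size": base_size * total,
            # S206: word size grows with the word's own frequency (assumed linear).
            "words": [{"text": w, "size": base_size * frequency[w]} for w in shown],
        })
    return groups
```

Note that the group size is computed from every word in the cluster, including those that are not displayed because of the limit n.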

Next, the CPU 21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112). Next, the CPU 21 receives an instruction from the user (step S113). Next, the CPU 21 proceeds to one of steps S115 to S120 according to the type of the instruction received in step S113 (step S114).

When the instruction received in step S113 is "setting of the number of groups", the CPU 21 proceeds to step S115. In this case, the CPU 21 sets the number of groups m to the value indicated by the user (step S115) and proceeds to step S111. Screen data is then generated based on the set number of groups m, and a new screen is displayed. An analysis result screen containing the specified number of groups is thereby displayed.

When the instruction received in step S113 is "setting of the maximum number of data in a group", the CPU 21 proceeds to step S116. In this case, the CPU 21 sets the maximum number of data n in a group to the value indicated by the user (step S116) and proceeds to step S111. Screen data is then generated based on the set maximum number of data n, and a new screen is displayed. An analysis result screen in which the number of words in each group is limited to the specified value or fewer is thereby displayed.

When the instruction received in step S113 is "setting of the analysis target period", the CPU 21 proceeds to step S117. In this case, the CPU 21 sets the analysis target period to the period indicated by the user (step S117) and proceeds to step S109. Hierarchical cluster analysis is then performed with reference to the set analysis target period, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis of the words included in the text data within the specified analysis target period is thereby displayed on the screen.

FIG. 11A is a diagram showing the analysis result screen before the analysis target period is set. FIG. 11B is a diagram showing the analysis result screen after the analysis target period is set. The analysis result screen 61 before setting, shown in FIG. 11A, displays the result of hierarchical cluster analysis of the words included in the portion of the input text data 5 from 0:00 on January 1, 2014 to 24:00 on December 31, 2015. The analysis result screen 62 after setting, shown in FIG. 11B, displays the result of hierarchical cluster analysis of the words included in the portion of the input text data 5 from 0:00 on March 1, 2014 to 24:00 on September 30, 2014. The display content of the analysis result screen 61 differs from that of the analysis result screen 62. By comparing the analysis result screens before and after the analysis target period is set, the user can easily recognize how the hierarchical cluster analysis result changes over time.

When the instruction received in step S113 is "word exclusion", the CPU 21 proceeds to step S118. In this case, the CPU 21 adds the specified word to the excluded word list (step S118) and proceeds to step S109. Hierarchical cluster analysis is then performed with the specified word excluded, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis performed with the specified word excluded is thereby displayed on the screen.

FIG. 12A is a diagram showing the analysis result screen before word exclusion. FIG. 12B is a diagram showing the analysis result screen after word exclusion. The user operates the mouse 29 to select the word to be excluded and then instructs word exclusion. On the analysis result screen 63 before word exclusion shown in FIG. 12A, "shakai (society)" is selected, and "word exclusion" is selected from the menu. The result of hierarchical cluster analysis performed with "shakai" excluded is then displayed on the screen. On the analysis result screen 64 after word exclusion shown in FIG. 12B, "shingaku (advancing to higher education)" is displayed in place of "shakai". Among the words included in the same cluster as "shakai", "shingaku" has the highest appearance frequency after the five words displayed on the analysis result screen 63.

When the instruction received in step S113 is "synonym registration", the CPU 21 proceeds to step S119. In this case, the CPU 21 adds the indicated words to the synonym list in use (step S119) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the indicated synonyms into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis performed with the indicated words treated as synonyms is thereby displayed on the screen.

FIG. 13A is a diagram showing the analysis result screen before synonym registration. FIG. 13B is a diagram showing the analysis result screen after synonym registration. The user operates the mouse 29 to select the plurality of words to be registered as synonyms and then instructs synonym registration. On the analysis result screen 65 before synonym registration shown in FIG. 13A, "daigakusei (university student)" and "gakusei (student)" are selected, and "synonym registration" is selected from the menu. The result of hierarchical cluster analysis performed with "daigakusei" and "gakusei" treated as synonyms is then displayed on the screen. On the analysis result screen 66 after synonym registration shown in FIG. 13B, "shingaku (advancing to higher education)" is displayed in place of "gakusei". In addition, because the appearance frequency of "daigakusei" is now the sum of the appearance frequencies of "daigakusei" and "gakusei", "daigakusei" is displayed at a larger size than on the analysis result screen 65.

When the instruction received in step S113 is "compound word registration", the CPU 21 proceeds to step S120. In this case, the CPU 21 adds the indicated words to the compound word list in use (step S120) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the indicated compound word into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis performed with the specified words treated as a compound word is thereby displayed on the screen.

FIG. 14A is a diagram showing the analysis result screen before compound word registration. FIG. 14B is a diagram showing the analysis result screen after compound word registration. The user operates the mouse 29 to select the plurality of words to be registered as a compound word and then instructs compound word registration. On the analysis result screen 67 before compound word registration shown in FIG. 14A, "nintai (endurance)" and "tsuyoi (strong)" are selected, and "compound word registration" is selected from the menu. The result of hierarchical cluster analysis performed with "nintai" and "tsuyoi" treated as a compound word is then displayed on the screen. On the analysis result screen 68 after compound word registration shown in FIG. 14B, "nintaizuyoi (high endurance)" is displayed in place of "nintai" and "tsuyoi", at a size no larger than that of "nintai" and "tsuyoi".
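The branching in step S114 and the handling in steps S115 to S120 described above can be sketched as a small dispatcher. In this Python sketch the function name `handle_instruction`, the instruction keys, and the layout of the `state` dictionary are all hypothetical; what it is meant to show is only which instructions return to step S111 (regenerating the screen data) and which return to step S109 (extracting words and repeating the cluster analysis).

```python
def handle_instruction(state, kind, value):
    """Update the settings and return the step at which processing resumes."""
    if kind == "set_group_count":            # S115
        state["m"] = value
        return "S111"                        # only the screen data is regenerated
    if kind == "set_max_words":              # S116
        state["n"] = value
        return "S111"
    if kind == "set_period":                 # S117
        state["period"] = value
    elif kind == "exclude_word":             # S118
        state["excluded"].add(value)
    elif kind == "register_synonym":         # S119
        words, representative = value
        for word in words:
            state["synonyms"][word] = representative
    elif kind == "register_compound":        # S120
        parts, joined = value
        state["compounds"][tuple(parts)] = joined
    else:
        raise ValueError(f"unknown instruction: {kind}")
    return "S109"                            # words are extracted and analyzed again
```

Changing m or n leaves the cluster analysis result untouched, which is why those two branches can skip directly to the screen data generation.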

As described above, the text exploration method of this embodiment includes: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data. In the screen generation step, m clusters are obtained from the analysis result based on the number of groups m and the maximum number of data n in a group, and screen data is generated for displaying on the screen groups each containing n or fewer of the words included in the corresponding cluster. According to the text exploration method of this embodiment, groups containing the words included in the clusters can be displayed on the screen based on the result of hierarchical cluster analysis of the words included in the text data. In addition, the number of words contained in a group is limited to n or fewer. The user can therefore intuitively understand the result of the hierarchical cluster analysis when viewing the screen.

The words included in a group are selected, in descending order of appearance frequency, from the words included in the cluster corresponding to the group. Accordingly, the words with high appearance frequencies among the words included in the cluster are displayed inside the group, and the user can easily recognize the frequently appearing words of each cluster. In addition, a group has, in the screen, a size corresponding to the total of the appearance frequencies of the words included in the cluster corresponding to the group. Therefore, the user can easily recognize the clusters whose total word appearance frequency is large. Furthermore, each word included in a group has, in the screen, a size corresponding to its appearance frequency. Therefore, the user can easily recognize the frequently appearing words.
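
The selection and sizing rules above can be sketched as follows. This is a minimal example under stated assumptions: the function name `build_group`, the alphabetical tie-break, and the use of raw frequencies as display weights are hypothetical, and the mapping from a weight to an actual on-screen size is left open because the text only says that the size corresponds to the frequency.

```python
def build_group(cluster_words, frequency, n):
    """Pick the n most frequent words of a cluster and derive display weights."""
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(cluster_words, key=lambda w: (-frequency[w], w))
    shown = ranked[:n]
    return {
        "words": shown,
        # The weight of a word follows the word's own appearance frequency.
        "word_weights": {w: frequency[w] for w in shown},
        # The weight of the group follows the total frequency of the whole
        # cluster, including words that are not displayed.
        "group_weight": sum(frequency[w] for w in cluster_words),
    }


# Placeholder words and frequencies, for illustration only.
frequency = {"a": 5, "b": 3, "c": 9, "d": 1}
group = build_group(["a", "b", "c", "d"], frequency, 2)
```

With n = 2, only the two most frequent words are kept for display, while the group weight still reflects all four words of the cluster.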

The text exploration method also includes an instruction input step of inputting an instruction from the user, and either the text analysis step or the screen generation step is executed according to the instruction input in the instruction input step. Therefore, the display mode of the result of the hierarchical cluster analysis can be switched according to an instruction from the user. In particular, the instruction input step receives a setting instruction for the number of groups m, and the screen generation step generates the screen data based on the number of groups m specified in the instruction input step. The number of regions displayed on the screen (the number of clusters) is thereby switched according to an instruction from the user. The instruction input step also receives a setting instruction for the maximum number of data n in a group, and the screen generation step generates the screen data based on the maximum number of data n specified in the instruction input step. The number of words displayed in each region is thereby switched according to an instruction from the user.

The instruction input step also receives an instruction specifying an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words included in the portion of the text data that falls within the analysis target period specified in the instruction input step. The result of hierarchical cluster analysis of the words included in the text data within the period specified by the user is thus displayed on the screen, so the user can easily recognize how the result of the hierarchical cluster analysis changes over time. The instruction input step further receives a setting instruction for the analysis purpose, and the text analysis step extracts, from the text data 5, words of the types corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis on them. The types of words to be analyzed can thereby be switched according to the analysis purpose specified by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.
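
The two filters described above, by analysis target period and by word type, could look like the sketch below. Everything concrete here is an assumption for illustration: the record layout (a date plus a list of word and word-type pairs, as if word extraction had already been done), the purpose names in `PURPOSE_TO_WORD_TYPES`, the function name `select_words`, and the sample dates.

```python
from datetime import date

# Hypothetical mapping from an analysis purpose to the word types it uses.
PURPOSE_TO_WORD_TYPES = {
    "topics": {"noun"},
    "impressions": {"adjective", "verb"},
}


def select_words(records, start, end, purpose):
    """Collect words of the wanted types from records dated within [start, end]."""
    allowed = PURPOSE_TO_WORD_TYPES[purpose]
    return [
        word
        for written_on, tokens in records
        if start <= written_on <= end  # analysis target period filter
        for word, word_type in tokens
        if word_type in allowed  # word types for the analysis purpose
    ]


# Illustrative records only; dates and word types are made up for the example.
records = [
    (date(2016, 7, 1), [("nintai", "noun"), ("tsuyoi", "adjective")]),
    (date(2016, 9, 1), [("nintaizuyoi", "adjective")]),
]
july_topics = select_words(records, date(2016, 7, 1), date(2016, 7, 31), "topics")
```

Running the same selection for successive periods and clustering each result separately is one way the change over time mentioned above could be observed.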

The instruction input step also receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words specified in the instruction input step excluded. The result of hierarchical cluster analysis performed without the words specified by the user can thereby be displayed. The instruction input step further receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word. The result of hierarchical cluster analysis in which the plurality of words specified by the user are treated as the same word can thereby be displayed on the screen. The instruction input step further receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis after combining the plurality of words specified in the instruction input step into one word. The result of hierarchical cluster analysis in which the plurality of words specified by the user are combined into one word can thereby be displayed on the screen.
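
A minimal sketch of how the three registrations might be applied to a word sequence before clustering is given below. The function name `normalize_tokens`, the restriction of compounds to two adjacent words, the order of application (compounds, then synonyms, then exclusion), and every sample word except "nintai", "tsuyoi", and "nintaizuyoi" are assumptions of this example, not details stated in the embodiment.

```python
def normalize_tokens(tokens, excluded=(), synonyms=None, compounds=None):
    """Apply compound, synonym, and exclusion registrations to a word list."""
    synonyms = synonyms or {}
    compounds = compounds or {}

    # 1. Compound words: merge a registered pair of adjacent words into one.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i : i + 2])
        if len(pair) == 2 and pair in compounds:
            merged.append(compounds[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # 2. Synonyms: map each registered word to its representative word.
    unified = [synonyms.get(word, word) for word in merged]

    # 3. Exclusion: drop the words the user asked to leave out.
    excluded_set = set(excluded)
    return [word for word in unified if word not in excluded_set]


tokens = ["kare", "wa", "nintai", "tsuyoi", "hito"]
result = normalize_tokens(
    tokens,
    excluded={"wa"},
    synonyms={"hito": "ningen"},
    compounds={("nintai", "tsuyoi"): "nintaizuyoi"},
)
```

After such a normalization, the word frequencies and the hierarchical cluster analysis would be recomputed on the resulting sequence, which matches the re-analysis the text describes after each registration.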

The screen generation step also generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen. The analysis result screen and the analysis setting screen are therefore both displayed, and the user can easily switch the display mode of the result of the hierarchical cluster analysis by using the analysis setting screen.

The text exploration program 31 of this embodiment and the text exploration device 10 of this embodiment have the same configuration as the text exploration method of this embodiment and provide the same effects.

According to the text exploration method, the text exploration program, and the text exploration device of this embodiment, groups each containing not more than the maximum number of data of the words included in a cluster can be displayed on the screen based on the result of hierarchical cluster analysis performed on the words included in the text data. Therefore, the user can intuitively understand the result of the hierarchical cluster analysis on viewing the screen.

This application claims priority based on Japanese Patent Application No. 2016-145065, titled "Text Exploration Method, Text Exploration Program, and Text Exploration Device" and filed on July 25, 2016, the contents of which are incorporated herein by reference.

40‧‧‧Display screen

41‧‧‧Analysis result screen

42‧‧‧Analysis setting screen

Claims

A text exploration method for displaying an analysis result of text data on a screen, characterized by comprising: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying the screen based on the screen data; wherein the screen generation step obtains, based on a number of groups and a maximum number of data in a group, clusters of the number of groups from the analysis result, and generates screen data for displaying on the screen groups each containing not more than the maximum number of data of the words included in the clusters.

The text exploration method of claim 1, wherein the words included in the group are selected, in descending order of appearance frequency, from the words included in the cluster corresponding to the group.

The text exploration method of claim 2, wherein the group has, in the screen, a size corresponding to the total of the appearance frequencies of the words included in the cluster corresponding to the group.

The text exploration method of claim 3, wherein the words included in the group have, in the screen, sizes corresponding to the appearance frequencies of the words.

The text exploration method of claim 1, further comprising an instruction input step of inputting an instruction from a user, wherein either of the text analysis step and the screen generation step is executed according to the instruction input in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for the number of groups, and the screen generation step generates the screen data according to the number of groups set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for the maximum number of data, and the screen generation step generates the screen data according to the maximum number of data set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words included in the portion of the text data within the analysis target period set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for an analysis purpose, and the text analysis step extracts, from the text data, words of the types corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis.

The text exploration method of claim 5, wherein the instruction input step receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words specified in the instruction input step excluded.

The text exploration method of claim 5, wherein the instruction input step receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word.

The text exploration method of claim 5, wherein the instruction input step receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis after combining the plurality of words specified in the instruction input step into one word.

The text exploration method of claim 1, wherein the screen generation step generates screen data for displaying an analysis result screen that includes the group and an analysis setting screen for setting the display mode of the analysis result screen.

A computer-readable recording medium on which is recorded a text exploration program for displaying an analysis result of text data on a screen, characterized in that the program causes a computer, by a CPU using a memory, to execute: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying the screen based on the screen data; wherein the screen generation step obtains, based on a number of groups and a maximum number of data in a group, clusters of the number of groups from the analysis result, and generates screen data for displaying on the screen groups each containing not more than the maximum number of data of the words included in the clusters.

The computer-readable recording medium of claim 14, wherein the words included in the group are selected, in descending order of appearance frequency, from the words included in the cluster corresponding to the group.

The computer-readable recording medium of claim 15, wherein the group has, in the screen, a size corresponding to the total of the appearance frequencies of the words included in the cluster corresponding to the group.

The computer-readable recording medium of claim 16, wherein the words included in the group have, in the screen, sizes corresponding to the appearance frequencies of the words.

The computer-readable recording medium of claim 14, wherein the computer is further caused to execute an instruction input step of inputting an instruction from a user, and either of the text analysis step and the screen generation step is executed according to the instruction input in the instruction input step.

The computer-readable recording medium of claim 14, wherein the screen generation step generates screen data for displaying an analysis result screen that includes the group and an analysis setting screen for setting the display mode of the analysis result screen.

A text exploration device for displaying an analysis result of text data on a screen, characterized by comprising: a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on the analysis result of the text analysis unit; and an analysis result display unit that displays the screen based on the screen data; wherein the screen generation unit obtains, based on a number of groups and a maximum number of data in a group, clusters of the number of groups from the analysis result, and generates screen data for displaying on the screen groups each containing not more than the maximum number of data of the words included in the clusters.

The text exploration device of claim 20, wherein the words included in the group are selected, in descending order of appearance frequency, from the words included in the cluster corresponding to the group.

The text exploration device of claim 21, wherein the group has, in the screen, a size corresponding to the total of the appearance frequencies of the words included in the cluster corresponding to the group.

The text exploration device of claim 22, wherein the words included in the group have, in the screen, sizes corresponding to the appearance frequencies of the words.

The text exploration device of claim 20, further comprising an instruction input unit for inputting an instruction from a user, wherein either of the text analysis unit and the screen generation unit operates according to the instruction input to the instruction input unit.

The text exploration device of claim 20, wherein the screen generation unit generates screen data for displaying an analysis result screen that includes the group and an analysis setting screen for setting the display mode of the analysis result screen.