TWI686716B

TWI686716B - Text mining method, computer-readable recording medium recording a text mining program, and text mining device

Info

Publication number: TWI686716B
Application number: TW106122011A
Authority: TW
Inventors: 秋田正史; 中村康則; 周景龍
Original assignee: 斯庫林集團股份有限公司
Priority date: 2016-07-25
Filing date: 2017-06-30
Publication date: 2020-03-01
Also published as: KR20190018480A; KR102180487B1; TW201807597A; WO2018020842A1; JP6794162B2; CN109478191A; CN109478191B; JP2018018118A

Abstract

In a text analysis step (S109 to S110), hierarchical cluster analysis is performed on words extracted from input text data. In a screen generation step (S111), m clusters are obtained from the analysis result of the text analysis step according to a number of groups m and a maximum number of data items per group n, and screen data is generated for displaying on a screen groups that each contain at most n of the words included in the corresponding cluster. In an analysis result display step (S112), the screen is displayed based on the generated screen data. The result of the hierarchical cluster analysis is thereby displayed on the screen in a form that the user can understand intuitively.

Description

Text mining method, computer-readable recording medium recording a text mining program, and text mining device

The present invention relates to text mining, and in particular to a text mining method, a text mining program, and a text mining device that display the analysis results of text data on a screen.

In recent years, text mining, which analyzes large amounts of free-form text data and derives useful information from the analysis results, has attracted attention. In text mining, for example, words are extracted from the text data to be analyzed, and information is obtained by analyzing the occurrence frequency and occurrence trends of the words.

The following considers a text mining device that performs hierarchical cluster analysis on words extracted from text data and displays the analysis result on a screen. In hierarchical cluster analysis, clusters containing highly similar words are built hierarchically based on the similarity between words. The result of hierarchical cluster analysis is generally presented to the user (analyst) as a dendrogram (tree diagram) such as the one shown in FIG. 15.
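The hierarchical construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the patent does not specify the similarity measure or the linkage rule, so the `distance` callback, the average-linkage rule, and the function name `agglomerate` are all hypothetical choices.

```python
from itertools import combinations


def agglomerate(words, distance):
    """Return the bottom-up merge history of a simple agglomerative clustering.

    `distance(a, b)` is an assumed dissimilarity between two words
    (smaller means more similar). Average linkage is an assumption too.
    Each history entry is (cluster_a, cluster_b, linkage_distance).
    """
    clusters = [(w,) for w in words]  # every word starts as its own cluster
    history = []

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters that is currently closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        a, b = clusters[i], clusters[j]
        history.append((a, b, linkage(a, b)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a + b]
    return history


# Toy example: words placed on a line, distance = gap between positions.
positions = {"W1": 0, "W2": 1, "W3": 10, "W4": 11}
merges = agglomerate(list(positions), lambda a, b: abs(positions[a] - positions[b]))
```

The merge history is the information a dendrogram draws: each entry is one junction, and its linkage distance is the height at which the two branches meet.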

In relation to the present invention, Patent Document 1 describes a clustering device having hierarchical clustering means that constructs a dendrogram, traverses the dendrogram to generate an index that allows identification from the lower levels to the upper levels, and stores the index in storage means. Patent Document 2 describes a query providing device having: distance matrix calculation means that calculates the distances between keywords, generates distance matrix data from which the distance between keywords can be looked up, and stores the data in storage means; and clustering means that hierarchically clusters the keywords using the distance matrix and stores the result in the storage means as a bottom-up index with which the constructed dendrogram can be traversed from the lower levels to the upper levels.

[Prior Art Documents] [Patent Documents]

[Patent Document 1] Japanese Patent Laid-Open No. 2011-216021

[Patent Document 2] Japanese Patent Laid-Open No. 2012-150539

A conventional text mining device displays the result of hierarchical cluster analysis on a screen as a dendrogram. Such a device, however, has the problem that the user cannot understand the analysis result intuitively. For example, when the user sets the number of clusters to 4 for the analysis result shown in FIG. 15, a cut line is set on the dendrogram as shown in FIG. 16. The user nevertheless cannot tell intuitively which words each cluster contains just by looking at such a dendrogram. Moreover, when the number of words is large and the user changes the number of clusters, the user cannot intuitively grasp how the words contained in each cluster change.
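Setting a cut line for a given number of clusters amounts to stopping the bottom-up merging early. The sketch below assumes a merge list in bottom-up order; the function name `cut_into_clusters` and the example merge order for words W1 to W6 are hypothetical and are not taken from FIG. 15.

```python
def cut_into_clusters(words, merges, m):
    """Apply the first len(words) - m merges so that exactly m clusters remain.

    `merges` lists pairs of word tuples in the order they were merged,
    which is the order a dendrogram is read from the bottom up.
    """
    if not 1 <= m <= len(words):
        raise ValueError("m must be between 1 and the number of words")
    clusters = [(w,) for w in words]
    for a, b in merges[: len(words) - m]:
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


# Hypothetical merge order for six words.
words = ["W1", "W2", "W3", "W4", "W5", "W6"]
merges = [
    (("W1",), ("W2",)),
    (("W4",), ("W5",)),
    (("W1", "W2"), ("W3",)),
    (("W4", "W5"), ("W6",)),
    (("W1", "W2", "W3"), ("W4", "W5", "W6")),
]
four = cut_into_clusters(words, merges, 4)
two = cut_into_clusters(words, merges, 2)
```

Changing `m` from 4 to 2 regroups the words, which is the change the passage says a user cannot read off a dendrogram at a glance.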

Furthermore, because a dendrogram does not show the occurrence frequency of words, the user cannot tell which words are more important. In addition, when the text data to be analyzed is time-series data carrying information such as a date or time, the user may want to know how the analysis result changes over time. Conventional text mining devices cannot meet these needs.

An object of the present invention is therefore to provide a text mining method, a text mining program, and a text mining device that display the result of hierarchical cluster analysis on a screen in a form that the user can understand intuitively.

A first aspect of the present invention is a text mining method for displaying the analysis result of text data on a screen, the method comprising: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains from the analysis result, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in the corresponding cluster.

A second aspect of the present invention is characterized in that, in the first aspect, the words contained in each group are selected, in descending order of occurrence frequency, from the words included in the cluster corresponding to that group.

A third aspect of the present invention is characterized in that, in the second aspect, each group has, within the screen, a size corresponding to the total occurrence frequency of the words included in the cluster corresponding to that group.

A fourth aspect of the present invention is characterized in that, in the third aspect, each word contained in a group has, within the screen, a size corresponding to the occurrence frequency of that word.

A fifth aspect of the present invention is characterized in that, in the first aspect, the method further comprises an instruction input step of receiving an instruction from the user, and either the text analysis step or the screen generation step is executed according to the instruction received in the instruction input step.

A sixth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the number of groups, and the screen generation step generates the screen data according to the number of groups set in the instruction input step.

A seventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the maximum number of data items, and the screen generation step generates the screen data according to the maximum number of data items set in the instruction input step.

An eighth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words contained in the portion of the text data that falls within the analysis target period set in the instruction input step.

A ninth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis purpose, and the text analysis step extracts from the text data words of a type corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis on them.

A tenth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words specified in the instruction input step excluded.

An eleventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word.

A twelfth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis after merging the plurality of words specified in the instruction input step into one word.

A thirteenth aspect of the present invention is characterized in that, in the first aspect, the screen generation step generates screen data for displaying an analysis result screen containing the groups and an analysis setting screen for setting the display mode of the analysis result screen.

A fourteenth aspect of the present invention is a computer-readable recording medium recording a text mining program for displaying the analysis result of text data on a screen, the program causing a computer, by means of a CPU (central processing unit) using a memory, to execute: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains from the analysis result, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in the corresponding cluster.

A fifteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the words contained in each group are selected, in descending order of occurrence frequency, from the words included in the cluster corresponding to that group.

A sixteenth aspect of the present invention is characterized in that, in the fifteenth aspect, each group has, within the screen, a size corresponding to the total occurrence frequency of the words included in the cluster corresponding to that group.

A seventeenth aspect of the present invention is characterized in that, in the sixteenth aspect, each word contained in a group has, within the screen, a size corresponding to the occurrence frequency of that word.

An eighteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the program causes the computer to further execute an instruction input step of receiving an instruction from the user, and either the text analysis step or the screen generation step is executed according to the instruction received in the instruction input step.

A nineteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the screen generation step generates screen data for displaying an analysis result screen containing the groups and an analysis setting screen for setting the display mode of the analysis result screen.

A twentieth aspect of the present invention is a text mining device for displaying the analysis result of text data on a screen, the device comprising: a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on the analysis result of the text analysis unit; and an analysis result display unit that displays a screen based on the screen data, wherein the screen generation unit obtains from the analysis result, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in the corresponding cluster.

A twenty-first aspect of the present invention is characterized in that, in the twentieth aspect, the words contained in each group are selected, in descending order of occurrence frequency, from the words included in the cluster corresponding to that group.

A twenty-second aspect of the present invention is characterized in that, in the twenty-first aspect, each group has, within the screen, a size corresponding to the total occurrence frequency of the words included in the cluster corresponding to that group.

A twenty-third aspect of the present invention is characterized in that, in the twenty-second aspect, each word contained in a group has, within the screen, a size corresponding to the occurrence frequency of that word.

A twenty-fourth aspect of the present invention is characterized in that, in the twentieth aspect, the device further comprises an instruction input unit for receiving an instruction from the user, and either the text analysis unit or the screen generation unit operates according to the instruction received by the instruction input unit.

A twenty-fifth aspect of the present invention is characterized in that, in the twentieth aspect, the screen generation unit generates screen data for displaying an analysis result screen containing the groups and an analysis setting screen for setting the display mode of the analysis result screen.
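The size rules in the aspects above (group size following the total frequency of its cluster, word size following the word's own frequency) can be illustrated as follows. The aspects only say the size "corresponds to" the frequency; the linear scaling, the area and point-size ranges, and the function names here are assumptions made for the sketch.

```python
def group_sizes(clusters, frequency, total_area=100.0):
    """Split `total_area` among groups in proportion to cluster frequency totals.

    `clusters` is a list of word sequences; `frequency` maps word -> count.
    Proportional (linear) scaling is an assumption, not taken from the patent.
    """
    totals = [sum(frequency[w] for w in cluster) for cluster in clusters]
    grand_total = sum(totals)
    return [total_area * t / grand_total for t in totals]


def word_sizes(words, frequency, min_pt=10.0, max_pt=40.0):
    """Map each word to a font size that grows linearly with its frequency."""
    top = max(frequency[w] for w in words)
    return {w: min_pt + (max_pt - min_pt) * frequency[w] / top for w in words}


# Hypothetical frequencies for three words in two clusters.
frequency = {"W1": 6, "W2": 2, "W3": 2}
areas = group_sizes([["W1", "W2"], ["W3"]], frequency)
fonts = word_sizes(["W1", "W2"], frequency)
```

Note that the group size uses every word in the cluster, including words that are not displayed because of the per-group limit, which matches the wording of the third, sixteenth, and twenty-second aspects.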

According to the first, fourteenth, or twentieth aspect of the present invention, groups containing the words included in the clusters are displayed on the screen based on the result of hierarchical cluster analysis of the words contained in the text data. The number of words contained in each group is limited to the maximum number of data items or fewer. The user can therefore understand the result of the hierarchical cluster analysis intuitively by looking at the screen.

According to the second, fifteenth, or twenty-first aspect of the present invention, the words with high occurrence frequency among the words included in a cluster are displayed inside the group. The user can therefore easily recognize the frequently occurring words in each cluster.

According to the third, sixteenth, or twenty-second aspect of the present invention, each group has, within the screen, a size corresponding to the total occurrence frequency of the words included in the cluster. The user can therefore easily recognize clusters whose total word occurrence frequency is large.

According to the fourth, seventeenth, or twenty-third aspect of the present invention, each word has, within the screen, a size corresponding to its occurrence frequency. The user can therefore easily recognize frequently occurring words.

According to the fifth, eighteenth, or twenty-fourth aspect of the present invention, the display mode of the result of the hierarchical cluster analysis can be switched in response to an instruction from the user.

According to the sixth aspect of the present invention, the number of groups (number of clusters) displayed on the screen can be switched in response to an instruction from the user.

According to the seventh aspect of the present invention, the upper limit on the number of words contained in a group can be switched in response to an instruction from the user.

According to the eighth aspect of the present invention, the result of hierarchical cluster analysis of the words contained in the text data within the analysis target period specified by the user is displayed on the screen. The user can therefore easily recognize how the result of the hierarchical cluster analysis changes over time.

According to the ninth aspect of the present invention, the type of word to be analyzed can be switched according to the analysis purpose specified by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the tenth aspect of the present invention, the words specified by the user can be excluded, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the eleventh aspect of the present invention, the plurality of words specified by the user can be treated as the same word, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the twelfth aspect of the present invention, the plurality of words specified by the user can be merged into one word, and the result of the hierarchical cluster analysis can be displayed on the screen.

According to the thirteenth, nineteenth, or twenty-fifth aspect of the present invention, the analysis result screen and the analysis setting screen are displayed. The user can therefore use the analysis setting screen to easily switch the display mode of the result of the hierarchical cluster analysis.
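The word exclusion, synonym registration, and compound word registration of the tenth to twelfth aspects are all adjustments made to the extracted words before clustering. A minimal sketch follows; the function name `preprocess`, its parameters, and the order of the three operations (compounds first, then synonyms, then exclusion) are assumptions for illustration, since the text above does not fix them.

```python
def preprocess(tokens, excluded=(), synonyms=None, compounds=(), joiner=""):
    """Apply compound merging, synonym unification, and exclusion to a token list.

    tokens    -- words of one text in their original order
    excluded  -- words to drop (word exclusion instruction)
    synonyms  -- dict mapping a variant to the word it is treated as
    compounds -- tuples of adjacent words to merge into one word
    joiner    -- string placed between merged parts (assumed "" for Japanese)
    """
    synonyms = synonyms or {}
    merged = []
    i = 0
    while i < len(tokens):
        for parts in compounds:
            # Merge when the registered parts appear consecutively here.
            if parts and tuple(tokens[i:i + len(parts)]) == tuple(parts):
                merged.append(joiner.join(parts))
                i += len(parts)
                break
        else:
            merged.append(tokens[i])
            i += 1
    unified = [synonyms.get(t, t) for t in merged]
    return [t for t in unified if t not in excluded]


# Hypothetical English tokens, used only to keep the example readable.
tokens = ["new", "model", "PC", "is", "fast", "computer"]
cleaned = preprocess(
    tokens,
    excluded={"is"},
    synonyms={"PC": "computer"},
    compounds=[("new", "model")],
    joiner=" ",
)
```

After this step "PC" and "computer" count toward the same word's frequency, and "new model" is clustered as a single word rather than two.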

5‧‧‧Text data

10‧‧‧Text mining device

11‧‧‧Instruction input unit

12‧‧‧Text analysis unit

13‧‧‧Screen generation unit

14‧‧‧Analysis result display unit

20‧‧‧Computer

21‧‧‧CPU

22‧‧‧Main memory

23‧‧‧Storage unit

24‧‧‧Input unit

25‧‧‧Display unit

26‧‧‧Communication unit

27‧‧‧Recording medium reading unit

28‧‧‧Keyboard

29‧‧‧Mouse

30‧‧‧Recording medium

31‧‧‧Text mining program

40‧‧‧Display screen

41, 61~68‧‧‧Analysis result screen

42‧‧‧Analysis setting screen

51‧‧‧Data designation screen

52‧‧‧Purpose designation screen

53‧‧‧Synonym list selection screen

54‧‧‧Compound word list selection screen

m‧‧‧Number of groups (number of clusters)

n‧‧‧Maximum number of data items per group

W1~W6‧‧‧Words

FIG. 1 is a block diagram showing the configuration of a text mining device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining device shown in FIG. 1.

FIG. 3 is a diagram showing the display screen of the text mining device shown in FIG. 1.

FIG. 4 is a flowchart showing the operation of the text mining device shown in FIG. 1.

FIG. 5 is a flowchart of the screen data generation processing of the text mining device shown in FIG. 1.

FIG. 6 is a diagram showing the data designation screen of the text mining device shown in FIG. 1.

FIG. 7 is a diagram showing an example of text data input to the text mining device shown in FIG. 1.

FIG. 8 is a diagram showing the purpose designation screen of the text mining device shown in FIG. 1.

FIG. 9 is a diagram showing the synonym list selection screen of the text mining device shown in FIG. 1.

FIG. 10 is a diagram showing the compound word list selection screen of the text mining device shown in FIG. 1.

FIG. 11A is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 before the analysis target period is set.

FIG. 11B is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 after the analysis target period is set.

FIG. 12A is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 before word exclusion.

FIG. 12B is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 after word exclusion.

FIG. 13A is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 before synonym registration.

FIG. 13B is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 after synonym registration.

FIG. 14A is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 before compound word registration.

FIG. 14B is a diagram showing the analysis result screen of the text mining device shown in FIG. 1 after compound word registration.

FIG. 15 is a diagram showing an example of a dendrogram.

FIG. 16 is a diagram showing the dendrogram of FIG. 15 with a number of clusters set.

FIG. 17 is a diagram showing the words that appear in the drawings and their descriptions.

A text mining method, a text mining program, and a text mining device according to an embodiment of the present invention are described below with reference to the drawings. The text mining method of this embodiment is typically executed using a computer. The text mining program of this embodiment is a program for carrying out the text mining method using a computer. The text mining device of this embodiment is typically configured using a computer. A computer that executes the text mining program functions as the text mining device.

FIG. 1 is a block diagram showing the configuration of the text mining device according to the embodiment of the present invention. The text mining device 10 shown in FIG. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14. Text data 5 to be analyzed is input to the text mining device 10. The text mining device 10 performs hierarchical cluster analysis on words extracted from the input text data 5 and displays the analysis result on a screen.

The operation of the text mining device 10 is outlined as follows. Instructions from the user are input to the instruction input unit 11. The text analysis unit 12 extracts words from the input text data 5 and performs hierarchical cluster analysis on the extracted words. The screen generation unit 13 generates screen data based on the analysis result of the text analysis unit 12. The analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.

Instructions from the user that are input to the instruction input unit 11 include setting the number of groups, setting the maximum number of data items per group, setting the analysis target period, word exclusion, synonym registration, and compound word registration. When the text data 5 is time-series data carrying information such as a date or time, the text analysis unit 12 performs the hierarchical cluster analysis on the words contained in the portion of the input text data 5 that falls within the analysis target period set through the instruction input unit 11.
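Restricting the analysis to the analysis target period is a filter applied to the time-series records before words are extracted. The following sketch assumes each record is a `(date, text)` pair and that both ends of the period are inclusive; neither detail is stated in the description, and the function name `within_period` is hypothetical.

```python
from datetime import date


def within_period(records, start, end):
    """Keep the texts whose date lies in [start, end] (both ends inclusive)."""
    return [text for day, text in records if start <= day <= end]


# Hypothetical time-series text data.
records = [
    (date(2016, 1, 5), "first comment"),
    (date(2016, 2, 10), "second comment"),
    (date(2016, 3, 1), "third comment"),
]
selected = within_period(records, date(2016, 2, 1), date(2016, 3, 1))
```

Running the same analysis on successive periods and comparing the resulting screens is one way to see how the clusters change over time.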

The screen generation unit 13 generates the screen data according to the number of groups and the maximum number of data items per group (details are given later). When the user inputs a new instruction, the instructed processing is performed, the screen generation unit 13 then generates new screen data, and the analysis result display unit 14 displays a new screen. In this way, the text mining device 10 switches how the text data 5 is analyzed and how the analysis result is displayed in response to instructions from the user.
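Given the m clusters, the per-group limit n can be applied by ranking each cluster's words by occurrence frequency and keeping at most n of them, as the second aspect describes. In this sketch the function name `build_groups` and the alphabetical tie-break between equally frequent words are assumptions.

```python
def build_groups(clusters, frequency, n):
    """Return one display group per cluster, holding its n most frequent words.

    `clusters` is a list of word sequences (one per group), and
    `frequency` maps word -> occurrence count. Ties are broken
    alphabetically, which is an assumption made for determinism.
    """
    groups = []
    for cluster in clusters:
        ranked = sorted(cluster, key=lambda w: (-frequency[w], w))
        groups.append(ranked[:n])
    return groups


# Hypothetical clusters (m = 2) and frequencies.
clusters = [("W1", "W2", "W3"), ("W4", "W5", "W6")]
frequency = {"W1": 3, "W2": 9, "W3": 5, "W4": 1, "W5": 1, "W6": 7}
groups = build_groups(clusters, frequency, n=2)
```

Re-running this function with a different `n`, or with clusters obtained for a different `m`, corresponds to the user changing the settings and the device generating new screen data.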

圖2係顯示作為文字探勘裝置10而發揮功能之電腦之構成之方塊圖。圖2所示之電腦20，具備有CPU(Central Processing Unit；中央處理單元)21、主記憶體22、儲存部23、輸入部24、顯示部25、通信部26、及記錄媒體讀取部27。主記憶體22例如使用DRAM(Dynamic Random Access Memory；動態隨機存取記憶體)。儲存部23例如使用硬碟(Hard Disk)或固態硬碟(Solid State Drive)。輸入部24例如包含有鍵盤(Keyboard)28與滑鼠 (Mouse)29。顯示部25例如使用液晶顯示器。通信部26係有線通信或無線通信之介面電路。記錄媒體讀取部27係儲存有程式等之記錄媒體30之介面電路。記錄媒體30例如使用CD-ROM(Compact Disc Read-Only Memory；唯讀記憶光碟)、DVD-ROM(Digital Versatile Disc Read-Only Memory；數位多功能影音唯讀記憶光碟)、USB(Universal Serial Bus；通用序列匯流排)記憶體等非過渡性之記錄媒體。 FIG. 2 is a block diagram showing the structure of a computer that functions as a text exploration device 10. The computer 20 shown in FIG. 2 includes a CPU (Central Processing Unit) 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27 . The main memory 22 uses, for example, DRAM (Dynamic Random Access Memory). The storage unit 23 uses, for example, a hard disk (Hard Disk) or a solid state drive (Solid State Drive). The input unit 24 includes, for example, a keyboard 28 and a mouse (Mouse)29. The display unit 25 uses, for example, a liquid crystal display. The communication unit 26 is an interface circuit for wired communication or wireless communication. The recording medium reading unit 27 is an interface circuit of the recording medium 30 storing programs and the like. For the recording medium 30, for example, CD-ROM (Compact Disc Read-Only Memory; CD-ROM), DVD-ROM (Digital Versatile Disc Read-Only Memory; digital multi-function audio-visual CD-ROM), USB (Universal Serial Bus; (Universal Serial Bus) memory and other non-transitional recording media.

於電腦20執行文字探勘程式31之情形時，儲存部23儲存文字探勘程式31與文字資料5。文字探勘程式31與文字資料5例如既可為使用通信部26自伺服器或其他電腦接收者，亦可為使用記錄媒體讀取部27自記錄媒體30所讀取者。 When the computer 20 executes the text exploration program 31, the storage unit 23 stores the text exploration program 31 and the text data 5. The text exploration program 31 and the text data 5 may be, for example, those received from a server or other computer using the communication unit 26, or those read from the recording medium 30 using the recording medium reading unit 27.

When the text exploration program 31 is executed, the text exploration program 31 and the text data 5 are copied to the main memory 22. The CPU 21 uses the main memory 22 as working memory and processes the text data 5 stored in the main memory 22 by executing the text exploration program 31 stored there. In this state, the computer 20 functions as the text exploration device 10. The configuration of the computer 20 described above is only an example, and the text exploration device 10 may be configured using any computer.

In the following, the text data 5 is assumed to be Japanese data containing Japanese words. FIG. 17 lists the words that appear in the drawings and their descriptions. Each row of FIG. 17 gives a word (a Japanese word) and its meaning. When a Japanese word is mentioned in the following description, its meaning is sometimes given in parentheses after the word. The text data 5 may also be data in any other language.

FIG. 3 shows a display screen of the text exploration device 10. The display screen 40 shown in FIG. 3 includes an analysis result screen 41 and an analysis setting screen 42. The analysis result screen 41 displays the analysis result of the text analysis unit 12. The analysis setting screen 42 displays GUI (Graphical User Interface) elements for setting how the text analysis unit 12 performs its analysis and the characteristics of the screen data generated by the screen generation unit 13.

Once a number of clusters is set for the result of a hierarchical cluster analysis, the words belonging to each cluster are determined. When displaying on the screen the result of a hierarchical cluster analysis of the words extracted from the text data 5, the text exploration device 10 displays groups corresponding to the clusters in the manner shown in FIG. 3, instead of a dendrogram.

In the following description, a cluster displayed on the screen is also referred to as a group. Using the instruction input unit 11, the user specifies the number of groups (the number of clusters) and the maximum number of data items per group (the upper limit on the number of words included in a group). Hereinafter, the former is denoted m and the latter n.

In the text exploration device 10, the words included in the text data 5 are classified into m clusters, and each cluster contains one or more words. The analysis result screen 41 displays m groups, with words displayed inside each group. Each group is drawn as a cloud-shaped figure, and each word in a group is displayed inside an elliptical region. The number of words included in each group is limited to n or fewer. For example, if n = 5 and a cluster contains 10 words, then 5 words are displayed inside the corresponding group on the analysis result screen 41.

The analysis setting screen 42 displays a first slider and two first buttons (marked "+" and "-") for setting the number of groups m, a second slider and two second buttons for setting the maximum number of data items per group n, and four boxes and two third buttons (marked with a left arrow and a right arrow) for setting the analysis target period.

The user specifies the number of groups m by operating the mouse 29 to move the thumb of the first slider left or right or to press a first button. The number of groups m increases when the first button marked "+" is pressed and decreases when the first button marked "-" is pressed. The initial value of m is set, for example, to the square root of the number of distinct words included in the analysis result of the text analysis unit 12, or to an integer close to that square root. For example, when the analysis result of the text analysis unit 12 includes 16 distinct words, the initial value of m is set to 4.
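The initial value of m described above can be sketched as follows. This is a minimal illustration only: the function name `initial_group_count` is hypothetical, and rounding to the nearest integer is an assumption, since the text only requires "an integer close to that square root".

```python
import math


def initial_group_count(distinct_word_count: int) -> int:
    """Return an integer close to sqrt(distinct_word_count), never below 1.

    Rounding to the nearest integer is an assumption; the source only asks
    for a value at or near the square root.
    """
    if distinct_word_count < 1:
        raise ValueError("need at least one distinct word")
    return max(1, round(math.sqrt(distinct_word_count)))


print(initial_group_count(16))  # 4, matching the example in the text
```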

The user specifies the maximum number of data items per group n by operating the mouse 29 to move the thumb of the second slider left or right or to press a second button. The value of n increases or decreases when a second button is pressed. The initial value of n is set, for example, to 5.

When the text data 5 is time-series data, the user specifies the analysis target period by operating the keyboard 28 or the mouse 29, either entering the date and time in the four boxes or pressing a third button. When the third button marked with a left arrow is pressed, the analysis target period moves toward the past by a predetermined amount (for example, one month); when the third button marked with a right arrow is pressed, it moves by the same amount in the opposite direction. The initial value of the analysis target period is set, for example, to the period from the oldest time to the newest time in the text data 5. When the text data 5 is not time-series data, the user cannot specify an analysis target period.

The analysis result screen 41 displays at least one and at most m groups, and at least one and at most n words inside each group. The larger the total appearance frequency of the words in the corresponding cluster, the larger a group is displayed on the screen. When a cluster contains more than n words, the n words with the highest appearance frequencies are displayed inside the group. The higher the appearance frequency of a word, the larger the word and the elliptical region containing it are displayed on the screen. Each group is labeled with a name. The name of a group is the word with the highest appearance frequency among the words in the cluster, and it is displayed underlined inside the group. When a word cannot be displayed inside its elliptical region, the symbol "..." is displayed in place of the word.

The analysis result screen 41 also displays a third slider and two fourth buttons (marked "+" and "-") for specifying the zoom ratio. The user sets the zoom ratio by operating the mouse 29 to move the thumb of the third slider left or right or to press a fourth button. On the analysis result screen 41, the groups and the words they contain are enlarged or reduced according to the set zoom ratio. The initial value of the zoom ratio is 100%. In its initial state, the analysis result screen 41 displays all groups.

When the user changes the number of groups m, the maximum number of data items per group n, or the analysis target period on the analysis setting screen 42, the content of the analysis result screen 41 changes accordingly. When the user instructs word exclusion, synonym registration, or compound word registration on the analysis result screen 41, the content of the analysis result screen 41 likewise changes accordingly.

When performing hierarchical cluster analysis on the words extracted from the text data 5, the text exploration device 10 refers to an excluded word list storing words to be excluded, a synonym list storing words to be treated as synonyms, and a compound word list storing words to be treated as compound words. In the synonym list, a plurality of words having the same (or roughly the same) meaning are stored in association with one word that represents them. In the compound word list, a plurality of words that form one compound word when joined are stored in association with that compound word. For example, "daigakusei (university student)" and "gakusei (student)" are stored in the synonym list in association with "daigakusei", which represents both. Likewise, "nintai (endurance)" and "tsuyoi (strong)" are stored in the compound word list in association with "nintaizuyoi (high endurance)", which joins the two. The text exploration device 10 may have a plurality of synonym lists and a plurality of compound word lists.

FIG. 4 is a flowchart showing the operation of the text exploration device 10. FIG. 5 is a flowchart showing the details of the screen data generation process (step S111 in FIG. 4) of the text exploration device 10. The input unit 24 and the CPU 21 executing step S113 function as the instruction input unit 11. The CPU 21 executing steps S109 to S110 functions as the text analysis unit 12. The CPU 21 executing step S111 functions as the screen generation unit 13. The display unit 25 and the CPU 21 executing step S112 function as the analysis result display unit 14. The operation of the text exploration device 10 is described below with reference to FIGS. 4 and 5.

First, the CPU 21 causes the display unit 25 to display the data designation screen 51 shown in FIG. 6 (step S101). The data designation screen 51 displays a box for specifying a file name and a box for specifying a folder name. The user designates the text data 5 to be analyzed by specifying a file name or folder name on the data designation screen 51. The text data 5 may be stored in the storage unit 23, such as a hard disk, or in a server or another computer connected via the communication unit 26.

Next, the CPU 21 transfers the text data 5 designated on the data designation screen 51 to the main memory 22. The text data 5 is thereby input to the text exploration device 10 (step S102). FIG. 7 shows an example of the text data 5. The text data shown in FIG. 7 is taken from reports written by university students and is time-series data carrying date information. From the top, the entries shown in FIG. 7 read: "Regarding the relationship between university students and society covered in this course...", "In general, university students, after graduating and before entering society, work part-time or...", "We students must recognize that we are studying while paying expensive tuition...", and "Student life is a precious time for growing one's self-confidence. Moreover...". The text data 5 analyzed by the text exploration device 10 may be of any type.

Next, the CPU 21 causes the display unit 25 to display the purpose designation screen 52 shown in FIG. 8 (step S103). The purpose designation screen 52 displays three radio buttons corresponding to content, features, and evaluation. The user selects the analysis purpose from among content, features, and evaluation by operating the mouse 29 to press one of the radio buttons. The CPU 21 then receives the analysis purpose designated on the purpose designation screen 52. The analysis purpose is thereby input to the text exploration device 10 (step S104).

Next, the CPU 21 causes the display unit 25 to display the synonym list selection screen 53 shown in FIG. 9 (step S105). The synonym list selection screen 53 displays the names of the synonym lists held by the text exploration device 10 and the synonyms registered in each list. The user designates the synonym list to be used by operating the mouse 29 to select one of the lists on the synonym list selection screen 53. A synonym list is thereby selected in the text exploration device 10 (step S106).

Next, the CPU 21 causes the display unit 25 to display the compound word list selection screen 54 shown in FIG. 10 (step S107). The compound word list selection screen 54 displays the names of the compound word lists held by the text exploration device 10 and the compound words registered in each list. The user designates the compound word list to be used by operating the mouse 29 to select one of the lists on the compound word list selection screen 54. A compound word list is thereby selected in the text exploration device 10 (step S108).

Next, taking the excluded word list, the synonym list, and the compound word list into account, the CPU 21 extracts, from the portion of the text data 5 input in step S102 that falls within the analysis target period, words of the types corresponding to the analysis purpose designated in step S104 (step S109). When the analysis purpose is "content", the CPU 21 extracts nouns, proper nouns, place names, and personal names from the text data 5. When the analysis purpose is "features", the CPU 21 extracts nouns, proper nouns, sa-row irregular conjugation nouns, and verbs from the text data 5. When the analysis purpose is "evaluation", the CPU 21 extracts adjectives, adjectival verbs, and interjections from the text data 5. The text exploration device 10 may also support analysis purposes other than these three, and the CPU 21 may extract types of words different from those listed above for each analysis purpose.

When the text data 5 is time-series data, the CPU 21, in executing step S109, extracts words only from the portion of the text data 5 that falls within the analysis target period specified by the user. When a word W1 is stored in the excluded word list, the CPU 21 ignores every occurrence of W1 in the text data 5 when executing step S109. When words W2 and W3 are stored in the selected synonym list in association with W2 as their representative, the CPU 21 treats every occurrence of W3 in the text data 5 as W2 when executing step S109. When words W4 and W5 are stored in the selected compound word list in association with a word W6 that joins them, the CPU 21 treats every occurrence of W4 immediately followed by W5 in the text data 5 as W6 when executing step S109.
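The handling of the three lists in step S109 can be sketched as below. This is an illustrative sketch, not the device's implementation: the function `normalize_words`, the dictionary and set representations of the lists, and the order in which the lists are applied (compounds first, then synonyms, then exclusion) are all assumptions, and the tokens are assumed to come from a separate morphological analysis step that is not shown.

```python
def normalize_words(tokens, excluded, synonyms, compounds):
    """Apply compound, synonym, and exclusion lists to a token sequence.

    tokens:    words in document order (output of a tokenizer, not shown)
    excluded:  set of words to ignore                  (W1 in the text)
    synonyms:  dict mapping a word to its representative  (W3 -> W2)
    compounds: dict mapping an adjacent pair to a compound ((W4, W5) -> W6)
    """
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            word = compounds[pair]  # join W4 followed by W5 into W6
            i += 2
        else:
            word = tokens[i]
            i += 1
        word = synonyms.get(word, word)  # replace W3 with W2
        if word not in excluded:         # drop W1
            result.append(word)
    return result


tokens = ["gakusei", "nintai", "tsuyoi", "shakai", "daigakusei"]
print(normalize_words(
    tokens,
    excluded={"shakai"},
    synonyms={"gakusei": "daigakusei"},
    compounds={("nintai", "tsuyoi"): "nintaizuyoi"},
))
# ['daigakusei', 'nintaizuyoi', 'daigakusei']
```

Because both "gakusei" and "daigakusei" end up counted as "daigakusei", the representative word's appearance frequency becomes the sum of the two, which is the effect described for synonym registration.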

Next, the CPU 21 performs hierarchical cluster analysis on the words extracted in step S109 (step S110). In step S110, the CPU 21 obtains the similarity between two words based, for example, on the distance between them in the text data 5 (that is, how far apart the two words appear). Based on the similarities obtained, the CPU 21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the decimal method, or Ward's method). In step S110, the CPU 21 also obtains the appearance frequency of each word.
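As a rough illustration of step S110, the sketch below clusters words with the group average method, one of the methods named above. The distance measure (mean gap in token positions over all pairs of occurrences) is an assumption, since the text does not define how the distance between two words is computed, and the function names are hypothetical. Instead of building a full dendrogram and cutting it, the loop simply stops merging once m clusters remain, which yields the same m clusters for this agglomerative procedure.

```python
from collections import Counter
from itertools import combinations


def word_distance(pos_a, pos_b):
    """Mean gap, in tokens, over all pairs of occurrences of two words."""
    gaps = [abs(a - b) for a in pos_a for b in pos_b]
    return sum(gaps) / len(gaps)


def cluster_distance(cluster_a, cluster_b, positions):
    """Group-average linkage: mean word distance across two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    total = sum(word_distance(positions[a], positions[b]) for a, b in pairs)
    return total / len(pairs)


def cluster_words(tokens, m):
    """Agglomeratively merge the distinct words of `tokens` into m clusters."""
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)
    clusters = [[word] for word in positions]  # one cluster per distinct word
    while len(clusters) > max(1, m):
        # Merge the two clusters with the smallest group-average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                clusters[ij[0]], clusters[ij[1]], positions),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


tokens = "a b a b c d c d".split()
print(cluster_words(tokens, 2))  # [['a', 'b'], ['c', 'd']]
print(Counter(tokens)["a"])      # appearance frequency of "a": 2
```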

Next, the CPU 21 generates screen data for displaying the analysis result, based on the result of the hierarchical cluster analysis obtained in step S110 (step S111). In step S111, the CPU 21 performs the processing shown in FIG. 5.

The CPU 21 sets the number of groups to m and the maximum number of data items per group to n (step S201). Next, the CPU 21 sets the number of clusters to m for the result of the hierarchical cluster analysis and obtains m clusters (step S202). Next, for each cluster, the CPU 21 obtains the total appearance frequency of the words in the cluster (step S203). Next, the CPU 21 determines the display size of each group based on the totals obtained in step S203 (step S204). In step S204, the larger the total appearance frequency of the words in a cluster, the larger the display size determined for the group.

Next, for each cluster, the CPU 21 selects the words to be displayed from among the words in the cluster (step S205). In step S205, up to n words are selected from each cluster in descending order of appearance frequency. Next, for each word selected in step S205, the CPU 21 determines the display size of the word based on its appearance frequency (step S206). In step S206, the higher the appearance frequency of a word, the larger the display size determined for it.

Next, the CPU 21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207). The screen data generated in step S207 contains m groups (each represented by a cloud-shaped figure) having the sizes determined in step S204. Each group contains up to n words having the sizes determined in step S206. On the screen, the words are displayed inside their group. After executing step S207, the CPU 21 ends the screen data generation process.
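Steps S203 to S207 can be sketched as follows, assuming the m clusters of step S202 are already available as lists of words. The function `build_screen_data` and the dictionary layout of a group are hypothetical, and raw frequencies stand in for display sizes because the text does not specify how a frequency maps to an on-screen size.

```python
def build_screen_data(clusters, frequency, n):
    """Build one group record per cluster (a sketch of steps S203 to S207).

    clusters:  list of word lists, already cut to m clusters (step S202)
    frequency: dict mapping each word to its appearance frequency
    n:         maximum number of words shown per group
    """
    groups = []
    for words in clusters:
        total = sum(frequency[w] for w in words)  # S203: total frequency
        # S205: keep up to n words, most frequent first (stable for ties).
        shown = sorted(words, key=lambda w: -frequency[w])[:n]
        groups.append({
            "name": shown[0],  # the most frequent word names the group
            "size": total,     # S204: larger total, larger group
            # S206: each word's size follows its own frequency.
            "words": [(w, frequency[w]) for w in shown],
        })
    return groups


frequency = {"daigakusei": 5, "shakai": 8, "kankei": 2, "jikan": 3}
clusters = [["daigakusei", "shakai", "kankei"], ["jikan"]]
for group in build_screen_data(clusters, frequency, n=2):
    print(group)
# {'name': 'shakai', 'size': 15, 'words': [('shakai', 8), ('daigakusei', 5)]}
# {'name': 'jikan', 'size': 3, 'words': [('jikan', 3)]}
```

Changing only m or n requires rerunning just this stage, whereas changing the analysis target period or any of the word lists requires rerunning word extraction and clustering first.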

Next, the CPU 21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112). Next, the CPU 21 receives an instruction from the user (step S113). Next, the CPU 21 proceeds to one of steps S115 to S120 according to the type of instruction received in step S113 (step S114).

When the instruction received in step S113 is "set number of groups", the CPU 21 proceeds to step S115. In this case, the CPU 21 sets the number of groups m to the value specified by the user (step S115) and proceeds to step S111. Screen data is then generated according to the newly set m, and a new screen is displayed. As a result, an analysis result screen containing the specified number of groups is displayed.

When the instruction received in step S113 is "set maximum number of data items per group", the CPU 21 proceeds to step S116. In this case, the CPU 21 sets the maximum number of data items per group n to the value specified by the user (step S116) and proceeds to step S111. Screen data is then generated according to the newly set n, and a new screen is displayed. As a result, an analysis result screen is displayed in which the number of words in each group is limited to the specified value or fewer.

When the instruction received in step S113 is "set analysis target period", the CPU 21 proceeds to step S117. In this case, the CPU 21 sets the analysis target period to the period specified by the user (step S117) and proceeds to step S109. Hierarchical cluster analysis is then performed with reference to the newly set analysis target period, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis of the words contained in the text data within the specified analysis target period.

FIG. 11A shows the analysis result screen before the analysis target period is set. FIG. 11B shows the analysis result screen after the analysis target period is set. The pre-setting analysis result screen 61 shown in FIG. 11A displays the result of hierarchical cluster analysis of the words contained in the portion of the input text data 5 from 0:00 on January 1, 2014 to 24:00 on December 31, 2015. The post-setting analysis result screen 62 shown in FIG. 11B displays the result of hierarchical cluster analysis of the words contained in the portion of the input text data 5 from 0:00 on March 1, 2014 to 24:00 on September 30, 2014. The content displayed on the analysis result screen 61 differs from that displayed on the analysis result screen 62. By comparing the analysis result screens before and after setting the analysis target period, the user can easily recognize how the result of the hierarchical cluster analysis changes over time.

When the instruction received in step S113 is "exclude word", the CPU 21 proceeds to step S118. In this case, the CPU 21 adds the specified word to the excluded word list (step S118) and proceeds to step S109. Hierarchical cluster analysis is then performed with the specified word excluded, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis performed with the specified word excluded.

FIG. 12A shows the analysis result screen before word exclusion. FIG. 12B shows the analysis result screen after word exclusion. The user operates the mouse 29 to select the word to be excluded and then instructs word exclusion. On the pre-exclusion analysis result screen 63 shown in FIG. 12A, "shakai (society)" is selected, and "exclude word" is selected from the menu. The result of hierarchical cluster analysis performed with "shakai" excluded is then displayed on the screen. On the post-exclusion analysis result screen 64 shown in FIG. 12B, "shingaku (advancing to higher education)" is displayed in place of "shakai". Among the words in the same cluster as "shakai", "shingaku" has the highest appearance frequency after the five words displayed on the analysis result screen 63.

When the instruction received in step S113 is "register synonym", the CPU 21 proceeds to step S119. In this case, the CPU 21 adds the specified words to the synonym list in use (step S119) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the specified synonyms into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis performed with the specified words treated as synonyms.

FIG. 13A shows the analysis result screen before synonym registration. FIG. 13B shows the analysis result screen after synonym registration. The user operates the mouse 29 to select the plurality of words to be registered as synonyms and then instructs synonym registration. On the pre-registration analysis result screen 65 shown in FIG. 13A, "daigakusei (university student)" and "gakusei (student)" are selected, and "register synonym" is selected from the menu. The result of hierarchical cluster analysis performed with "daigakusei" and "gakusei" treated as synonyms is then displayed on the screen. On the post-registration analysis result screen 66 shown in FIG. 13B, "shingaku (advancing to higher education)" is displayed in place of "gakusei". Because its size now reflects the sum of the appearance frequencies of "daigakusei" and "gakusei", "daigakusei" is displayed larger than on the analysis result screen 65.

When the instruction received in step S113 is "register compound word", the CPU 21 proceeds to step S120. In this case, the CPU 21 adds the specified words to the compound word list in use (step S120) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the specified compound word into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. As a result, the screen shows the result of hierarchical cluster analysis performed with the specified words treated as a compound word.

FIG. 14A shows the analysis result screen before compound word registration. FIG. 14B shows the analysis result screen after compound word registration. The user operates the mouse 29 to select the plurality of words to be registered as a compound word and then instructs compound word registration. On the pre-registration analysis result screen 67 shown in FIG. 14A, "nintai (endurance)" and "tsuyoi (strong)" are selected, and "register compound word" is selected from the menu. The result of hierarchical cluster analysis performed with "nintai" and "tsuyoi" treated as a compound word is then displayed on the screen. On the post-registration analysis result screen 68 shown in FIG. 14B, "nintaizuyoi (high endurance)" is displayed in place of "nintai" and "tsuyoi", at a size no larger than that of "nintai" and "tsuyoi".
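The branching of steps S114 to S120 described above can be summarized in a short sketch. The instruction names, the `state` dictionary, and the function `handle_instruction` are hypothetical; only the mapping from each instruction to the step at which processing resumes (S111 for m and n, S109 for everything that changes the extracted words) is taken from the description.

```python
# Step from which processing resumes after each instruction.
RESUME_STEP = {
    "set_group_count": "S111",    # S115: only the screen data is rebuilt
    "set_max_items": "S111",      # S116
    "set_period": "S109",         # S117: words are re-extracted
    "exclude_word": "S109",       # S118
    "register_synonym": "S109",   # S119
    "register_compound": "S109",  # S120
}


def handle_instruction(state, kind, value):
    """Apply one user instruction to `state`; return the step to resume from."""
    if kind == "set_group_count":
        state["m"] = value
    elif kind == "set_max_items":
        state["n"] = value
    elif kind == "set_period":
        state["period"] = value  # assumed to be a (start, end) pair
    elif kind == "exclude_word":
        state["excluded"].add(value)
    elif kind == "register_synonym":
        variant, representative = value
        state["synonyms"][variant] = representative
    elif kind == "register_compound":
        pair, compound = value
        state["compounds"][tuple(pair)] = compound
    else:
        raise ValueError(f"unknown instruction: {kind}")
    return RESUME_STEP[kind]


state = {"m": 4, "n": 5, "period": None,
         "excluded": set(), "synonyms": {}, "compounds": {}}
print(handle_instruction(state, "set_group_count", 6))    # S111
print(handle_instruction(state, "exclude_word", "shakai"))  # S109
```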

As described above, the text exploration method of this embodiment includes: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data. In the screen generation step, m clusters are obtained from the analysis result according to the number of groups m and the maximum number of data items per group n, and screen data is generated for displaying on the screen groups that each contain n or fewer of the words in the corresponding cluster. According to the text exploration method of this embodiment, groups containing the words of each cluster can be displayed on the screen based on the result of hierarchical cluster analysis of the words contained in the text data. In addition, the number of words in each group is limited to n or fewer. The user can therefore intuitively understand the result of the hierarchical cluster analysis when viewing the screen.

The words included in a group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to the group. Each group therefore shows the high-frequency words of its cluster, and the user can easily recognize the high-frequency words included in each cluster. In addition, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the corresponding cluster. The user can therefore easily recognize clusters whose total word frequency is large. Furthermore, each word included in a group has, within the screen, a size corresponding to its frequency of occurrence. The user can therefore easily recognize words with a high frequency of occurrence.
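The sizing rules above can be illustrated with a short sketch. The description states only that sizes correspond to frequencies; the proportional area share, the linear font scaling, the point-size range, and the names `group_area_shares` and `font_size` are assumptions made here for illustration.

```python
def group_area_shares(clusters, freq):
    """Share of the screen area per group, proportional to the total
    frequency of all words in the corresponding cluster."""
    totals = [sum(freq[w] for w in cluster) for cluster in clusters]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def font_size(word_freq, max_freq, min_pt=12.0, max_pt=36.0):
    """Linear mapping from frequency to a font size in points (assumed)."""
    return min_pt + (max_pt - min_pt) * word_freq / max_freq


freq = {"a": 5, "b": 9, "c": 2, "d": 4, "e": 4}
clusters = [["a", "b", "c"], ["d", "e"]]
shares = group_area_shares(clusters, freq)
largest = font_size(freq["b"], max(freq.values()))
```

Note that the group size uses every word in the cluster, not only the words displayed, which matches the statement that the size corresponds to the total frequency of the words included in the cluster.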

The text exploration method also includes an instruction input step for inputting instructions from the user, and either the text analysis step or the screen generation step is executed in accordance with the instruction input in the instruction input step. The display mode of the result of the hierarchical cluster analysis can therefore be switched according to instructions from the user. In particular, the instruction input step receives a setting instruction for the number of groups m, and the screen generation step generates the screen data according to the number of groups m specified in the instruction input step. The number of regions (number of clusters) displayed on the screen is thereby switched according to the user's instruction. The instruction input step also receives the maximum number of data items per group n, and the screen generation step generates the screen data according to the maximum number n specified in the instruction input step. The number of words displayed in each region is thereby switched according to the user's instruction.

The instruction input step also receives an instruction specifying an analysis target period, and the text analysis step performs hierarchical cluster analysis on the words included in the portion of the text data that falls within the analysis target period specified in the instruction input step. The result of the hierarchical cluster analysis of the words included in the text data within the period specified by the user is thus displayed on the screen, so the user can easily recognize how the result of the hierarchical cluster analysis changes over time. The instruction input step also receives a setting instruction for the analysis purpose, and the text analysis step extracts from the text data 5 words of the type corresponding to the analysis purpose set in the instruction input step, and performs hierarchical cluster analysis on them. The type of word subject to analysis can thereby be switched according to the analysis purpose specified by the user, and the result of the hierarchical cluster analysis is displayed on the screen.
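The two filters described above (analysis target period and word type) might be sketched as below. The record layout, the inclusive date boundaries, the word-type labels, and the name `select_words` are hypothetical; this section does not specify how an analysis purpose maps to word types.

```python
from datetime import date


def select_words(records, start, end, allowed_types):
    """Collect words of the allowed types from records dated within
    [start, end] (both ends inclusive, an assumption of this sketch)."""
    words = []
    for record in records:
        if start <= record["date"] <= end:
            words.extend(
                word for word, kind in record["tokens"] if kind in allowed_types
            )
    return words


# Hypothetical pre-tokenized records with a word-type tag per token.
records = [
    {"date": date(2016, 6, 30), "tokens": [("screen", "noun"), ("bright", "adjective")]},
    {"date": date(2016, 7, 25), "tokens": [("battery", "noun"), ("weak", "adjective")]},
    {"date": date(2016, 9, 1), "tokens": [("manual", "noun")]},
]
selected = select_words(records, date(2016, 7, 1), date(2016, 8, 31), {"noun"})
```

Running the same analysis over successive periods and comparing the resulting screens is one way to observe the change over time mentioned in the text.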

The instruction input step also receives a word exclusion instruction, and the text analysis step performs hierarchical cluster analysis with the words specified in the instruction input step excluded. The result of the hierarchical cluster analysis can thereby be displayed with the user-specified words excluded. The instruction input step also receives a synonym registration instruction, and the text analysis step performs hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word. The result of the hierarchical cluster analysis can thereby be displayed on the screen with the user-specified words treated as the same word. The instruction input step also receives a compound word registration instruction, and the text analysis step performs hierarchical cluster analysis after combining the plurality of words specified in the instruction input step into a single word. The result of the hierarchical cluster analysis can thereby be displayed on the screen with the user-specified words combined into a single word.
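The three user instructions above (word exclusion, synonym registration, compound word registration) can be pictured as a preprocessing pass over the extracted words before clustering. The sketch below is an assumption-laden illustration: the function name `apply_user_dictionary`, the order in which the rules are applied, and the requirement that compound parts be adjacent are not taken from the source. The pair "nintai" and "tsuyoi" comes from FIG. 14A and 14B; "gaman" and "desu" are made-up sample tokens.

```python
def apply_user_dictionary(tokens, excluded=(), synonyms=None, compounds=None):
    """Apply compound merging, then synonym unification, then exclusion."""
    synonyms = synonyms or {}
    compounds = compounds or {}

    # 1. Compound words: replace an adjacent run of parts with one word.
    merged = []
    i = 0
    while i < len(tokens):
        for parts, joined in compounds.items():
            if tuple(tokens[i:i + len(parts)]) == parts:
                merged.append(joined)
                i += len(parts)
                break
        else:
            merged.append(tokens[i])
            i += 1

    # 2. Synonyms: map each registered word to its representative word.
    canonical = [synonyms.get(token, token) for token in merged]

    # 3. Exclusion: drop the words the user asked to exclude.
    return [token for token in canonical if token not in excluded]


tokens = ["nintai", "tsuyoi", "desu", "gaman", "nintai"]
cleaned = apply_user_dictionary(
    tokens,
    excluded={"desu"},
    synonyms={"gaman": "nintai"},
    compounds={("nintai", "tsuyoi"): "nintaizuyoi"},
)
```

After this pass the frequencies are recounted, so a merged or unified word carries the combined count into the hierarchical cluster analysis.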

The screen generation step also generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting the display mode of the analysis result screen. Both the analysis result screen and the analysis setting screen are therefore displayed, and the user can easily switch the display mode of the result of the hierarchical cluster analysis by using the analysis setting screen.

The text exploration program 31 of this embodiment and the text exploration device 10 of this embodiment have the same configuration as the text exploration method of this embodiment and provide the same effects.

With the text exploration method, text exploration program, and text exploration device of this embodiment, groups that each contain no more than the maximum number of the words included in a cluster can be displayed on the screen, based on the result of hierarchical cluster analysis of the words included in the text data. The user can therefore understand the result of the hierarchical cluster analysis intuitively when looking at the screen.

This application claims priority based on Japanese Patent Application No. 2016-145065, filed on July 25, 2016 and titled "Text Exploration Method, Text Exploration Program, and Text Exploration Device", the contents of which are incorporated into this application by reference.

40‧‧‧Display screen

41‧‧‧Analysis result screen

42‧‧‧Analysis setting screen

Claims

A text exploration method for displaying an analysis result of text data on a screen, characterized by comprising: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on an analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data; wherein the screen generation step obtains, based on a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in a cluster, each group being given, as its name, the word with the highest frequency of occurrence among the words included in the cluster.

The text exploration method of claim 1, wherein the words included in the group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to the group.

The text exploration method of claim 2, wherein the group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to the group.

The text exploration method of claim 3, wherein each word included in the group has, within the screen, a size corresponding to the frequency of occurrence of the word.

The text exploration method of claim 1, further comprising an instruction input step for inputting an instruction from a user, wherein either of the text analysis step and the screen generation step is executed according to the instruction input in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for the number of groups, and the screen generation step generates the screen data according to the number of groups set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for the maximum number of data items, and the screen generation step generates the screen data according to the maximum number of data items set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words included in the portion of the text data that falls within the analysis target period set in the instruction input step.

The text exploration method of claim 5, wherein the instruction input step receives a setting instruction for an analysis purpose, and the text analysis step extracts from the text data words of the type corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis.

The text exploration method of claim 5, wherein the instruction input step receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words specified in the instruction input step excluded.

The text exploration method of claim 5, wherein the instruction input step receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis while treating the plurality of words specified in the instruction input step as the same word.

The text exploration method of claim 5, wherein the instruction input step receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis after combining the plurality of words specified in the instruction input step into a single word.

The text exploration method of claim 1, wherein the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting the display mode of the analysis result screen.

A computer-readable recording medium on which is recorded a text exploration program for displaying an analysis result of text data on a screen, characterized in that the program causes a computer to execute, by means of a CPU using a memory, the following steps: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on an analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data; wherein the screen generation step obtains, based on a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in a cluster, each group being given, as its name, the word with the highest frequency of occurrence among the words included in the cluster.

The computer-readable recording medium of claim 14, wherein the words included in the group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to the group.

The computer-readable recording medium of claim 15, wherein the group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to the group.

The computer-readable recording medium of claim 16, wherein each word included in the group has, within the screen, a size corresponding to the frequency of occurrence of the word.

The computer-readable recording medium of claim 14, wherein the program further causes the computer to execute an instruction input step for inputting an instruction from a user, and either of the text analysis step and the screen generation step is executed according to the instruction input in the instruction input step.

The computer-readable recording medium of claim 14, wherein the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting the display mode of the analysis result screen.

A text exploration device for displaying an analysis result of text data on a screen, characterized by comprising: a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on an analysis result of the text analysis unit; and an analysis result display unit that displays a screen based on the screen data; wherein the screen generation unit obtains, based on a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain no more than the maximum number of data items of the words included in a cluster, each group being given, as its name, the word with the highest frequency of occurrence among the words included in the cluster.

The text exploration device of claim 20, wherein the words included in the group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to the group.

The text exploration device of claim 21, wherein the group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to the group.

The text exploration device of claim 22, wherein each word included in the group has, within the screen, a size corresponding to the frequency of occurrence of the word.

The text exploration device of claim 20, further comprising an instruction input unit for inputting an instruction from a user, wherein either of the text analysis unit and the screen generation unit operates according to the instruction input via the instruction input unit.

The text exploration device of claim 20, wherein the screen generation unit generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting the display mode of the analysis result screen.