TWI686716B - Text exploration method, computer-readable recording medium and text exploration device recorded with text exploration program - Google Patents
- Publication number
- TWI686716B (TW106122011A)
- Authority
- TW
- Taiwan
- Prior art keywords
- screen
- analysis
- text
- group
- data
Classifications
- G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
- G06F16/358: Browsing; Visualisation therefor (under G06F16/35: Clustering; Classification)
- G06F16/313: Selection or weighting of terms for indexing (under G06F16/31: Indexing; Data structures therefor; Storage structures)
- G06F16/34: Browsing; Visualisation therefor (under G06F16/30: Information retrieval of unstructured textual data)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
In the text analysis step (S109 to S110), hierarchical cluster analysis is performed on words extracted from the input text data. In the screen generation step (S111), m clusters are obtained from the analysis result of the text analysis step according to the number of groups m and the maximum number of data items per group n, and screen data is generated for displaying on the screen groups that each contain at most n of the words included in a cluster. In the analysis result display step (S112), a screen is displayed based on the generated screen data. The result of the hierarchical cluster analysis is thereby displayed on the screen in a way the user can understand intuitively.
Description
The present invention relates to text exploration, and in particular to a text exploration method, a text exploration program, and a text exploration device that display the result of analyzing text data on a screen.
In recent years, text exploration, which analyzes large volumes of free-form text data and derives useful information from the result, has attracted attention. In text exploration, for example, words are extracted from the text data under analysis, and information is obtained by analyzing how frequently the words appear and what trends their appearance shows.
The following considers a text exploration device that performs hierarchical cluster analysis on words extracted from text data and displays the analysis result on a screen. In hierarchical cluster analysis, clusters containing highly similar words are built hierarchically according to the similarity between words. In general, the result of hierarchical cluster analysis is presented to the user (analyst) as a dendrogram (tree diagram) such as the one shown in FIG. 15.
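As a rough illustration of the hierarchical cluster analysis described above, the following Python sketch merges words bottom-up by similarity until a requested number of clusters remains. The Jaccard similarity over the documents in which words co-occur, the single-linkage rule, the sample words, and every function name are assumptions made for this example; the specification does not state which similarity measure or linkage the device uses.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Similarity of two words, each given as the set of document ids containing it."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_words(occurrences: dict, m: int) -> list:
    """Agglomerative (bottom-up) clustering, stopped when m clusters remain.

    `occurrences` maps each word to the set of document ids containing it
    (a hypothetical input format chosen for this sketch).
    """
    clusters = [[w] for w in sorted(occurrences)]
    while len(clusters) > max(m, 1):
        best_pair, best_sim = None, -1.0
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage (assumed): similarity of the closest pair of members.
            sim = max(
                jaccard(occurrences[a], occurrences[b])
                for a in clusters[i]
                for b in clusters[j]
            )
            if sim > best_sim:
                best_pair, best_sim = (i, j), sim
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]  # merge the most similar pair
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


# Hypothetical sample data: word -> ids of the documents it appears in.
docs = {
    "gakusei": {1, 2, 3},
    "daigaku": {1, 2},
    "shakai": {4, 5},
    "shigoto": {4, 5, 6},
}
result = cluster_words(docs, 2)
print(result)
```

Recording the order of the merges instead of stopping at m clusters would yield the dendrogram mentioned in the text; cutting that dendrogram at a chosen level corresponds to choosing m here.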
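In connection with the present invention, Patent Document 1 describes a clustering device having hierarchical clustering means that constructs a dendrogram, traverses the dendrogram to generate an index that can be followed from lower levels to upper levels, and stores the index in storage means. Patent Document 2 describes a query providing device having: distance matrix calculation means that calculates the distances between keywords, generates distance matrix data from which keywords and the distances between them can be looked up, and stores the data in storage means; and clustering means that clusters the keywords hierarchically using the distance matrix and stores the result in storage means as a bottom-up index with which the constructed dendrogram can be traversed from lower levels to upper levels.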
[Patent Document 1] Japanese Patent Laid-Open No. 2011-216021
[Patent Document 2] Japanese Patent Laid-Open No. 2012-150539
A conventional text exploration device displays the result of hierarchical cluster analysis on the screen as a dendrogram. With such a device, however, the user cannot grasp the analysis result intuitively. For example, when the user sets the number of clusters to 4 for the analysis result shown in FIG. 15, a cut line is set on the dendrogram as shown in FIG. 16. Merely looking at such a dendrogram, however, does not let the user recognize intuitively which words each cluster contains. Moreover, when there are many words and the user changes the number of clusters, the user cannot grasp intuitively how the words in each cluster will change.
In addition, because a dendrogram does not show how frequently each word appears, the user cannot tell which words are more important. Also, when the text data under analysis is time-series data carrying information such as dates or times, the user may want to know how the analysis result changes over time. Conventional text exploration devices cannot meet these needs.
An object of the present invention is therefore to provide a text exploration method, a text exploration program, and a text exploration device that display the result of hierarchical cluster analysis on a screen in a way the user can understand intuitively.
A first aspect of the present invention is a text exploration method for displaying the analysis result of text data on a screen, comprising: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain, up to the maximum number of data items, the words included in the corresponding cluster.
A second aspect of the present invention is characterized in that, in the first aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.
A third aspect of the present invention is characterized in that, in the second aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.
A fourth aspect of the present invention is characterized in that, in the third aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.
A fifth aspect of the present invention is characterized in that, in the first aspect, the method further comprises an instruction input step of inputting an instruction from the user, and either the text analysis step or the screen generation step is executed according to the instruction input in the instruction input step.
A sixth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the number of groups, and the screen generation step generates the screen data according to the number of groups set in the instruction input step.
A seventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting the maximum number of data items, and the screen generation step generates the screen data according to the maximum number of data items set in the instruction input step.
An eighth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis target period, and the text analysis step performs the hierarchical cluster analysis on the words contained in that part of the text data which falls within the analysis target period set in the instruction input step.
A ninth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives an instruction setting an analysis purpose, and the text analysis step extracts from the text data words of the kind corresponding to the analysis purpose set in the instruction input step and performs the hierarchical cluster analysis on them.
A tenth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a word exclusion instruction, and the text analysis step performs the hierarchical cluster analysis with the words indicated in the instruction input step excluded.
An eleventh aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a synonym registration instruction, and the text analysis step performs the hierarchical cluster analysis treating the plurality of words indicated in the instruction input step as the same word.
A twelfth aspect of the present invention is characterized in that, in the fifth aspect, the instruction input step receives a compound word registration instruction, and the text analysis step performs the hierarchical cluster analysis with the plurality of words indicated in the instruction input step merged into a single word.
A thirteenth aspect of the present invention is characterized in that, in the first aspect, the screen generation step generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting how the analysis result screen is displayed.
A fourteenth aspect of the present invention is a computer-readable recording medium on which is recorded a text exploration program for displaying the analysis result of text data on a screen, the program causing a computer to execute, by a CPU (central processing unit) using memory, the following steps: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain, up to the maximum number of data items, the words included in the corresponding cluster.
A fifteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.
A sixteenth aspect of the present invention is characterized in that, in the fifteenth aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.
A seventeenth aspect of the present invention is characterized in that, in the sixteenth aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.
An eighteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the computer is further caused to execute an instruction input step of inputting an instruction from the user, and either the text analysis step or the screen generation step is executed according to the instruction input in the instruction input step.
A nineteenth aspect of the present invention is characterized in that, in the fourteenth aspect, the screen generation step generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting how the analysis result screen is displayed.
A twentieth aspect of the present invention is a text exploration device for displaying the analysis result of text data on a screen, comprising: a text analysis unit that performs hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on the analysis result of the text analysis unit; and an analysis result display unit that displays a screen based on the screen data, wherein the screen generation unit obtains, according to a number of groups and a maximum number of data items per group, as many clusters as the number of groups from the analysis result, and generates screen data for displaying on the screen groups that each contain, up to the maximum number of data items, the words included in the corresponding cluster.
A twenty-first aspect of the present invention is characterized in that, in the twentieth aspect, the words included in each group are selected, in descending order of frequency of occurrence, from the words included in the cluster corresponding to that group.
A twenty-second aspect of the present invention is characterized in that, in the twenty-first aspect, each group has, within the screen, a size corresponding to the total frequency of occurrence of the words included in the cluster corresponding to that group.
A twenty-third aspect of the present invention is characterized in that, in the twenty-second aspect, each word included in a group has, within the screen, a size corresponding to the frequency of occurrence of that word.
A twenty-fourth aspect of the present invention is characterized in that, in the twentieth aspect, the device further comprises an instruction input unit for inputting an instruction from the user, and either the text analysis unit or the screen generation unit operates according to the instruction input to the instruction input unit.
A twenty-fifth aspect of the present invention is characterized in that, in the twentieth aspect, the screen generation unit generates screen data for displaying an analysis result screen that includes the groups and an analysis setting screen for setting how the analysis result screen is displayed.
According to the first, fourteenth, or twentieth aspect of the present invention, groups containing the words included in the clusters are displayed on the screen, based on the result of hierarchical cluster analysis of the words contained in the text data. The number of words in each group is limited to the maximum number of data items or fewer. The user can therefore understand the result of the hierarchical cluster analysis intuitively when looking at the screen.
According to the second, fifteenth, or twenty-first aspect of the present invention, the words that occur most frequently among those in the cluster are displayed inside each group. The user can therefore easily recognize the frequently occurring words in each cluster.
According to the third, sixteenth, or twenty-second aspect of the present invention, each group has a size within the screen corresponding to the total frequency of occurrence of the words in its cluster. The user can therefore easily recognize the clusters whose words have a large total frequency of occurrence.
According to the fourth, seventeenth, or twenty-third aspect of the present invention, each word has a size within the screen corresponding to its frequency of occurrence. The user can therefore easily recognize the words that occur frequently.
According to the fifth, eighteenth, or twenty-fourth aspect of the present invention, the way the result of the hierarchical cluster analysis is displayed can be switched in response to an instruction from the user.
According to the sixth aspect of the present invention, the number of groups (the number of clusters) displayed on the screen can be switched in response to an instruction from the user.
According to the seventh aspect of the present invention, the upper limit on the number of words included in a group can be switched in response to an instruction from the user.
According to the eighth aspect of the present invention, the result of hierarchical cluster analysis of the words contained in the text data within the analysis target period indicated by the user is displayed on the screen. The user can therefore easily recognize how the result of the hierarchical cluster analysis changes over time.
According to the ninth aspect of the present invention, the kind of word to be analyzed can be switched according to the analysis purpose indicated by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.
According to the tenth aspect of the present invention, the words indicated by the user can be excluded, and the result of the hierarchical cluster analysis can be displayed on the screen.
According to the eleventh aspect of the present invention, the plurality of words indicated by the user can be treated as the same word, and the result of the hierarchical cluster analysis can be displayed on the screen.
According to the twelfth aspect of the present invention, the plurality of words indicated by the user can be merged into a single word, and the result of the hierarchical cluster analysis can be displayed on the screen.
According to the thirteenth, nineteenth, or twenty-fifth aspect of the present invention, the analysis result screen and the analysis setting screen are displayed. The user can therefore use the analysis setting screen to switch easily how the result of the hierarchical cluster analysis is displayed.
5‧‧‧Text data
10‧‧‧Text exploration device
11‧‧‧Instruction input unit
12‧‧‧Text analysis unit
13‧‧‧Screen generation unit
14‧‧‧Analysis result display unit
20‧‧‧Computer
21‧‧‧CPU
22‧‧‧Main memory
23‧‧‧Storage unit
24‧‧‧Input unit
25‧‧‧Display unit
26‧‧‧Communication unit
27‧‧‧Recording medium reading unit
28‧‧‧Keyboard
29‧‧‧Mouse
30‧‧‧Recording medium
31‧‧‧Text exploration program
40‧‧‧Display screen
41, 61~68‧‧‧Analysis result screen
42‧‧‧Analysis setting screen
51‧‧‧Data designation screen
52‧‧‧Purpose designation screen
53‧‧‧Synonym list selection screen
54‧‧‧Compound word list selection screen
m‧‧‧Number of groups (number of clusters)
n‧‧‧Maximum number of data items per group
W1~W6‧‧‧Word
FIG. 1 is a block diagram showing the configuration of a text exploration device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a computer that functions as the text exploration device shown in FIG. 1.
FIG. 3 is a diagram showing the display screen of the text exploration device shown in FIG. 1.
FIG. 4 is a flowchart showing the operation of the text exploration device shown in FIG. 1.
FIG. 5 is a flowchart of the screen data generation processing of the text exploration device shown in FIG. 1.
FIG. 6 is a diagram showing the data designation screen of the text exploration device shown in FIG. 1.
FIG. 7 is a diagram showing an example of text data input to the text exploration device shown in FIG. 1.
FIG. 8 is a diagram showing the purpose designation screen of the text exploration device shown in FIG. 1.
FIG. 9 is a diagram showing the synonym list selection screen of the text exploration device shown in FIG. 1.
FIG. 10 is a diagram showing the compound word list selection screen of the text exploration device shown in FIG. 1.
FIG. 11A is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 before an analysis target period is set.
FIG. 11B is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 after an analysis target period is set.
FIG. 12A is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 before word exclusion.
FIG. 12B is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 after word exclusion.
FIG. 13A is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 before synonym registration.
FIG. 13B is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 after synonym registration.
FIG. 14A is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 before compound word registration.
FIG. 14B is a diagram showing the analysis result screen of the text exploration device shown in FIG. 1 after compound word registration.
FIG. 15 is a diagram showing an example of a dendrogram.
FIG. 16 is a diagram showing the dendrogram of FIG. 15 with the number of clusters set.
FIG. 17 is a diagram showing the words that appear in the drawings and their description.
A text exploration method, a text exploration program, and a text exploration device according to an embodiment of the present invention are described below with reference to the drawings. The text exploration method of this embodiment is typically executed on a computer. The text exploration program of this embodiment is a program for carrying out the text exploration method on a computer. The text exploration device of this embodiment is typically built from a computer. A computer that executes the text exploration program functions as the text exploration device.
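FIG. 1 is a block diagram showing the configuration of a text exploration device according to an embodiment of the present invention. The text exploration device 10 shown in FIG. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14. Text data 5 to be analyzed is input to the text exploration device 10. The text exploration device 10 performs hierarchical cluster analysis on words extracted from the input text data 5 and displays the analysis result on a screen.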
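The operation of the text exploration device 10 is outlined as follows. Instructions from the user are input to the instruction input unit 11. The text analysis unit 12 extracts words from the input text data 5 and performs hierarchical cluster analysis on the extracted words. The screen generation unit 13 generates screen data based on the analysis result of the text analysis unit 12. The analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.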
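The instructions from the user that are input to the instruction input unit 11 include setting the number of groups, setting the maximum number of data items per group, setting the analysis target period, word exclusion, synonym registration, and compound word registration. When the text data 5 is time-series data carrying information such as dates or times, the text analysis unit 12 performs hierarchical cluster analysis on the words contained in that part of the input text data 5 which falls within the analysis target period set through the instruction input unit 11.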
The screen generation unit 13 generates screen data according to the number of groups and the maximum number of data items per group (details are given later). When the user inputs a new instruction, the instructed processing is performed, after which the screen generation unit 13 generates new screen data and the analysis result display unit 14 displays a new screen. In this way, the text exploration device 10 switches how the text data 5 is analyzed and how the analysis result is displayed in response to instructions from the user.
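FIG. 2 is a block diagram showing the configuration of a computer that functions as the text exploration device 10. The computer 20 shown in FIG. 2 includes a CPU (Central Processing Unit) 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27. The main memory 22 is, for example, DRAM (Dynamic Random Access Memory). The storage unit 23 is, for example, a hard disk or a solid state drive. The input unit 24 includes, for example, a keyboard 28 and a mouse 29. The display unit 25 is, for example, a liquid crystal display. The communication unit 26 is an interface circuit for wired or wireless communication. The recording medium reading unit 27 is an interface circuit for a recording medium 30 that stores programs and the like. The recording medium 30 is a non-transitory recording medium such as a CD-ROM (Compact Disc Read-Only Memory), a DVD-ROM (Digital Versatile Disc Read-Only Memory), or a USB (Universal Serial Bus) memory.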
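When the computer 20 executes the text exploration program 31, the storage unit 23 stores the text exploration program 31 and the text data 5. The text exploration program 31 and the text data 5 may, for example, be received from a server or another computer through the communication unit 26, or be read from the recording medium 30 through the recording medium reading unit 27.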
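When the text exploration program 31 is executed, the text exploration program 31 and the text data 5 are copied to the main memory 22. The CPU 21 uses the main memory 22 as working memory and processes the text data 5 stored in the main memory 22 by executing the text exploration program 31 stored there. The computer 20 then functions as the text exploration device 10. The configuration of the computer 20 described above is only an example; the text exploration device 10 may be built from any computer.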
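In the following, the text data 5 is assumed to be Japanese data containing Japanese words. FIG. 17 shows the words that appear in the drawings and their description. Each row of FIG. 17 lists a word (a Japanese word) and its meaning. Where a Japanese word is mentioned in the description below, its meaning is sometimes given in parentheses after the word. The text data 5 may, however, be data in any language.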
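FIG. 3 is a diagram showing the display screen of the text exploration device 10. The display screen 40 shown in FIG. 3 includes an analysis result screen 41 and an analysis setting screen 42.
The analysis result screen 41 shows the analysis result of the text analysis unit 12. The analysis setting screen 42 shows GUI (Graphical User Interface) components for setting how the text analysis unit 12 performs its analysis and the characteristics of the screen data generated by the screen generation unit 13.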
Once the number of clusters is set for the result of hierarchical cluster analysis, the words included in each cluster are determined. When displaying on the screen the result of hierarchical cluster analysis of the words extracted from the text data 5, the text exploration device 10 displays groups corresponding to the clusters in the manner shown in FIG. 3, instead of a dendrogram.
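In the following description, a cluster displayed on the screen is also called a group. Using the instruction input unit 11, the user specifies the number of groups (the number of clusters) and the maximum number of data items per group (the upper limit on the number of words a group contains). Below, the former is denoted m and the latter n.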
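In the text exploration device 10, the words contained in the text data 5 are classified into m clusters, each containing one or more words. The analysis result screen 41 shows m groups, with words displayed inside each group. A group is drawn as a cloud-shaped figure, and each word in the group is displayed inside an elliptical region. The number of words in each group is limited to n or fewer. For example, if n = 5 and a cluster contains 10 words, 5 words are displayed inside the corresponding group on the analysis result screen 41.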
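The cap of n words per group can be sketched in a few lines of Python. The function name, the dictionary of frequencies, and the placeholder words W1 to W10 are assumptions for illustration; choosing the most frequent words first follows the second aspect of the invention, and the alphabetical tie-break is an added assumption.

```python
def limit_group(cluster_words, frequency, n):
    """Return at most n words of one cluster, most frequent first.

    Ties are broken alphabetically (an assumption; the text does not say).
    """
    ranked = sorted(cluster_words, key=lambda w: (-frequency[w], w))
    return ranked[:n]


# Hypothetical cluster of 10 placeholder words with frequencies 10, 9, ..., 1.
cluster = [f"W{i}" for i in range(1, 11)]
frequency = {w: 11 - i for i, w in enumerate(cluster, start=1)}

shown = limit_group(cluster, frequency, 5)
print(shown)  # five of the ten words remain, as in the n = 5 example
```

A cluster with fewer than n words is returned unchanged, which matches the statement that each group shows n or fewer words.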
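The analysis setting screen 42 shows a first slider and two first buttons (marked "+" and "-") for setting the number of groups m, a second slider and two second buttons for setting the maximum number of data items per group n, and four boxes and two third buttons (marked with a left arrow and a right arrow) for setting the analysis target period.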
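The user specifies the number of groups m by operating the mouse 29 to move the thumb of the first slider left or right or to press a first button. The number of groups m increases when the first button marked "+" is pressed and decreases when the first button marked "-" is pressed. The initial value of m is set, for example, to the square root of the number of distinct words in the analysis result of the text analysis unit 12, or to an integer close to that square root. For example, when the analysis result contains 16 distinct words, the initial value of m is set to 4.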
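The initial value of m and the effect of the "+" and "-" buttons can be sketched as follows. Rounding to the nearest integer is one reading of "an integer close to that square root", and clamping m between 1 and the number of distinct words is an assumption; neither detail, nor the function names, comes from the specification.

```python
import math


def initial_group_count(distinct_words: int) -> int:
    """Initial m: the square root of the number of distinct words, rounded (assumed)."""
    return max(1, round(math.sqrt(distinct_words)))


def step_group_count(m: int, delta: int, distinct_words: int) -> int:
    """Apply a "+" (delta = 1) or "-" (delta = -1) press, keeping m in a valid range.

    The range 1..distinct_words is an assumption made for this sketch.
    """
    return min(max(m + delta, 1), max(distinct_words, 1))


m = initial_group_count(16)
print(m)                            # 4, as in the example in the text
print(step_group_count(m, +1, 16))  # 5
```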
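The user specifies the maximum number of data items per group n by operating the mouse 29 to move the thumb of the second slider left or right or to press a second button. The value of n increases or decreases when a second button is pressed. The initial value of n is set, for example, to 5.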
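When the text data 5 is time-series data, the user specifies the analysis target period by operating the keyboard 28 or the mouse 29, either entering the date and time in the four boxes or pressing a third button. When the third button marked with a left arrow is pressed, the analysis target period moves toward the past by a predetermined amount (for example, one month); when the third button marked with a right arrow is pressed, it moves by the same amount in the opposite direction. The initial value of the analysis target period is set, for example, to the period from the oldest time to the newest time in the text data 5. When the text data 5 is not time-series data, the user cannot specify an analysis target period.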
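The behavior of the third buttons can be illustrated with a short Python sketch that shifts both ends of the analysis target period by a whole number of months. The names `shift_month` and `shift_period` are hypothetical, and clamping the day to the end of a shorter month is an assumption, since the description does not say how such dates are handled.

```python
import calendar
from datetime import datetime


def shift_month(moment: datetime, months: int) -> datetime:
    """Move a date by a number of months (negative values move toward the past)."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    # Assumption: clamp the day when the target month is shorter.
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def shift_period(start: datetime, end: datetime, months: int):
    """Shift both ends of the analysis target period by the same amount."""
    return shift_month(start, months), shift_month(end, months)
```

Pressing the left-arrow button would then correspond to `shift_period(start, end, -1)` and the right-arrow button to `shift_period(start, end, 1)`.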
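The analysis result screen 41 displays at least one and at most m groups, and each group contains at least one and at most n words. On the screen, a group is displayed larger the greater the total occurrence frequency of the words in its corresponding cluster. When a cluster contains more than n words, the n words with the highest occurrence frequencies are displayed inside the group. A word in a group, and the elliptical region containing it, are displayed larger the higher the word's occurrence frequency. Each group is labeled with a name, which is the word with the highest occurrence frequency in the cluster. The name of a group is displayed inside the group with an underline. When a word cannot be displayed inside its elliptical region, the symbol "..." is displayed in place of the word.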
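The analysis result screen 41 displays a third slider and two fourth buttons (marked "+" and "-") for specifying the zoom ratio. The user sets the zoom ratio by operating the mouse 29 to move the thumb of the third slider left or right or to press a fourth button. On the analysis result screen 41, the groups and the words they contain are enlarged or reduced according to the zoom ratio that has been set. The initial zoom ratio is 100%. In its initial state, the analysis result screen 41 displays all of the groups.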
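When the user changes the number of groups m, the maximum number of data items in a group n, or the analysis target period on the analysis setting screen 42, the content of the analysis result screen 41 changes accordingly. When the user instructs word exclusion, synonym registration, or compound word registration on the analysis result screen 41, the content of the analysis result screen 41 likewise changes accordingly.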
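When performing hierarchical cluster analysis on the words extracted from the text data 5, the text mining device 10 refers to an excluded-word list storing words to be excluded, a synonym list storing words to be treated as synonyms, and a compound word list storing words to be treated as compound words. In the synonym list, a plurality of words having the same (or nearly the same) meaning are stored in association with one word that represents them. In the compound word list, a plurality of words that form one compound word when joined are stored in association with the compound word that joins them. For example, "daigakusei (university student)" and "gakusei (student)" are stored in the synonym list in association with "daigakusei", which represents both. Likewise, "nintai (patience)" and "tsuyoi (strong)" are stored in the compound word list in association with "nintaizuyoi (highly patient)", which joins the two. The text mining device 10 may have a plurality of synonym lists and a plurality of compound word lists.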
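FIG. 4 is a flowchart showing the operation of the text mining device 10. FIG. 5 is a flowchart showing the details of the screen data generation process (step S111 in FIG. 4) of the text mining device 10. The input unit 24 and the CPU 21 executing step S113 function as the instruction input unit 11. The CPU 21 executing steps S109 to S110 functions as the text analysis unit 12. The CPU 21 executing step S111 functions as the screen generation unit 13. The display unit 25 and the CPU 21 executing step S112 function as the analysis result display unit 14. The operation of the text mining device 10 is described below with reference to FIGS. 4 and 5.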
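First, the CPU 21 causes the display unit 25 to display the data designation screen 51 shown in FIG. 6 (step S101). The data designation screen 51 displays a box for specifying a file name and a box for specifying a folder name. The user designates the text data 5 to be analyzed by specifying a file name or a folder name on the data designation screen 51. The text data 5 may be stored in the storage unit 23, such as a hard disk, or on a server or another computer connected through the communication unit 26.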
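Next, the CPU 21 transfers the text data 5 designated on the data designation screen 51 to the main memory 22. The text data 5 is thereby input to the text mining device 10 (step S102). FIG. 7 shows an example of the text data 5. The text data shown in FIG. 7 is data of reports written by university students and is time-series data carrying date information. From the top, the entries shown in FIG. 7 read: "Regarding the relationship between university students and society in this course...", "In general, after graduating and before entering society, university students work part-time or...", "We students need to be aware that we are studying after paying expensive tuition...", and "Student life is a precious time for building self-confidence. Moreover...". The text data 5 analyzed by the text mining device 10 may be of any kind.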
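Next, the CPU 21 causes the display unit 25 to display the purpose designation screen 52 shown in FIG. 8 (step S103). The purpose designation screen 52 displays three radio buttons corresponding to content, feature, and evaluation. The user selects an analysis purpose from among content, feature, and evaluation by operating the mouse 29 to press one of the radio buttons. The CPU 21 then receives the analysis purpose designated on the purpose designation screen 52. The analysis purpose is thereby input to the text mining device 10 (step S104).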
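Next, the CPU 21 causes the display unit 25 to display the synonym list selection screen 53 shown in FIG. 9 (step S105). The synonym list selection screen 53 displays the names of the synonym lists held by the text mining device 10 and the synonyms registered in each list. The user designates the synonym list to be used by operating the mouse 29 to select one of the lists on the synonym list selection screen 53. A synonym list is thereby selected in the text mining device 10 (step S106).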
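Next, the CPU 21 causes the display unit 25 to display the compound word list selection screen 54 shown in FIG. 10 (step S107). The compound word list selection screen 54 displays the names of the compound word lists held by the text mining device 10 and the compound words registered in each list. The user designates the compound word list to be used by operating the mouse 29 to select one of the lists on the compound word list selection screen 54. A compound word list is thereby selected in the text mining device 10 (step S108).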
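Next, taking the excluded-word list, the synonym list, and the compound word list into account, the CPU 21 extracts words of the kinds corresponding to the analysis purpose designated in step S104 from the portion of the text data 5 input in step S102 that falls within the analysis target period (step S109). When the analysis purpose is "content", the CPU 21 extracts nouns, proper nouns, place names, and personal names from the text data 5. When the analysis purpose is "feature", the CPU 21 extracts nouns, proper nouns, sa-row irregular conjugation nouns, and verbs. When the analysis purpose is "evaluation", the CPU 21 extracts adjectives, adjectival verbs, and interjections. The text mining device 10 may also support analysis purposes other than these three, and the CPU 21 may extract kinds of words different from those above for each analysis purpose.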
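The correspondence between analysis purposes and extracted word kinds can be sketched as a lookup table. The following Python example is only an illustration: the tag names (such as `"sa_verbal_noun"`), the table name `POS_BY_PURPOSE`, and the function `filter_by_purpose` are hypothetical, and the input is assumed to have already been split into (word, part-of-speech) pairs by some morphological analyzer.

```python
# Word kinds extracted for each analysis purpose, following the description.
# The tag strings are hypothetical labels, not the output of a real tagger.
POS_BY_PURPOSE = {
    "content": {"noun", "proper_noun", "place_name", "person_name"},
    "feature": {"noun", "proper_noun", "sa_verbal_noun", "verb"},
    "evaluation": {"adjective", "adjectival_verb", "interjection"},
}


def filter_by_purpose(tagged_words, purpose):
    """Keep only the words whose part of speech matches the analysis purpose."""
    allowed = POS_BY_PURPOSE[purpose]
    return [word for word, pos in tagged_words if pos in allowed]
```

Keeping the mapping in a table makes it straightforward to add further analysis purposes or to change the word kinds for an existing purpose, both of which the description allows.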
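When the text data 5 is time-series data, the CPU 21, in executing step S109, extracts words only from the portion of the text data 5 that falls within the analysis target period specified by the user. When a word W1 is stored in the excluded-word list, the CPU 21 completely ignores the word W1 in the text data 5. When words W2 and W3 are stored in the selected synonym list in association with the word W2 representing both, the CPU 21 treats every occurrence of the word W3 in the text data 5 as the word W2. When words W4 and W5 are stored in the selected compound word list in association with a word W6 joining the two, the CPU 21 treats every adjacent occurrence of the words W4 and W5 in the text data 5 as the word W6.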
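The handling of the three lists in step S109 can be sketched as a normalization pass over a token sequence. This Python example rests on several assumptions: the function name `normalize_tokens` is hypothetical, compound words are limited to two adjacent parts, and compounds are merged before synonyms and exclusions are applied, an order the description does not specify.

```python
def normalize_tokens(tokens, excluded, synonyms, compounds):
    """Apply the compound word, synonym, and excluded-word lists to a token list.

    excluded:  set of words to ignore (W1)
    synonyms:  dict mapping a word (W3) to its representative (W2)
    compounds: dict mapping an adjacent pair (W4, W5) to the joined word (W6)
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            merged.append(compounds[pair])  # treat W4 W5 as W6
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    result = []
    for word in merged:
        word = synonyms.get(word, word)  # treat W3 as W2
        if word not in excluded:         # ignore W1 entirely
            result.append(word)
    return result
```

For example, with "shakai" excluded, "gakusei" mapped to "daigakusei", and ("nintai", "tsuyoi") mapped to "nintaizuyoi", the sequence gakusei, nintai, tsuyoi, shakai, daigakusei becomes daigakusei, nintaizuyoi, daigakusei.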
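Next, the CPU 21 performs hierarchical cluster analysis on the words extracted in step S109 (step S110). In step S110, the CPU 21 obtains the similarity between two words based, for example, on the distance between them in the text data 5 (how far apart the two words appear). Based on the similarities obtained, the CPU 21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the centroid method, or Ward's method). In step S110, the CPU 21 also obtains the occurrence frequency of each word.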
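To make step S110 concrete, here is a minimal pure-Python sketch of the shortest distance (single linkage) method. It is an illustration under stated assumptions, not the implementation of the device: the distance between two words is taken to be the smallest gap between their positions in a token sequence, and the function names `word_distances` and `cluster_words` are hypothetical. Merging until m clusters remain is equivalent to cutting the single-linkage dendrogram at m clusters.

```python
from itertools import combinations


def word_distances(tokens):
    """Distance between two words: smallest gap between their positions (assumption)."""
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)
    distances = {}
    for a, b in combinations(sorted(positions), 2):
        distances[(a, b)] = min(abs(i - j) for i in positions[a] for j in positions[b])
    return distances


def cluster_words(distances, m):
    """Single-linkage agglomerative clustering, stopped when m clusters remain."""
    words = sorted({word for pair in distances for word in pair})
    clusters = [{word} for word in words]

    def gap(first, second):
        # Single linkage: distance between the closest members of two clusters.
        return min(distances[tuple(sorted((a, b)))] for a in first for b in second)

    while len(clusters) > m:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] |= clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters
```

Replacing `min` with `max` inside `gap` would give the longest distance method; the other methods named in the description need different update rules.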
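Next, based on the result of the hierarchical cluster analysis obtained in step S110, the CPU 21 generates screen data for displaying the analysis result (step S111). In step S111, the CPU 21 performs the process shown in FIG. 5.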
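The CPU 21 sets the number of groups to m and the maximum number of data items in a group to n (step S201). Next, the CPU 21 sets the number of clusters to m for the result of the hierarchical cluster analysis and obtains m clusters (step S202). Next, for each cluster, the CPU 21 obtains the total occurrence frequency of the words in the cluster (step S203). Next, the CPU 21 determines the display size of each group based on the totals obtained in step S203 (step S204). In step S204, the larger the total occurrence frequency of the words in a cluster, the larger the display size determined for the group.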
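Next, for each cluster, the CPU 21 selects the words to be displayed from among the words in the cluster (step S205). In step S205, n or fewer words are selected from each cluster in descending order of occurrence frequency. Next, for each word selected in step S205, the CPU 21 determines the display size of the word based on its occurrence frequency (step S206). In step S206, the higher a word's occurrence frequency, the larger the display size determined for it.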
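Next, the CPU 21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207). The screen data generated in step S207 contains m groups (drawn as cloud-shaped figures) with the sizes determined in step S204. Each group contains n or fewer words with the sizes determined in step S206, and on the screen the words are displayed inside their group. After executing step S207, the CPU 21 ends the screen data generation process.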
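Steps S203 to S206 can be summarized in a short Python sketch that turns clusters and word frequencies into per-group display data. The function name `build_groups` and the dictionary layout are hypothetical. The sketch returns frequencies rather than pixel sizes, because the description only states that sizes grow with frequency and does not give a scaling formula; breaking frequency ties alphabetically is also an assumption.

```python
def build_groups(clusters, frequency, n):
    """Derive the data needed to draw each group from its cluster.

    clusters:  iterable of sets of words (one set per cluster)
    frequency: dict mapping each word to its occurrence frequency
    n:         maximum number of words shown per group
    """
    groups = []
    for cluster in clusters:
        # Descending frequency; alphabetical order breaks ties (assumption).
        ranked = sorted(cluster, key=lambda word: (-frequency[word], word))
        groups.append({
            "name": ranked[0],  # most frequent word labels the group
            "total_frequency": sum(frequency[word] for word in cluster),  # drives group size (S203, S204)
            "words": [(word, frequency[word]) for word in ranked[:n]],    # at most n words (S205, S206)
        })
    return groups
```

A renderer would then map `total_frequency` and each word's frequency to display sizes using any monotonically increasing function.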
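Next, the CPU 21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112). Next, the CPU 21 receives an instruction from the user (step S113). Then, according to the kind of instruction received in step S113, the CPU 21 proceeds to one of steps S115 to S120 (step S114).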
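When the instruction received in step S113 is "set the number of groups", the CPU 21 proceeds to step S115. In this case, the CPU 21 sets the number of groups m to the value specified by the user (step S115) and proceeds to step S111. Screen data is then generated based on the new number of groups m, and a new screen is displayed. An analysis result screen containing the specified number of groups is thereby displayed.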
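When the instruction received in step S113 is "set the maximum number of data items in a group", the CPU 21 proceeds to step S116. In this case, the CPU 21 sets the maximum number of data items in a group n to the value specified by the user (step S116) and proceeds to step S111. Screen data is then generated based on the new value of n, and a new screen is displayed. An analysis result screen in which the number of words in each group is limited to the specified value or fewer is thereby displayed.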
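When the instruction received in step S113 is "set the analysis target period", the CPU 21 proceeds to step S117. In this case, the CPU 21 sets the analysis target period to the period specified by the user (step S117) and proceeds to step S109. Hierarchical cluster analysis is then performed with reference to the new analysis target period, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis on the words contained in the text data within the specified analysis target period is thereby displayed on the screen.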
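FIG. 11A shows an analysis result screen before the analysis target period is set. FIG. 11B shows an analysis result screen after the analysis target period is set. The analysis result screen 61 before setting, shown in FIG. 11A, displays the result of hierarchical cluster analysis on the words contained in the portion of the input text data 5 from 0:00 on January 1, 2014 to 24:00 on December 31, 2015. The analysis result screen 62 after setting, shown in FIG. 11B, displays the result for the portion from 0:00 on March 1, 2014 to 24:00 on September 30, 2014. The content displayed on the analysis result screen 61 differs from that displayed on the analysis result screen 62. By viewing the analysis result screens before and after the analysis target period is set, the user can easily recognize how the result of the hierarchical cluster analysis changes over time.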
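When the instruction received in step S113 is "word exclusion", the CPU 21 proceeds to step S118. In this case, the CPU 21 adds the specified word to the excluded-word list (step S118) and proceeds to step S109. Hierarchical cluster analysis is then performed with the specified word excluded, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis with the specified word excluded is thereby displayed on the screen.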
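FIG. 12A shows an analysis result screen before word exclusion. FIG. 12B shows an analysis result screen after word exclusion. The user operates the mouse 29 to select the word to be excluded and then instructs word exclusion. On the analysis result screen 63 before word exclusion, shown in FIG. 12A, "shakai (society)" is selected and "word exclusion" is chosen from the menu. The result of hierarchical cluster analysis with "shakai" excluded is then displayed on the screen. On the analysis result screen 64 after word exclusion, shown in FIG. 12B, "shingaku (advancement to higher education)" is displayed in place of "shakai". Among the words in the same cluster as "shakai", "shingaku" is the word with the highest occurrence frequency after the five words displayed on the analysis result screen 63.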
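When the instruction received in step S113 is "synonym registration", the CPU 21 proceeds to step S119. In this case, the CPU 21 adds the specified words to the synonym list in use (step S119) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the specified synonyms into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis with the specified words treated as synonyms is thereby displayed on the screen.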
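FIG. 13A shows an analysis result screen before synonym registration. FIG. 13B shows an analysis result screen after synonym registration. The user operates the mouse 29 to select the words to be registered as synonyms and then instructs synonym registration. On the analysis result screen 65 before synonym registration, shown in FIG. 13A, "daigakusei (university student)" and "gakusei (student)" are selected and "synonym registration" is chosen from the menu. The result of hierarchical cluster analysis with "daigakusei" and "gakusei" treated as synonyms is then displayed on the screen. On the analysis result screen 66 after synonym registration, shown in FIG. 13B, "shingaku (advancement to higher education)" is displayed in place of "gakusei", and "daigakusei" is displayed at a larger size than on the analysis result screen 65, because its size is now based on the total of the occurrence frequencies of "daigakusei" and "gakusei".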
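When the instruction received in step S113 is "compound word registration", the CPU 21 proceeds to step S120. In this case, the CPU 21 adds the specified words to the compound word list in use (step S120) and proceeds to step S109. Hierarchical cluster analysis is then performed taking the specified compound word into account, screen data for displaying the new analysis result is generated, and a new screen is displayed. The result of hierarchical cluster analysis with the specified words treated as a compound word is thereby displayed on the screen.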
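FIG. 14A shows an analysis result screen before compound word registration. FIG. 14B shows an analysis result screen after compound word registration. The user operates the mouse 29 to select the words to be registered as a compound word and then instructs compound word registration. On the analysis result screen 67 before compound word registration, shown in FIG. 14A, "nintai (patience)" and "tsuyoi (strong)" are selected and "compound word registration" is chosen from the menu. The result of hierarchical cluster analysis with "nintai" and "tsuyoi" treated as a compound word is then displayed on the screen. On the analysis result screen 68 after compound word registration, shown in FIG. 14B, "nintaizuyoi (highly patient)" is displayed in place of "nintai" and "tsuyoi", at a size no larger than theirs.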
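The instruction handling of steps S113 to S120 can be sketched as a dispatch table that records which step the flow returns to after each kind of instruction. In this Python illustration the instruction names, the `state` dictionary layout, and the function `handle_instruction` are all hypothetical; only the step numbers and the step each branch returns to are taken from the description.

```python
# Step to resume from after each kind of instruction (per steps S115 to S120).
RESUME_STEP = {
    "set_group_count": "S111",    # S115: regenerate screen data only
    "set_max_words": "S111",      # S116: regenerate screen data only
    "set_period": "S109",         # S117: re-extract words and re-cluster
    "exclude_word": "S109",       # S118
    "register_synonym": "S109",   # S119
    "register_compound": "S109",  # S120
}


def handle_instruction(state, kind, value):
    """Apply one user instruction to `state` and return the step to resume from."""
    if kind == "set_group_count":
        state["m"] = value
    elif kind == "set_max_words":
        state["n"] = value
    elif kind == "set_period":
        state["period"] = value  # hypothetical (start, end) pair
    elif kind == "exclude_word":
        state["excluded"].add(value)
    elif kind == "register_synonym":
        words, representative = value
        for word in words:
            if word != representative:
                state["synonyms"][word] = representative
    elif kind == "register_compound":
        parts, compound = value
        state["compounds"][tuple(parts)] = compound
    else:
        raise ValueError(f"unknown instruction: {kind}")
    return RESUME_STEP[kind]
```

The table makes the distinction in the flowchart explicit: changing m or n only requires regenerating the screen data, whereas changing the period or any of the word lists requires repeating word extraction and the cluster analysis.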
As described above, the text mining method of this embodiment includes: a text analysis step of performing hierarchical cluster analysis on words extracted from input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data. Based on the number of groups m and the maximum number of data items in a group n, the screen generation step obtains m clusters from the analysis result and generates screen data for displaying on the screen groups that each contain n or fewer of the words in the corresponding cluster. With the text mining method of this embodiment, groups containing the words of the clusters can be displayed on the screen based on the result of hierarchical cluster analysis on the words in the text data. In addition, the number of words in a group is limited to n or fewer. The user can therefore understand the result of the hierarchical cluster analysis intuitively on seeing the screen.
The words in a group are selected from the words in the cluster corresponding to the group, in descending order of occurrence frequency. The words with high occurrence frequencies in the cluster are therefore displayed inside the group, and the user can easily recognize the frequently occurring words in each cluster. On the screen, a group has a size corresponding to the total occurrence frequency of the words in its corresponding cluster, so the user can easily recognize the clusters whose words have a large total occurrence frequency. Likewise, a word in a group has a size on the screen corresponding to its occurrence frequency, so the user can easily recognize the frequently occurring words.
The text mining method also includes an instruction input step of receiving instructions from the user, and either the text analysis step or the screen generation step is executed according to the instruction received in the instruction input step. The way the result of the hierarchical cluster analysis is displayed can therefore be switched according to the user's instructions. In particular, the instruction input step receives an instruction setting the number of groups m, and the screen generation step generates screen data based on the number of groups m so specified. The number of regions (the number of clusters) displayed on the screen is thereby switched according to the user's instruction. The instruction input step also receives the maximum number of data items in a group n, and the screen generation step generates screen data based on the value of n so specified. The number of words displayed in each region is thereby switched according to the user's instruction.
The instruction input step also receives an instruction specifying the analysis target period, and the text analysis step performs hierarchical cluster analysis on the words contained in the portion of the text data that falls within the analysis target period so specified. The result of hierarchical cluster analysis on the words in the text data within the period specified by the user is thus displayed on the screen, and the user can easily recognize how the result changes over time. The instruction input step also receives an instruction setting the analysis purpose, and the text analysis step extracts from the text data 5 words of the kinds corresponding to the analysis purpose so set and performs hierarchical cluster analysis on them. The kinds of words analyzed can thereby be switched according to the analysis purpose specified by the user, and the result of the hierarchical cluster analysis can be displayed on the screen.
The instruction input step also receives a word exclusion instruction, and the text analysis step performs hierarchical cluster analysis with the word specified in the instruction input step excluded. The result of hierarchical cluster analysis with the user-specified word excluded can thereby be displayed. The instruction input step also receives a synonym registration instruction, and the text analysis step performs hierarchical cluster analysis treating the plurality of words specified in the instruction input step as the same word. The result of hierarchical cluster analysis with the user-specified words treated as the same word can thereby be displayed on the screen. The instruction input step also receives a compound word registration instruction, and the text analysis step performs hierarchical cluster analysis with the plurality of words specified in the instruction input step merged into one word. The result of hierarchical cluster analysis with the user-specified words merged into one word can thereby be displayed on the screen.
The screen generation step generates screen data for displaying an analysis result screen containing the groups and an analysis setting screen for setting how the analysis result screen is displayed. The analysis result screen and the analysis setting screen are therefore both displayed, and the user can easily switch the way the result of the hierarchical cluster analysis is displayed by using the analysis setting screen.
The text mining program 31 and the text mining device 10 of this embodiment have the same features as the text mining method of this embodiment and provide the same effects.
With the text mining method, text mining program, and text mining device of this embodiment, groups each containing no more than the maximum number of words from the corresponding cluster can be displayed on the screen, based on the result of hierarchical cluster analysis on the words in the text data. The user can therefore understand the result of the hierarchical cluster analysis intuitively on seeing the screen.
This application claims priority based on Japanese Patent Application No. 2016-145065, entitled "Text Mining Method, Text Mining Program, and Text Mining Device" and filed on July 25, 2016, the content of which is incorporated herein by reference.
40‧‧‧Display screen
41‧‧‧Analysis result screen
42‧‧‧Analysis setting screen
Claims (25)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2016145065A JP6794162B2 (en) | 2016-07-25 | 2016-07-25 | Text mining methods, text mining programs, and text mining equipment |
JP2016-145065 | 2016-07-25 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201807597A TW201807597A (en) | 2018-03-01 |
TWI686716B (en) | 2020-03-01 |
Family
ID=61015910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW106122011A TWI686716B (en) | 2016-07-25 | 2017-06-30 | Text exploration method, computer-readable recording medium and text exploration device recorded with text exploration program |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP6794162B2 (en) |
KR (1) | KR102180487B1 (en) |
CN (1) | CN109478191B (en) |
TW (1) | TWI686716B (en) |
WO (1) | WO2018020842A1 (en) |
2016
- 2016-07-25 JP JP2016145065A patent/JP6794162B2/en active Active
2017
- 2017-06-06 WO PCT/JP2017/020922 patent/WO2018020842A1/en active Application Filing
- 2017-06-06 KR KR1020197000933A patent/KR102180487B1/en active IP Right Grant
- 2017-06-06 CN CN201780043375.8A patent/CN109478191B/en active Active
- 2017-06-30 TW TW106122011A patent/TWI686716B/en active
Also Published As
Publication number | Publication date |
---|---|
KR20190018480A (en) | 2019-02-22 |
KR102180487B1 (en) | 2020-11-18 |
TW201807597A (en) | 2018-03-01 |
WO2018020842A1 (en) | 2018-02-01 |
JP6794162B2 (en) | 2020-12-02 |
CN109478191A (en) | 2019-03-15 |
CN109478191B (en) | 2022-04-08 |
JP2018018118A (en) | 2018-02-01 |