CN109478191B

CN109478191B - Text mining method, recording medium, and text mining device

Info

Publication number: CN109478191B
Application number: CN201780043375.8A
Authority: CN
Inventors: 秋田正史; 中村康则; 周景龙
Original assignee: Screen Holdings Co Ltd
Current assignee: Screen Holdings Co Ltd
Priority date: 2016-07-25
Filing date: 2017-06-06
Publication date: 2022-04-08
Anticipated expiration: 2037-06-06
Also published as: CN109478191A; JP2018018118A; TW201807597A; KR20190018480A; TWI686716B; KR102180487B1; WO2018020842A1; JP6794162B2

Abstract

In the text analysis steps (S109 to S110), hierarchical cluster analysis is performed on the words extracted from the input text data. In the screen generation step (S111), based on the number of groups (m) and the maximum number of data (n) in a group, m clusters are obtained from the analysis result of the text analysis steps, and screen data is generated for displaying on the screen groups each including n or fewer words belonging to the corresponding cluster. In the analysis result display step (S112), a screen is displayed based on the generated screen data. In this way, the results of the hierarchical cluster analysis are displayed on the screen so that the user can intuitively understand the results.

Description

Text mining method, recording medium, and text mining device

Technical Field

The present invention relates to text mining, and more particularly, to a text mining method, a recording medium, and a text mining device for displaying an analysis result of text data on a screen.

Background

In recent years, text mining, which analyzes a large amount of text data written in a free form and obtains useful information from the analysis result, has been attracting attention. In text mining, for example, words are extracted from the text data to be analyzed, and information is obtained by analyzing the appearance frequency and tendency of the words.

In the following, a text mining device that performs hierarchical cluster analysis on words extracted from text data and displays the analysis results on a screen is considered. In hierarchical cluster analysis, clusters containing words with high similarity are created hierarchically according to the similarity between words. Generally, the result of hierarchical cluster analysis is presented to the user (analyst) using a tree diagram (dendrogram) as shown in fig. 15.

In connection with the present invention, patent document 1 describes a grouping apparatus having a hierarchical grouping unit that constructs a tree diagram, searches the tree diagram, generates an index that can be specified from a lower layer to an upper layer, and stores the index in a storage unit. Patent document 2 describes an inquiry providing device including: a distance matrix calculation unit which calculates the distance between the keywords, generates distance matrix data which can search the distance between the keywords and stores the distance matrix data in a storage unit; and a clustering unit hierarchically clustering the keywords using the distance matrix and storing the hierarchically clustered keywords in the storage unit as a bottom-up index of a tree diagram constructed by searching from a lower layer to an upper layer.

Documents of the prior art

Patent document

Patent document 1: Japanese Patent Laid-Open Publication No. 2011-216021

Patent document 2: Japanese Patent Laid-Open Publication No. 2012-150539

Disclosure of Invention

Problems to be solved by the invention

A conventional text mining device displays the result of hierarchical cluster analysis on a screen using a tree diagram. However, such a text mining device has a problem in that the user cannot intuitively understand the analysis result. For example, in the analysis result shown in fig. 15, when the user sets the number of clusters to 4, a division line is set on the tree diagram as shown in fig. 16. However, the user cannot intuitively recognize the words included in each cluster merely by looking at such a tree diagram. Further, when the number of words is large and the number of clusters is changed, the user cannot intuitively grasp how the words included in each cluster change.

Further, since the tree diagram does not show the appearance frequency of words, the user cannot tell which words are important. In addition, when the text data to be analyzed is time-series data having date and time information (year, month, day, and time), the user may wish to know how the analysis result changes over time. However, the conventional text mining device cannot satisfy this wish.

Therefore, an object of the present invention is to provide a text mining method, a text mining program, and a text mining device that display results of hierarchical cluster analysis on a screen so that a user can intuitively understand the results.

Means for solving the problems

A 1st embodiment of the present invention is a text mining method for displaying an analysis result of text data on a screen, including:

a text analysis step of performing hierarchical cluster analysis on words (which may be single words and/or phrases) extracted from the input text data,

a screen generation step of generating screen data based on the analysis result in the text analysis step, and

an analysis result display step of displaying a screen based on the screen data;

in the screen generation step, based on the number of groups and the maximum number of data in a group, clusters equal in number to the number of groups are obtained from the analysis result, and screen data is generated for displaying on the screen groups each including words that belong to the corresponding cluster, limited to the maximum number of data or fewer.

A 2nd embodiment of the present invention is characterized in that, in the 1st embodiment of the present invention,

the words included in the group are selected from the words belonging to the cluster corresponding to the group in descending order of the frequency of occurrence.

Embodiment 3 of the present invention is characterized in that, in embodiment 2 of the present invention,

in the screen, the group has a size corresponding to a total value of appearance frequencies of words belonging to the cluster corresponding to the group.

Embodiment 4 of the present invention is characterized in that, in embodiment 3 of the present invention,

in the screen, the group includes words having a size corresponding to the frequency of appearance of the words.

A 5th embodiment of the present invention is characterized in that, in the 1st embodiment of the present invention,

further comprising an instruction input step for inputting an instruction from a user,

either one of the text analysis step and the screen generation step is executed in accordance with the instruction input in the instruction input step.

Embodiment 6 of the present invention is characterized in that, in embodiment 5 of the present invention,

in the instruction input step, a setting instruction of the number of groups is received,

in the screen generating step, the screen data is generated based on the number of groups set in the instruction inputting step.

Embodiment 7 of the present invention is characterized in that, in embodiment 5 of the present invention,

in the instruction input step, a setting instruction of the maximum data number is received,

in the screen generating step, the screen data is generated based on the maximum number of data set in the instruction inputting step.

Embodiment 8 of the present invention is characterized in that, in embodiment 5 of the present invention,

in the instruction input step, a setting instruction of the period to be analyzed is received,

in the text analysis step, the hierarchical cluster analysis is performed on words included in the text data in the analysis target period set in the instruction input step, among the text data.

A 9th embodiment of the present invention is characterized in that, in the 5th embodiment of the present invention,

in the instruction input step, an instruction to set an analysis target is received,

in the text analysis step, the hierarchical cluster analysis is performed by extracting a word of a type corresponding to the analysis target set in the instruction input step from the text data.

A 10th embodiment of the present invention is characterized in that, in the 5th embodiment of the present invention,

in the instruction input step, a word exclusion instruction is received,

in the text analysis step, the hierarchical cluster analysis is performed by excluding the word indicated in the instruction input step.

An 11th embodiment of the present invention is characterized in that, in the 5th embodiment of the present invention,

in the instruction input step, a near-synonym registration instruction is received,

in the text analysis step, the hierarchical cluster analysis is performed by treating the plurality of words indicated in the instruction input step as the same word.

A 12th embodiment of the present invention is characterized in that, in the 5th embodiment of the present invention,

in the instruction input step, a compound word registration instruction is received,

in the text analysis step, the hierarchical cluster analysis is performed after merging the plurality of words indicated in the instruction input step into 1 word.

Embodiment 13 of the present invention is characterized in that, in embodiment 1 of the present invention,

in the screen generating step, screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen is generated.

A 14th embodiment of the present invention is a text mining program for displaying an analysis result of text data on a screen, the text mining program causing a CPU of a computer to execute, using a memory, the steps of:

a text analysis step of performing hierarchical cluster analysis on words extracted from the inputted text data,

A 15th embodiment of the present invention is characterized in that, in the 14th embodiment of the present invention,

A 16th embodiment of the present invention is characterized in that, in the 15th embodiment of the present invention,

A 17th embodiment of the present invention is characterized in that, in the 16th embodiment of the present invention,

An 18th embodiment of the present invention is characterized in that, in the 14th embodiment of the present invention,

the text mining program further causes the above-mentioned computer to execute an instruction input step for inputting an instruction from a user,

A 19th embodiment of the present invention is characterized in that, in the 14th embodiment of the present invention,

A 20th embodiment of the present invention is a text mining device that displays an analysis result of text data on a screen, the text mining device including:

a text analysis unit for performing hierarchical cluster analysis on words extracted from inputted text data,

a screen generation unit which generates screen data based on the analysis result of the text analysis unit, and

an analysis result display unit that displays a screen based on the screen data;

the screen generation unit obtains, based on the number of groups and the maximum number of data in a group, clusters equal in number to the number of groups from the analysis result, and generates screen data for displaying on the screen groups each including words that belong to the corresponding cluster, limited to the maximum number of data or fewer.

A 21st embodiment of the present invention is characterized in that, in the 20th embodiment of the present invention,

A 22nd embodiment of the present invention is characterized in that, in the 21st embodiment of the present invention,

A 23rd embodiment of the present invention is characterized in that, in the 22nd embodiment of the present invention,

A 24th embodiment of the present invention is characterized in that, in the 20th embodiment of the present invention,

the text mining device further has an instruction input unit for inputting an instruction from a user,

either one of the text analysis unit and the screen generation unit operates in accordance with an instruction input by the instruction input unit.

A 25th embodiment of the present invention is characterized in that, in the 20th embodiment of the present invention,

the screen generating unit generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.

Advantageous effects of the invention

According to the 1st, 14th, or 20th embodiment of the present invention, based on the result of performing hierarchical cluster analysis on words included in text data, groups including the words belonging to the clusters are displayed on a screen. The number of words included in each group is limited to the maximum number of data or fewer. Therefore, the user can intuitively understand the result of the hierarchical cluster analysis by looking at the screen.

According to the 2nd, 15th, or 21st embodiment of the present invention, the words having higher appearance frequencies among the words belonging to a cluster are displayed inside the group. Therefore, the user can easily recognize the frequently appearing words in each cluster.

According to embodiment 3, 16 or 22 of the present invention, the group has a size corresponding to the total value of the appearance frequencies of the words included in the cluster within the screen. Therefore, the user can easily recognize a cluster in which the aggregate value of the word appearance frequencies is large.

According to the 4th, 17th, or 23rd embodiment of the present invention, each word has a size on the screen corresponding to its appearance frequency. Therefore, the user can easily recognize the words whose appearance frequency is high.

According to the 5th, 18th, or 24th embodiment of the present invention, the display mode of the result of the hierarchical cluster analysis can be switched according to an instruction from the user.

According to embodiment 6 of the present invention, the number of groups (the number of clusters) displayed on the screen can be switched in accordance with an instruction from the user.

According to embodiment 7 of the present invention, the upper limit value of the number of words included in a group can be switched in accordance with an instruction from the user.

According to embodiment 8 of the present invention, the result of performing hierarchical cluster analysis on words included in text data in an analysis target period indicated by a user is displayed on a screen. Therefore, the user can easily recognize the time-dependent change of the result of the hierarchical cluster analysis.

According to embodiment 9 of the present invention, the word type of the analysis target can be switched according to the analysis target instructed by the user, and the result of performing hierarchical cluster analysis can be displayed on the screen.

According to embodiment 10 of the present invention, the result of hierarchical cluster analysis with the exception of the word indicated by the user can be displayed on the screen.

According to embodiment 11 of the present invention, a result of performing hierarchical cluster analysis on a plurality of words indicated by a user as identical words can be displayed on a screen.

According to embodiment 12 of the present invention, a result of merging a plurality of words instructed by a user into 1 word and performing hierarchical cluster analysis can be displayed on a screen.

According to the 13th, 19th, or 25th embodiment of the present invention, the analysis result screen and the analysis setting screen are displayed. Therefore, the user can easily switch the display mode of the result obtained by performing the hierarchical cluster analysis using the analysis setting screen.

Drawings

Fig. 1 is a block diagram showing a configuration of a text mining device according to an embodiment of the present invention.

Fig. 2 is a block diagram showing a configuration of a computer functioning as the text mining device shown in fig. 1.

Fig. 3 is a diagram showing a display screen of the text mining device shown in fig. 1.

Fig. 4 is a flowchart showing the operation of the text mining device shown in fig. 1.

Fig. 5 is a flowchart of screen data generation processing of the text mining device shown in fig. 1.

Fig. 6 is a diagram showing a data designation screen of the text mining device shown in fig. 1.

Fig. 7 is a diagram showing an example of text data input to the text mining device shown in fig. 1.

Fig. 8 is a diagram showing a target specification screen of the text mining device shown in fig. 1.

Fig. 9 is a diagram showing a near-synonym list selection screen of the text mining device shown in fig. 1.

Fig. 10 is a diagram showing a compound word list selection screen of the text mining device shown in fig. 1.

Fig. 11A is a diagram showing an analysis result screen before an analysis target period is set in the text mining device shown in fig. 1.

Fig. 11B is a diagram showing an analysis result screen after the analysis target period is set in the text mining device shown in fig. 1.

Fig. 12A is a diagram showing an analysis result screen before word exclusion is performed in the text mining device shown in fig. 1.

Fig. 12B is a diagram showing an analysis result screen after word exclusion in the text mining device shown in fig. 1.

Fig. 13A is a diagram showing an analysis result screen before near-synonym registration is performed in the text mining device shown in fig. 1.

Fig. 13B is a diagram showing an analysis result screen after near-synonym registration is performed in the text mining device shown in fig. 1.

Fig. 14A is a diagram showing an analysis result screen before compound word registration is performed in the text mining device shown in fig. 1.

Fig. 14B is a diagram showing an analysis result screen after compound word registration in the text mining device shown in fig. 1.

Fig. 15 is a diagram showing an example of a tree diagram.

Fig. 16 is a diagram showing a case where the number of clusters is set in the tree diagram shown in fig. 15.

Fig. 17 is a diagram showing words appearing in the drawings and the description thereof.

Detailed Description

Hereinafter, a text mining method, a text mining program, and a text mining device according to embodiments of the present invention will be described with reference to the drawings. The text mining method according to the present embodiment is generally executed by using a computer. The text mining program according to the present embodiment is a program for implementing a text mining method using a computer. The text mining device according to the present embodiment is generally configured using a computer. A computer that executes a text mining program functions as a text mining device.

Fig. 1 is a block diagram showing a configuration of a text mining device according to an embodiment of the present invention. The text mining device 10 shown in fig. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14. The text data 5 to be analyzed is input to the text mining device 10. The text mining device 10 performs hierarchical cluster analysis on words extracted from the inputted text data 5, and displays the analysis result on the screen.

The outline of the operation of the text mining device 10 is as follows. An instruction from the user is input to the instruction input unit 11. The text analysis unit 12 extracts words from the inputted text data 5, and performs hierarchical cluster analysis on the extracted words. The screen generation unit 13 generates screen data based on the analysis result of the text analysis unit 12. The analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.

The instructions from the user input to the instruction input unit 11 include: setting of the number of groups, setting of the maximum number of data in a group, setting of an analysis target period, word exclusion, near-synonym registration, compound word registration, and the like. When the text data 5 is time-series data having date and time information (year, month, day, and time), the text analysis unit 12 performs hierarchical cluster analysis on the words included in the portion of the input text data 5 that falls within the analysis target period set using the instruction input unit 11.
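The restriction to an analysis target period can be illustrated with a minimal Python sketch. The `Record` structure, the function name `select_period`, the sample records, and the use of inclusive bounds are assumptions made for illustration only; the description does not specify how the text data 5 is stored.

```python
from datetime import datetime
from typing import NamedTuple


class Record(NamedTuple):
    """One entry of time-series text data (hypothetical structure)."""
    timestamp: datetime
    text: str


def select_period(records, start, end):
    """Keep only records whose timestamp lies in [start, end] (inclusive, by assumption)."""
    return [r for r in records if start <= r.timestamp <= end]


# Made-up sample records for illustration.
records = [
    Record(datetime(2016, 4, 10, 9, 0), "report A"),
    Record(datetime(2016, 5, 2, 14, 30), "report B"),
    Record(datetime(2016, 6, 20, 11, 15), "report C"),
]

selected = select_period(records, datetime(2016, 5, 1), datetime(2016, 6, 30, 23, 59))
print([r.text for r in selected])  # ['report B', 'report C']
```

Only the selected records would then be passed on to word extraction and hierarchical cluster analysis.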

The screen generating unit 13 generates the screen data in accordance with the number of groups and the maximum number of data in the groups (details will be described later). When a new instruction is input by the user, the screen generation unit 13 generates new screen data and the analysis result display unit 14 displays the new screen after the instructed processing is performed. In this manner, the text mining device 10 switches the analysis method of the text data 5 and the display method of the analysis result in accordance with an instruction from the user.

Fig. 2 is a block diagram showing a configuration of a computer functioning as the text mining device 10. The computer 20 shown in fig. 2 includes a CPU (Central Processing Unit) 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27. The main memory 22 is, for example, a DRAM (Dynamic Random Access Memory). The storage unit 23 is, for example, a hard disk or a solid state drive. The input unit 24 includes, for example, a keyboard 28 and a mouse 29. The display unit 25 is, for example, a liquid crystal display. The communication unit 26 is an interface circuit for wired communication or wireless communication. The recording medium reading unit 27 is an interface circuit for the recording medium 30 in which a program and the like are stored. The recording medium 30 is a non-transitory recording medium such as a CD-ROM (Compact Disc Read-Only Memory), a DVD-ROM (Digital Versatile Disc Read-Only Memory), or a USB (Universal Serial Bus) memory.

When the computer 20 executes the text mining program 31, the storage unit 23 stores the text mining program 31 and the text data 5. The text mining program 31 and the text data 5 may be received from a server or another computer using the communication unit 26, or may be read from the recording medium 30 using the recording medium reading unit 27, for example.

When the text mining program 31 is executed, the text mining program 31 and the text data 5 are copied and transferred to the main memory 22. The CPU21 uses the main memory 22 as a work memory, and executes the text mining program 31 stored in the main memory 22 to process the text data 5 stored in the main memory 22. At this time, the computer 20 functions as the text mining device 10. The configuration of the computer 20 described above is merely an example, and the text mining device 10 may be configured using any computer.

In the following, Japanese data including Japanese words is taken as the text data 5. Fig. 17 is a diagram showing words appearing in the drawings and the description thereof. Each line in fig. 17 shows a word (Japanese word) and the meaning of the word. In the following description, when Japanese words are referred to, the meanings of the words may be described in parentheses after the words. The text data 5 may be data in any language.

Fig. 3 is a diagram showing a display screen of the text mining device 10. The display screen 40 shown in fig. 3 includes an analysis result screen 41 and an analysis setting screen 42. The analysis result of the text analysis unit 12 is displayed on the analysis result screen 41. A GUI (Graphical User Interface) component for setting the analysis method of the text analysis unit 12 and the characteristics of the screen data generated by the screen generation unit 13 is displayed on the analysis setting screen 42.

If the number of clusters is set for the results of hierarchical cluster analysis, the words contained in each cluster are determined. When displaying the result of the hierarchical cluster analysis of the words extracted from the text data 5 on the screen, the text mining device 10 displays the group corresponding to the cluster in the manner shown in fig. 3 instead of the tree diagram.
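How setting the number of clusters determines the words in each cluster can be sketched as follows. This is a minimal illustration, not the method of the text analysis unit 12: the description does not state the similarity measure or the linkage rule, so average linkage, the function name `cluster_words`, and the toy one-dimensional "positions" used as distances are all assumptions.

```python
from itertools import combinations


def cluster_words(words, distance, m):
    """Agglomerative clustering: merge the closest pair of clusters until m remain.

    Average linkage is assumed here; the source does not specify a linkage rule.
    """
    clusters = [[w] for w in words]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > max(1, m):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: each word gets a made-up position; distance is the gap between positions.
positions = {"a": 0, "b": 1, "c": 10, "d": 11, "e": 30}


def dist(x, y):
    return abs(positions[x] - positions[y])


print(cluster_words(list(positions), dist, 3))  # [['a', 'b'], ['c', 'd'], ['e']]
print(cluster_words(list(positions), dist, 2))  # [['a', 'b', 'c', 'd'], ['e']]
```

Stopping the merging at m clusters corresponds to drawing the division line on the tree diagram at the height where m branches remain.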

In the following description, the clusters displayed on the screen are also referred to as groups. The user specifies the number of groups (the number of clusters) and the maximum number of data in the group (the upper limit of the number of words included in the group) using the instruction input unit 11. Hereinafter, the former is referred to as m, and the latter is referred to as n.

In the text mining device 10, the words included in the text data 5 are classified into m clusters, and each cluster includes 1 or more words. The m groups are displayed on the analysis result screen 41, and words are displayed inside each group. Each group is displayed as a cloud-shaped figure, and each word contained in the group is displayed inside an oval region. The number of words displayed in each group is limited to n or fewer. For example, when n is 5 and a cluster includes 10 words, 5 words are displayed inside the corresponding group on the analysis result screen 41.

On the analysis setting screen 42, a first slider and 2 first buttons (marked with "+" or "-") for setting the number m of groups, a second slider and 2 second buttons for setting the maximum number n of data in a group, and 4 boxes and 2 third buttons (marked with left arrow or right arrow) for setting the period of time to be analyzed are displayed.

The user indicates the number of groups m by operating the mouse 29 to move the knob of the first slider to the left or right, or to press a first button. The number of groups m increases when the first button marked with "+" is pressed, and decreases when the first button marked with "-" is pressed. The initial value of the number of groups m is set to, for example, the square root of the number of word types included in the analysis result of the text analysis unit 12, or an integer close to that square root. For example, when the analysis result of the text analysis unit 12 includes 16 types of words, the initial value of the number of groups m is set to 4.
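The initial value of m can be written as a one-line computation. In the sketch below, rounding to the nearest integer and the lower bound of 1 are assumptions; the description only requires the square root or an integer close to it. The function name is hypothetical.

```python
import math


def initial_group_count(num_word_types):
    """Initial number of groups m: an integer close to sqrt(number of word types)."""
    # Rounding to nearest and the floor of 1 are illustrative assumptions.
    return max(1, round(math.sqrt(num_word_types)))


print(initial_group_count(16))  # 4
print(initial_group_count(20))  # 4
```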

The user indicates the maximum number of data n in a group by operating the mouse 29 to move the knob of the second slider to the left or right, or to press a second button. The maximum number of data n in a group increases or decreases when a second button is pressed. The initial value of the maximum number of data n in a group is set to, for example, 5.

In the case where the text data 5 is time-series data, the user indicates the analysis target period by operating the keyboard 28 or the mouse 29 to specify the year, month, day, and time using the 4 boxes, or by pressing a third button. The analysis target period moves toward the past by a predetermined amount (for example, 1 month) when the third button marked with a left arrow is pressed, and moves in the opposite direction by the predetermined amount when the third button marked with a right arrow is pressed. The initial value of the analysis target period is set to, for example, the period from the oldest time to the newest time in the text data 5. In the case where the text data 5 is not time-series data, the user cannot specify the analysis target period.
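Moving the analysis target period by the example amount of 1 month can be sketched as below. The function names, the choice to move both ends of the period together, and the clamping of the day to the length of the destination month are assumptions for illustration.

```python
import calendar
from datetime import datetime


def shift_months(moment, months):
    """Move a datetime by a whole number of months, clamping the day if needed."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def shift_period(start, end, months):
    """Shift both ends of the period; negative values move toward the past."""
    return shift_months(start, months), shift_months(end, months)


# Left-arrow button: one month toward the past (2016 is a leap year).
start, end = shift_period(datetime(2016, 3, 1), datetime(2016, 3, 31, 23, 59), -1)
print(start, end)  # 2016-02-01 00:00:00 2016-02-29 23:59:00
```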

On the analysis result screen 41, 1 to m groups are displayed, and 1 to n words are displayed inside each group. On the screen, a group is displayed larger as the total value of the appearance frequencies of the words included in the corresponding cluster increases. When the number of words included in the cluster exceeds n, the n words with the highest appearance frequencies are displayed in the group. On the screen, a word included in a group and the oval region containing that word are displayed larger as the appearance frequency of the word increases. Each group is labeled with a name. The word with the highest appearance frequency among the words included in the cluster is used as the name of the group. The name of the group is underlined and displayed inside the group. In addition, when a word cannot be displayed inside its oval region, the symbol "…" is displayed instead of the word.
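The display rules for one group can be summarized in a short Python sketch. The dictionary layout, the function name `build_group`, the alphabetical tie-break, the use of raw frequencies as "sizes", and the sample words and counts are all assumptions; the description states only that sizes correspond to frequencies, not how they are scaled.

```python
def build_group(cluster_freq, n):
    """Derive display data for one group from a cluster's word frequencies.

    cluster_freq maps each word in the cluster to its appearance frequency.
    The cluster is assumed to contain at least 1 word, as stated in the text.
    """
    # Highest frequency first; ties broken alphabetically (an assumption).
    ranked = sorted(cluster_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "name": ranked[0][0],                # most frequent word names the group
        "size": sum(cluster_freq.values()),  # total frequency of ALL words in the cluster
        "words": [{"text": w, "size": f} for w, f in ranked[:n]],  # at most n words
    }


# Made-up frequencies for illustration.
group = build_group({"gakusei": 12, "shakai": 7, "jugyou": 3}, n=2)
print(group["name"], group["size"], [w["text"] for w in group["words"]])
# gakusei 22 ['gakusei', 'shakai']
```

Note that the group size is computed from every word in the cluster, including words that are not displayed because of the limit n.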

A third slider and 2 fourth buttons (marked with "+" or "-") for specifying a zoom magnification are displayed on the analysis result screen 41. The user sets the zoom magnification by operating the mouse 29 to move the knob of the third slider to the left or right, or to press a fourth button. On the analysis result screen 41, the groups including the words are displayed enlarged or reduced according to the set zoom magnification. The initial value of the zoom magnification is set to 100%. All the groups are displayed on the initial analysis result screen 41.

When the user changes the number of groups m, the maximum number of data n in a group, or the analysis target period on the analysis setting screen 42, the content of the analysis result screen 41 changes accordingly. When the user instructs word exclusion, near-synonym registration, or compound word registration on the analysis result screen 41, the content of the analysis result screen 41 also changes in accordance with the instruction.
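Which processing is repeated for each kind of instruction can be sketched as a small dispatch table. The grouping follows the embodiments described earlier (the number of groups and the maximum number of data affect screen generation; the analysis target period, word exclusion, near-synonym registration, and compound word registration affect text analysis). The instruction names and step labels are hypothetical identifiers introduced for this sketch.

```python
# Instructions that change the input to the hierarchical cluster analysis.
REANALYZE = {"period", "exclude_word", "near_synonym", "compound_word"}
# Instructions that only change how the existing analysis result is displayed.
REDRAW_ONLY = {"group_count", "max_words"}


def steps_for(instruction):
    """Return the processing steps to run after a user instruction."""
    if instruction in REANALYZE:
        return ["text_analysis", "screen_generation", "display"]
    if instruction in REDRAW_ONLY:
        return ["screen_generation", "display"]
    raise ValueError(f"unknown instruction: {instruction}")


print(steps_for("group_count"))  # ['screen_generation', 'display']
print(steps_for("period"))       # ['text_analysis', 'screen_generation', 'display']
```

Separating the two cases means that changing m or n does not require the analysis to be run again.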

When performing hierarchical cluster analysis on words extracted from the text data 5, the text mining device 10 refers to an excluded word list in which words to be excluded are stored, a near-synonym list in which words to be processed as near-synonyms are stored, and a compound word list in which words to be processed as compound words are stored. In the near-synonym list, a plurality of words having the same (or substantially the same) meaning are stored in association with 1 word representing them. In the compound word list, words that form 1 compound word when connected are stored in association with the compound word obtained by connecting them. For example, "daigakusei" and "gakusei" are stored in the near-synonym list in association with "daigakusei", which represents both of them. For example, "nintai" and "tsuyoi" are stored in the compound word list in association with "nintaizuyoi", which is obtained by connecting them. The text mining device 10 may have a plurality of near-synonym lists and a plurality of compound word lists.
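The use of the three lists before clustering can be sketched as follows. The function name, the data structures, the order of the three operations, the restriction to two-word compounds, and the token "wa" in the sample input are assumptions for illustration; the romanized words "daigakusei", "gakusei", "nintai", "tsuyoi", and "nintaizuyoi" follow the examples in the description.

```python
def apply_word_lists(tokens, excluded, near_synonyms, compounds):
    """Apply compound word, near-synonym, and excluded word lists to a token sequence.

    compounds maps a pair of adjacent words to the compound word they form.
    near_synonyms maps a word to the 1 word that represents it.
    """
    # 1) Merge adjacent words registered as a compound word.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in compounds:
            merged.append(compounds[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # 2) Replace each word by its representative near-synonym, if any.
    replaced = [near_synonyms.get(t, t) for t in merged]
    # 3) Drop words on the excluded word list.
    return [t for t in replaced if t not in excluded]


tokens = ["gakusei", "wa", "nintai", "tsuyoi"]
result = apply_word_lists(
    tokens,
    excluded={"wa"},
    near_synonyms={"gakusei": "daigakusei"},
    compounds={("nintai", "tsuyoi"): "nintaizuyoi"},
)
print(result)  # ['daigakusei', 'nintaizuyoi']
```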

Fig. 4 is a flowchart showing the operation of the text mining device 10. Fig. 5 is a flowchart showing details of the screen data generation processing (step S111 shown in fig. 4) of the text mining device 10. The input unit 24 and the CPU21 executing step S113 function as the instruction input unit 11. The CPU21 that executes steps S109 to S110 functions as the text analysis unit 12. The CPU21 executing step S111 functions as the screen generating unit 13. The display unit 25 and the CPU21 executing step S112 function as the analysis result display unit 14. The operation of the text mining device 10 will be described below with reference to fig. 4 and 5.

First, the CPU21 causes the display unit 25 to display the data designation screen 51 shown in fig. 6 (step S101). A box for specifying a file name and a box for specifying a folder name are displayed on the data designation screen 51. The user specifies the text data 5 to be analyzed by specifying a file name or a folder name on the data designation screen 51. The text data 5 may be stored in the storage unit 23 such as a hard disk, or may be stored in a server or another computer connected via the communication unit 26.

Next, the CPU21 transfers the text data 5 specified using the data specifying screen 51 to the main memory 22. In this way, the text data 5 is input to the text mining device 10 (step S102). Fig. 7 is a diagram showing an example of the text data 5. The text data shown in fig. 7 is data of reports created by a university student, and is time-series data having year, month, and day information. The text data shown in fig. 7 includes passages such as "… about the relationship between college students and society in the content of the lecture", "… for working before college students enter the society after graduation", "… for our students to pay a high learning fee for learning", and "life of students is a precious time for growth of confidence of their own. And …". The type of text data 5 analyzed by the text mining device 10 is arbitrary.

Next, the CPU21 causes the display unit 25 to display the object specifying screen 52 shown in fig. 8 (step S103). On the object specifying screen 52, 3 radio buttons corresponding to content, feature, and evaluation are displayed. The user selects an analysis target from among content, feature, and evaluation by operating the mouse 29 to press one of the radio buttons. Next, the CPU21 receives the analysis target specified using the object specifying screen 52. In this way, the analysis target is input to the text mining device 10 (step S104).

Next, the CPU21 causes the display unit 25 to display the similar meaning word list selection screen 53 shown in fig. 9 (step S105). The similar meaning word list selection screen 53 displays the names of the similar meaning word lists of the text mining device 10 and the similar meaning words registered in those lists. The user specifies a similar meaning word list to be used by operating the mouse 29 to select one of the similar meaning word lists on the similar meaning word list selection screen 53. In this way, the similar meaning word list is selected in the text mining device 10 (step S106).

Next, the CPU21 causes the display unit 25 to display the compound word list selection screen 54 shown in fig. 10 (step S107). The compound word list selection screen 54 displays the names of the compound word lists of the text mining device 10 and the compound words registered in the compound word lists. The user specifies a compound word list to be used by operating the mouse 29 to select any one of the compound word lists in the compound word list selection screen 54. In this way, the compound word list is selected in the text mining device 10 (step S108).

Next, taking into account the excluded word list, the similar meaning word list, and the compound word list, the CPU21 extracts words of the type corresponding to the analysis target designated in step S104 from the text data belonging to the analysis target period among the text data 5 input in step S102 (step S109). When the analysis target is "content", the CPU21 extracts nouns, proper nouns, place names, and person names from the text data 5. When the analysis target is "feature", the CPU21 extracts nouns, proper nouns, sa-hen nouns (サ変名詞, nouns that form verbs with "suru"), and verbs from the text data 5. When the analysis target is "evaluation", the CPU21 extracts adjectives, adjectival verbs, and interjections from the text data 5. The text mining device 10 may also support analysis targets other than the above 3, and the CPU21 may extract types of words different from those described above for each analysis target.
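The selection of word types by analysis target can be illustrated with a short Python sketch. It assumes that some morphological analyzer has already produced (word, part-of-speech) pairs; the tag names, the function name `extract_words`, and the sample tokens are hypothetical and are not taken from the embodiment.

```python
# Hypothetical part-of-speech tag names; a real morphological analyzer
# would supply its own tag set.
POS_BY_TARGET = {
    "content": {"noun", "proper_noun", "place_name", "person_name"},
    "feature": {"noun", "proper_noun", "sa_hen_noun", "verb"},
    "evaluation": {"adjective", "adjectival_verb", "interjection"},
}


def extract_words(tagged_tokens, target):
    """Keep only the words whose part of speech matches the analysis target."""
    allowed = POS_BY_TARGET[target]
    return [word for word, pos in tagged_tokens if pos in allowed]


# Invented sample input for illustration only.
tagged = [("daigakusei", "noun"), ("tsuyoi", "adjective"), ("manabu", "verb")]
content_words = extract_words(tagged, "content")
feature_words = extract_words(tagged, "feature")
evaluation_words = extract_words(tagged, "evaluation")
```

Keeping the mapping in a table makes it straightforward to add analysis targets beyond the 3 described here.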

When the text data 5 is time-series data, the CPU21, when executing step S109, extracts words only from the portion of the text data 5 that falls within the analysis target period indicated by the user. When the word W1 is stored in the excluded word list, the CPU21 ignores every occurrence of the word W1 in the text data 5 when executing step S109. When the word W2 and the word W3 are stored in the selected similar meaning word list in association with the word W2 representing both of them, the CPU21 treats every occurrence of the word W3 in the text data 5 as the word W2 when executing step S109. When the word W4 and the word W5 are stored in the selected compound word list in association with the word W6 obtained by connecting them, the CPU21 treats every occurrence of the word W4 immediately followed by the word W5 in the text data 5 as the word W6 when executing step S109.
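The handling of the words W1 to W6 can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the function name `normalize`, the dictionary layout of the lists, and the order in which the lists are applied (compound words first, then similar meaning words, then exclusion) are assumptions.

```python
def normalize(tokens, excluded, synonyms, compounds):
    """Apply the compound, similar meaning, and excluded word lists.

    synonyms maps a word to its representative word (W3 -> W2).
    compounds maps a pair of adjacent words to a compound ((W4, W5) -> W6).
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in compounds:
            # Two connected words registered as a compound become one word.
            merged.append(compounds[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Replace each word by its representative, if one is registered.
    replaced = [synonyms.get(word, word) for word in merged]
    # Drop every excluded word.
    return [word for word in replaced if word not in excluded]


tokens = ["gakusei", "nintai", "tsuyoi", "shakai"]
result = normalize(
    tokens,
    excluded={"shakai"},
    synonyms={"gakusei": "daigakusei"},
    compounds={("nintai", "tsuyoi"): "nintaizuyoi"},
)
```

With these inputs, `result` contains "daigakusei" and "nintaizuyoi", and "shakai" is dropped.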

Next, the CPU21 performs hierarchical cluster analysis on the words extracted in step S109 (step S110). In step S110, the CPU21 obtains the similarity between 2 words, for example, from the distance between the 2 words in the text data 5 (a distance indicating how far apart the 2 words appear). Based on the obtained similarity between words, the CPU21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the decimal method, Ward's method, or the like). In step S110, the CPU21 also determines the frequency of appearance of each word.
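As an illustration of one of the listed methods, the following Python sketch performs agglomerative clustering with the shortest distance method and stops when m clusters remain. The function name `single_linkage`, the representation of distances as a dictionary keyed by word pairs, and the distance values are all assumptions made for this example; the embodiment does not specify them.

```python
from itertools import combinations


def single_linkage(words, dist, m):
    """Merge the two closest clusters repeatedly until m clusters remain."""
    clusters = [[word] for word in words]

    def cluster_distance(a, b):
        # Shortest distance method: distance between the closest members.
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > m:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        # combinations() yields i < j, so deleting index j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


words = ["daigakusei", "gakusei", "shakai", "nintai"]
# Invented pairwise distances for illustration only.
dist = {
    frozenset(("daigakusei", "gakusei")): 1.0,
    frozenset(("daigakusei", "shakai")): 4.0,
    frozenset(("daigakusei", "nintai")): 6.0,
    frozenset(("gakusei", "shakai")): 3.0,
    frozenset(("gakusei", "nintai")): 5.0,
    frozenset(("shakai", "nintai")): 2.0,
}
two_clusters = single_linkage(words, dist, 2)
```

For this linkage, stopping the merging at m clusters yields the same partition as building the full tree and cutting it at m clusters.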

Next, the CPU21 generates screen data for displaying the analysis result based on the result of the hierarchical cluster analysis obtained in step S110 (step S111). In step S111, the CPU21 performs the processing shown in fig. 5.

The CPU21 sets the number of groups to m and sets the maximum number of data in a group to n (step S201). Next, the CPU21 sets the number of clusters to m and obtains m clusters from the result of the hierarchical cluster analysis (step S202). Next, for each cluster, the CPU21 obtains the total value of the appearance frequencies of the words included in the cluster (step S203). Next, the CPU21 determines the display size of each group based on the total value of the appearance frequencies obtained in step S203 (step S204). In step S204, the larger the total value of the appearance frequencies of the words included in the cluster, the larger the display size determined for the group.
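Steps S203 and S204 can be sketched in Python as follows. The embodiment states only that a larger total gives a larger group; the linear scaling relative to the largest total, the function name `group_sizes`, and the frequency values are assumptions made for illustration.

```python
def group_sizes(clusters, freq, base=100.0):
    """Return (total frequency, display size) for each cluster.

    The size is scaled linearly so that the cluster with the largest
    total receives the size `base` (an assumed rule).
    """
    totals = [sum(freq[word] for word in cluster) for cluster in clusters]
    largest = max(totals)
    return [(total, base * total / largest) for total in totals]


clusters = [["daigakusei", "gakusei"], ["shakai", "nintai"]]
# Invented appearance frequencies for illustration only.
freq = {"daigakusei": 12, "gakusei": 8, "shakai": 7, "nintai": 3}
sizes = group_sizes(clusters, freq)
```

Here the first cluster has a total of 20 and the second a total of 10, so the second group is drawn at half the size of the first.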

Next, the CPU21 selects a word to be displayed from the words included in the clusters for each cluster (step S205). In step S205, n or fewer words are selected from the words included in each cluster in descending order of the frequency of appearance. Next, the CPU21 determines the display size of each word selected in step S205, based on the frequency of appearance of the word (step S206). In step S206, the display size of the word is determined to be larger for words with higher frequency of appearance.
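Steps S205 and S206 can be sketched in Python as follows. The selection in descending order of frequency follows the description; the linear mapping of frequency to a point size, the function names, and the sample frequencies are assumptions for illustration.

```python
def select_words(cluster, freq, n):
    """Pick at most n words from a cluster in descending order of frequency."""
    return sorted(cluster, key=lambda word: freq[word], reverse=True)[:n]


def word_size(count, max_count, min_pt=10, max_pt=40):
    """Map a frequency to a display size; more frequent words are larger."""
    return min_pt + (max_pt - min_pt) * count / max_count


# Invented appearance frequencies for illustration only.
freq = {"daigakusei": 12, "gakusei": 8, "shakai": 7, "shingaku": 2}
cluster = ["shakai", "daigakusei", "gakusei", "shingaku"]
shown = select_words(cluster, freq, 2)
largest = word_size(12, 12)
middle = word_size(6, 12)
```

Because `sorted` is stable, words with equal frequency keep their original relative order; the embodiment does not say how ties are resolved.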

Next, the CPU21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207). The screen data generated in step S207 includes m groups (each represented by a cloud pattern) having the sizes determined in step S204. Each group contains n or fewer words having the sizes determined in step S206. On the screen, the words are displayed inside their group. After executing step S207, the CPU21 ends the screen data generation process.

Next, the CPU21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112). Next, the CPU21 receives an instruction from the user (step S113). Next, the CPU21 proceeds to any one of steps S115 to S120 according to the type of the instruction received in step S113 (step S114).

If the instruction received in step S113 is "setting the number of groups", the CPU21 proceeds to step S115. In this case, the CPU21 sets the number of groups m to a value instructed by the user (step S115), and proceeds to step S111. Thereafter, screen data is generated based on the set number m of groups, and a new screen is displayed. In this way, the analysis result screen including the designated number of groups is displayed.

If the instruction received in step S113 is "setting of the maximum number of data in the group", the CPU21 proceeds to step S116. In this case, the CPU21 sets the maximum number of data n in the group to the value instructed by the user (step S116), and proceeds to step S111. Then, screen data is generated based on the maximum number n of data in the set group, and a new screen is displayed. In this way, the analysis result screen in which the number of words included in each group is limited to the specified value or less is displayed.

If the instruction received in step S113 is "setting of the analysis target period", the CPU21 proceeds to step S117. In this case, the CPU21 sets the analysis target period to the period instructed by the user (step S117), and proceeds to step S109. Then, hierarchical cluster analysis is performed with reference to the set analysis target period, screen data for displaying a new analysis result is generated, and a new screen is displayed. In this way, the result of performing hierarchical cluster analysis on the words included in the text data in the designated analysis target period is displayed on the screen.

Fig. 11A is a diagram showing an analysis result screen before the analysis target period is set. Fig. 11B is a diagram showing an analysis result screen after the analysis target period is set. On the analysis result screen 61 before the setting shown in fig. 11A, the results of hierarchical cluster analysis of words included in text data from 0 point 1/0/2014 to 0 point 24/31/2015 in the inputted text data 5 are displayed. On the analysis result screen 62 after the setting shown in fig. 11B, the results of hierarchical cluster analysis of words included in text data from 0 point 3/1/0/2014 to 0 point 24/30/9/2014 in the inputted text data 5 are displayed. The display content of the analysis result screen 61 is different from that of the analysis result screen 62. The user can easily recognize the temporal change of the hierarchical cluster analysis result by comparing the analysis result screens before and after the analysis target period is set.
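Restricting the analysis to a period can be sketched in Python as a filter over dated records. The record layout (a date paired with its text), the function name `in_period`, and the sample dates are assumptions for illustration and do not reproduce the periods of fig. 11A and 11B.

```python
from datetime import date


def in_period(records, start, end):
    """Return the texts whose date lies within [start, end], inclusive."""
    return [text for day, text in records if start <= day <= end]


# Invented dated records for illustration only.
records = [
    (date(2014, 2, 10), "report A"),
    (date(2014, 5, 1), "report B"),
    (date(2015, 1, 15), "report C"),
]
selected = in_period(records, date(2014, 4, 1), date(2014, 6, 30))
```

Only the texts returned by the filter would then be passed to word extraction in step S109.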

In the case where the instruction received in step S113 is "word exclusion", the CPU21 proceeds to step S118. In this case, the CPU21 adds the specified word to the excluded word list (step S118), and proceeds to step S109. Then, the designated word is excluded and hierarchical cluster analysis is performed, screen data for displaying a new analysis result is generated, and a new screen is displayed. In this way, the result of excluding the specified word and performing hierarchical cluster analysis is displayed on the screen.

Fig. 12A is a diagram showing an analysis result screen before word exclusion. Fig. 12B is a diagram showing an analysis result screen after word exclusion. The user operates the mouse 29 to select a word to be excluded, and then instructs word exclusion. In the analysis result screen 63 before word exclusion shown in fig. 12A, "shakai (society)" is selected, and "word exclusion" is selected from the menu. Thereafter, the screen displays the results of excluding "shakai" and performing hierarchical cluster analysis. In the analysis result screen 64 after word exclusion shown in fig. 12B, "shingaku (ascending school)" is displayed instead of "shakai". Among the words contained in the same cluster as "shakai", "shingaku" is the word with the next-highest frequency of appearance after the 5 words displayed in the analysis result screen 63.

If the instruction received in step S113 is "similar meaning word registration", the CPU21 proceeds to step S119. In this case, the CPU21 adds the indicated words to the similar meaning word list in use (step S119), and proceeds to step S109. Then, hierarchical cluster analysis is performed in consideration of the indicated similar meaning words, screen data for displaying a new analysis result is generated, and a new screen is displayed. In this way, the result of performing hierarchical cluster analysis using the indicated words as similar meaning words is displayed on the screen.

Fig. 13A is a diagram showing an analysis result screen before similar meaning word registration. Fig. 13B is a diagram showing an analysis result screen after similar meaning word registration. The user operates the mouse 29 to select a plurality of words to be registered as similar meaning words, and then instructs similar meaning word registration. In the analysis result screen 65 before similar meaning word registration shown in fig. 13A, "daigakusei" and "gakusei" are selected, and "similar meaning word registration" is selected from the menu. Thereafter, the screen displays the results of hierarchical cluster analysis using "daigakusei" and "gakusei" as similar meaning words. In the analysis result screen 66 after similar meaning word registration shown in fig. 13B, "shingaku" is displayed instead of "gakusei". In addition, because its display size is now based on the total of the appearance frequency of "daigakusei" and the appearance frequency of "gakusei", "daigakusei" is displayed in a larger size than in the analysis result screen 65.
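The effect of similar meaning word registration on appearance frequencies can be sketched in Python with `collections.Counter`. The function name `merge_synonyms` and the frequency values are assumptions for illustration.

```python
from collections import Counter


def merge_synonyms(freq, synonyms):
    """Add each word's count to the count of its representative word."""
    merged = Counter()
    for word, count in freq.items():
        merged[synonyms.get(word, word)] += count
    return merged


# Invented appearance frequencies for illustration only.
freq = Counter({"daigakusei": 12, "gakusei": 8, "shingaku": 2})
merged = merge_synonyms(freq, {"gakusei": "daigakusei"})
```

After merging, "gakusei" no longer appears as a separate word, and the count of "daigakusei" is the sum of both counts, which is why it is drawn larger.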

The CPU21 proceeds to step S120 in the case where the instruction received in step S113 is "compound word registration". In this case, the CPU21 adds the indicated word to the compound word list in use (step S120), and proceeds toward step S109. Thereafter, hierarchical cluster analysis is performed in consideration of the indicated compound word, screen data for displaying a new analysis result is generated, and a new screen is displayed. In this way, the result of performing hierarchical cluster analysis using the specified word as a compound word is displayed on the screen.

Fig. 14A is a diagram showing an analysis result screen before compound word registration. Fig. 14B is a diagram showing an analysis result screen after compound word registration. The user operates the mouse 29 to select a plurality of words to be registered as a compound word, and then instructs compound word registration. In the analysis result screen 67 before compound word registration shown in fig. 14A, "nintai (tolerance)" and "tsuyoi (strong)" are selected, and "compound word registration" is selected from the menu. Then, the results of hierarchical cluster analysis using "nintai" and "tsuyoi" as a compound word are displayed on the screen. In the analysis result screen 68 after compound word registration shown in fig. 14B, "nintaizuyoi (strong endurance)" is displayed instead of "nintai" and "tsuyoi", at a size no larger than the sizes of "nintai" and "tsuyoi".
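The branching of steps S114 to S120 can be summarized in Python as a table that maps each instruction to the step that handles it and the step at which processing resumes. The step numbers come from the description above; the instruction key names and the helper `needs_reanalysis` are hypothetical.

```python
# instruction -> (step that handles it, step at which processing resumes)
DISPATCH = {
    "set_group_count": ("S115", "S111"),
    "set_max_data": ("S116", "S111"),
    "set_period": ("S117", "S109"),
    "exclude_word": ("S118", "S109"),
    "register_synonym": ("S119", "S109"),
    "register_compound": ("S120", "S109"),
}


def needs_reanalysis(instruction):
    """True when word extraction and clustering must be run again."""
    return DISPATCH[instruction][1] == "S109"


rerun_for_exclusion = needs_reanalysis("exclude_word")
rerun_for_group_count = needs_reanalysis("set_group_count")
```

Changing m or n only regenerates the screen data from the existing analysis result, whereas the other 4 instructions change the set of extracted words and therefore return to step S109.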

As described above, the text mining method according to the present embodiment includes: a text analysis step of performing hierarchical cluster analysis on words extracted from the inputted text data; a screen generation step of generating screen data based on an analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data. In the screen generation step, based on the number m of groups and the maximum number n of data in a group, m clusters are obtained from the analysis result, and screen data for displaying on the screen groups each including n or fewer words included in the corresponding cluster is generated. According to the text mining method of the present embodiment, a group including words included in a cluster can be displayed on a screen based on a result of performing hierarchical cluster analysis on the words included in text data. Also, the number of words included in a group is limited to n or fewer. Therefore, when the user sees the screen, the user can intuitively understand the result of the hierarchical cluster analysis.

The words included in the group are selected from the words included in the cluster corresponding to the group in descending order of the frequency of appearance. Therefore, inside the group, the word with the higher frequency of appearance among the words contained in the cluster is displayed. Therefore, the user can easily recognize the words with high frequency of appearance included in each cluster. The group has a size corresponding to a total value of the appearance frequencies of the words included in the cluster corresponding to the group on the screen. Therefore, the user can easily recognize a cluster in which the aggregate value of the word appearance frequencies is large. The size of the word included in the group is set to correspond to the frequency of appearance of the word in the screen. Therefore, the user can easily recognize the words whose appearance frequency is high.

The text mining method further includes an instruction input step for inputting an instruction from a user, and either the text analysis step or the screen generation step is executed in accordance with the instruction input in the instruction input step. Therefore, the display mode of the result of the hierarchical cluster analysis can be switched according to the instruction from the user. In particular, the instruction input step receives a setting instruction for the number m of groups, and the screen generation step generates screen data based on the number m of groups designated in the instruction input step. In this way, the number of groups (the number of clusters) displayed on the screen is switched in accordance with an instruction from the user. In the instruction input step, a setting instruction for the maximum number n of data in a group is received, and in the screen generation step, screen data is generated based on the maximum number n of data in a group designated in the instruction input step. In this way, the number of words displayed in each group is switched in accordance with the instruction from the user.

In the instruction input step, an instruction for the analysis target period is received, and in the text analysis step, hierarchical cluster analysis is performed on words included in the text data in the analysis target period specified in the instruction input step. Therefore, the result of performing hierarchical cluster analysis on the words included in the text data in the analysis target period indicated by the user is displayed on the screen. Therefore, the user can easily recognize the time-dependent change of the result of the hierarchical cluster analysis. In the instruction input step, an instruction to set an analysis target is received, and in the text analysis step, a word of a type corresponding to the analysis target set in the instruction input step is extracted from the text data 5, and hierarchical cluster analysis is performed. In this way, the results obtained by performing hierarchical clustering analysis by switching the word type of the analysis target in accordance with the analysis target instructed by the user can be displayed on the screen.

In the instruction input step, a word exclusion instruction is received, and in the text analysis step, the word indicated in the instruction input step is excluded and hierarchical cluster analysis is performed. In this way, the result of excluding the word indicated by the user and performing hierarchical cluster analysis can be displayed. In the instruction input step, a similar meaning word registration instruction is received, and in the text analysis step, hierarchical cluster analysis is performed on a plurality of words instructed in the instruction input step, the plurality of words being regarded as identical words. In this way, the result of performing hierarchical cluster analysis with a plurality of words indicated by the user regarded as the same word can be displayed on the screen. In the instruction input step, a compound word registration instruction is received, and in the text analysis step, a plurality of words instructed in the instruction input step are combined into 1 word and hierarchical cluster analysis is performed. In this way, the result of performing hierarchical cluster analysis by merging a plurality of words indicated by the user into 1 word can be displayed on the screen.

In the screen generating step, screen data for displaying an analysis result screen including the groups and an analysis setting screen for setting a display mode of the analysis result screen is generated. Accordingly, the analysis result screen and the analysis setting screen are displayed together, and the user can easily switch the display mode of the result of the hierarchical cluster analysis using the analysis setting screen.

The text mining program 31 and the text mining device 10 according to the present embodiment have the same features as the text mining method according to the present embodiment, and exhibit the same effects.

According to the text mining method, the text mining program, and the text mining device of the present embodiment, a group including words included in a cluster, limited in number to the maximum number of data or fewer, can be displayed on a screen based on a result of performing hierarchical cluster analysis on words included in text data. Therefore, when the user sees the screen, the user can intuitively understand the result of the hierarchical cluster analysis.

Further, the present application claims priority based on Japanese patent application 2016-.

Description of reference numerals

5 text data

10 text mining device

11 instruction input unit

12 text analysis unit

13 screen generating unit

14 analysis result display unit

20 computer

21 CPU

22 main memory

23 storage unit

24 input unit

25 display unit

30 recording medium

31 text mining program

40 display screen

41, 61 to 68 analysis result screen

42 analysis setting screen

51 data specifying screen

52 object specifying screen

53 similar meaning word list selection screen

54 compound word list selection screen

Claims

1. A text mining method for displaying an analysis result of text data on a screen, comprising:

in the screen generating step, based on the number of groups and the maximum number of data in a group, clusters equal in number to the number of groups are obtained from the analysis result, and screen data for displaying on a screen a group including words that belong to the cluster and number no more than the maximum number of data is generated,

the group is labeled, as its name, with the word having the highest frequency of appearance among the words included in the cluster.

2. The text mining method of claim 1,

3. The text mining method of claim 2,

4. The text mining method of claim 3,

5. The text mining method of claim 1,

6. The text mining method of claim 5,

7. The text mining method of claim 5,

8. The text mining method of claim 5,

9. The text mining method of claim 5,

10. The text mining method of claim 5,

in the instruction input step, a word exclusion instruction is received,

11. The text mining method of claim 5,

12. The text mining method of claim 5,

13. The text mining method of claim 1,

14. A computer-readable recording medium characterized in that,

a text mining program for displaying an analysis result of text data on a screen is recorded thereon, the text mining program causing a CPU of a computer to execute the following steps using a memory:

15. The recording medium of claim 14,

16. The recording medium of claim 15,

17. The recording medium of claim 16,

18. The recording medium of claim 14,

19. The recording medium of claim 14,

20. A text mining device for displaying an analysis result of text data on a screen, comprising:

a text analysis unit for performing hierarchical cluster analysis on words extracted from the inputted text data,

a screen generating unit for generating screen data based on the analysis result of the text analysis unit, and

an analysis result display unit for displaying a screen based on the screen data;

the screen generating unit obtains, based on the number of groups and the maximum number of data in a group, clusters equal in number to the number of groups from the analysis result, and generates screen data for displaying on a screen a group including words that belong to the cluster and number no more than the maximum number of data,

21. The text mining device of claim 20,

22. The text mining device of claim 21,

23. The text mining device of claim 22,

24. The text mining device of claim 20,

25. The text mining device of claim 20,