WO2018020842A1 - Text mining method, text mining program, and text mining apparatus

Info

Publication number
WO2018020842A1
Authority
WO
WIPO (PCT)
Prior art keywords
screen
analysis
text
data
group
Prior art date
Application number
PCT/JP2017/020922
Other languages
English (en)
Japanese (ja)
Inventor
正史 秋田
中村 康則
景龍 周
Original Assignee
株式会社Screenホールディングス
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Screenホールディングス
Priority to CN201780043375.8A (patent CN109478191B)
Priority to KR1020197000933A (patent KR102180487B1)
Publication of WO2018020842A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor

Definitions

  • the present invention relates to text mining, and more particularly, to a text mining method, a text mining program, and a text mining apparatus for displaying an analysis result of text data on a screen.
  • Text mining, which analyzes a large amount of freely described text data and seeks useful information from the analysis results, has attracted attention.
  • In text mining, for example, words are extracted from the text data to be analyzed, and information is obtained by analyzing the appearance frequency and appearance tendency of the words.
  • Patent Document 1 describes a clustering device having a hierarchical clustering unit that constructs a tree diagram, searches the tree diagram, generates an index that can identify the upper layer from the lower layer, and stores the index in a storage unit.
  • Patent Document 2 describes a query providing apparatus that includes a distance matrix calculation unit that calculates distances between keywords, generates distance matrix data from which the distance between keywords can be looked up by keyword, and stores the distance matrix data in a storage unit; and a clustering unit that performs dynamic clustering and stores a dendrogram in the storage unit as a bottom-up index that can be searched from a lower layer to an upper layer.
  • the conventional text mining device displays the result of hierarchical cluster analysis on the screen using a tree diagram.
  • Such a text mining device has the problem that the user cannot intuitively understand the analysis result. For example, when setting the number of clusters to 4 in the analysis result shown in FIG. 15, the user sets a cutting line on the tree diagram as shown in FIG. However, the user cannot intuitively recognize the words included in each cluster simply by looking at such a tree diagram. Further, when the number of clusters is changed while the number of words is large, the user cannot intuitively understand how the words included in each cluster change.
  • The user also cannot tell which words are important.
  • When the text data to be analyzed is time-series data having information such as date and time, the user may want to know the temporal change of the analysis result, but the conventional text mining device cannot meet this request.
  • an object of the present invention is to provide a text mining method, a text mining program, and a text mining apparatus that display a result of hierarchical cluster analysis on a screen so that a user can intuitively understand the result.
  • A first aspect of the present invention is a text mining method for displaying an analysis result of text data on a screen, the method comprising: a text analysis step of performing a hierarchical cluster analysis on words extracted from the input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data in a group, and generates the screen data for displaying on the screen groups each including no more than the maximum number of data of the words included in the corresponding cluster.
  • the words included in the group are selected from the words included in the cluster corresponding to the group in descending order of appearance frequency.
  • the group has a size corresponding to a total appearance frequency of words included in a cluster corresponding to the group in the screen.
  • the words included in the group have a size corresponding to the appearance frequency of the words in the screen.
  • a sixth aspect of the present invention is the fifth aspect of the present invention.
  • the instruction input step receives an instruction to set the number of groups,
  • in the screen generation step, the screen data is generated based on the number of groups set in the instruction input step.
  • the instruction input step receives an instruction to set the maximum number of data
  • the screen generation step generates the screen data based on the maximum number of data set in the instruction input step.
  • the instruction input step receives an analysis target period setting instruction
  • the hierarchical cluster analysis is performed on words included in the part of the text data that falls within the analysis target period set in the instruction input step.
  • a ninth aspect of the present invention is the fifth aspect of the present invention.
  • the instruction input step receives an analysis purpose setting instruction
  • the hierarchical cluster analysis is performed by extracting words of a type corresponding to the analysis purpose set in the instruction input step from the text data.
  • a tenth aspect of the present invention is the fifth aspect of the present invention.
  • the instruction input step receives a word exclusion instruction
  • the hierarchical cluster analysis is performed by excluding the word instructed in the instruction input step.
  • An eleventh aspect of the present invention is the fifth aspect of the present invention,
  • the instruction input step receives a synonym registration instruction;
  • the hierarchical cluster analysis is performed by regarding the plurality of words specified in the instruction input step as the same word.
  • a twelfth aspect of the present invention is the fifth aspect of the present invention,
  • the instruction input step receives a compound word registration instruction;
  • the hierarchical cluster analysis is performed by merging a plurality of words specified in the instruction input step into one word.
  • the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.
  • A fourteenth aspect of the present invention is a text mining program for displaying an analysis result of text data on a screen, the program causing a computer, by a CPU using a memory, to execute: a text analysis step of performing a hierarchical cluster analysis on words extracted from the input text data; a screen generation step of generating screen data based on the analysis result of the text analysis step; and an analysis result display step of displaying a screen based on the screen data, wherein the screen generation step obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data in a group, and generates the screen data for displaying on the screen groups each including no more than the maximum number of data of the words included in the corresponding cluster.
  • a fifteenth aspect of the present invention is the fourteenth aspect of the present invention.
  • the words included in the group are selected from the words included in the cluster corresponding to the group in descending order of appearance frequency.
  • a sixteenth aspect of the present invention is the fifteenth aspect of the present invention,
  • the group has a size corresponding to a total appearance frequency of words included in a cluster corresponding to the group in the screen.
  • a seventeenth aspect of the present invention is the sixteenth aspect of the present invention.
  • the words included in the group have a size corresponding to the appearance frequency of the words in the screen.
  • An eighteenth aspect of the present invention is the fourteenth aspect of the present invention, wherein the program causes the computer to further execute an instruction input step of inputting an instruction from the user, and one of the text analysis step and the screen generation step is performed based on the instruction input in the instruction input step.
  • a nineteenth aspect of the present invention is the fourteenth aspect of the present invention.
  • the screen generation step generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.
  • A twentieth aspect of the present invention is a text mining device that displays an analysis result of text data on a screen, the device comprising: a text analysis unit that performs a hierarchical cluster analysis on words extracted from input text data; a screen generation unit that generates screen data based on the analysis result by the text analysis unit; and an analysis result display unit that displays a screen based on the screen data, wherein the screen generation unit obtains, from the analysis result, as many clusters as the number of groups, based on the number of groups and the maximum number of data in a group, and generates the screen data for displaying on the screen groups each including no more than the maximum number of data of the words included in the corresponding cluster.
  • the 21st aspect of the present invention is the 20th aspect of the present invention.
  • the words included in the group are selected from the words included in the cluster corresponding to the group in descending order of appearance frequency.
  • the group has a size corresponding to a total appearance frequency of words included in a cluster corresponding to the group in the screen.
  • the words included in the group have a size corresponding to the appearance frequency of the words in the screen.
  • The twenty-fourth aspect of the present invention is the twentieth aspect of the present invention, further including an instruction input unit for inputting an instruction from the user, wherein one of the text analysis unit and the screen generation unit operates based on the instruction input by the instruction input unit.
  • the screen generation unit generates screen data for displaying an analysis result screen including the group and an analysis setting screen for setting a display mode of the analysis result screen.
  • A group including words included in the cluster is displayed on the screen based on the result of performing the hierarchical cluster analysis on the words included in the text data. Further, the number of words included in the group is limited to the maximum number of data. Therefore, the user can intuitively understand the result of the hierarchical cluster analysis when viewing the screen.
  • words having a high appearance frequency among the words included in the cluster are displayed inside the group. Therefore, the user can easily recognize words that are included in each cluster and have a high appearance frequency.
  • the group has a size corresponding to the total appearance frequency of words included in the cluster in the screen. Therefore, the user can easily recognize a cluster having a large sum of appearance frequencies of words.
  • the word has a size corresponding to the frequency of the word in the screen. Therefore, the user can easily recognize words having a high appearance frequency.
  • the display mode of the result of hierarchical cluster analysis can be switched according to an instruction from the user.
  • the number of groups (number of clusters) displayed on the screen can be switched according to an instruction from the user.
  • the upper limit value of the number of words included in the group can be switched according to an instruction from the user.
  • the result of the hierarchical cluster analysis performed on the words included in the text data within the analysis target period designated by the user is displayed on the screen. Therefore, the user can easily recognize the temporal change in the result of the hierarchical cluster analysis.
  • the result of the hierarchical cluster analysis can be displayed on the screen by switching the type of the analysis target word according to the analysis purpose instructed by the user.
  • According to the tenth aspect of the present invention, it is possible to display on the screen the result of the hierarchical cluster analysis performed with the word instructed by the user excluded.
  • According to the eleventh aspect of the present invention, it is possible to display on the screen the result of the hierarchical cluster analysis performed by regarding a plurality of words designated by the user as the same word.
  • According to the twelfth aspect of the present invention, it is possible to display on the screen the result of the hierarchical cluster analysis performed by merging a plurality of words designated by the user into one word.
  • an analysis result screen and an analysis setting screen are displayed. Therefore, the user can easily switch the display mode of the result of the hierarchical cluster analysis using the analysis setting screen.
  • FIG. 1 is a block diagram showing the configuration of a text mining device according to an embodiment of the present invention. FIG. 2 is a block diagram showing the configuration of a computer that functions as the text mining device shown in FIG. 1. FIG. 3 is a diagram showing the display screen of the text mining device shown in FIG. 1. FIG. 4 is a flowchart showing the operation of the text mining device shown in FIG. 1.
  • the text mining method according to the present embodiment is typically executed using a computer.
  • the text mining program according to the present embodiment is a program for implementing a text mining method using a computer.
  • the text mining device according to the present embodiment is typically configured using a computer.
  • a computer that executes the text mining program functions as a text mining device.
  • FIG. 1 is a block diagram showing a configuration of a text mining apparatus according to an embodiment of the present invention.
  • a text mining apparatus 10 shown in FIG. 1 includes an instruction input unit 11, a text analysis unit 12, a screen generation unit 13, and an analysis result display unit 14.
  • the text mining device 10 receives text data 5 to be analyzed.
  • the text mining device 10 performs a hierarchical cluster analysis on the words extracted from the input text data 5 and displays the analysis result on the screen.
  • the outline of the operation of the text mining device 10 is as follows.
  • the instruction input unit 11 receives an instruction from the user.
  • the text analysis unit 12 extracts words from the input text data 5 and performs a hierarchical cluster analysis on the extracted words.
  • the screen generation unit 13 generates screen data based on the analysis result by the text analysis unit 12.
  • the analysis result display unit 14 displays a screen based on the screen data generated by the screen generation unit 13.
  • the instruction from the user input to the instruction input unit 11 includes setting of the number of groups, setting of the maximum number of data in the group, setting of the analysis target period, word exclusion, synonym registration, compound word registration, and the like.
  • When the text data 5 is time-series data having information such as date and time, the hierarchical cluster analysis is performed on words included in the text data within the analysis target period.
  • the screen generator 13 follows the number of groups and the maximum number of data in the group when generating screen data (details will be described later).
  • When the user inputs a new instruction, the instructed process is performed, after which the screen generation unit 13 generates new screen data and the analysis result display unit 14 displays a new screen.
  • The text mining device 10 switches the analysis mode of the text data 5 and the display mode of the analysis result in accordance with an instruction from the user.
  • FIG. 2 is a block diagram illustrating a configuration of a computer that functions as the text mining device 10.
  • the computer 20 illustrated in FIG. 2 includes a CPU 21, a main memory 22, a storage unit 23, an input unit 24, a display unit 25, a communication unit 26, and a recording medium reading unit 27.
  • a DRAM is used as the main memory 22.
  • the input unit 24 includes a keyboard 28 and a mouse 29, for example.
  • a liquid crystal display is used for the display unit 25.
  • the communication unit 26 is an interface circuit for wired communication or wireless communication.
  • the recording medium reading unit 27 is an interface circuit of the recording medium 30 that stores programs and the like.
  • a non-transitory recording medium such as a CD-ROM, a DVD-ROM, or a USB memory is used.
  • the storage unit 23 stores the text mining program 31 and the text data 5.
  • the text mining program 31 and the text data 5 may be received from a server or another computer using the communication unit 26, or may be read from the recording medium 30 using the recording medium reading unit 27.
  • the text mining program 31 and the text data 5 are copied and transferred to the main memory 22.
  • the CPU 21 processes the text data 5 stored in the main memory 22 by executing the text mining program 31 stored in the main memory 22 using the main memory 22 as a working memory.
  • the computer 20 functions as the text mining device 10.
  • the configuration of the computer 20 described above is merely an example, and the text mining apparatus 10 can be configured using an arbitrary computer.
  • FIG. 17 is a diagram showing words appearing in the drawing and its description. Each line in FIG. 17 describes a word (Japanese word) and the meaning of the word. In the following description, when referring to a Japanese word, the meaning of the word may be written in parentheses after the word.
  • the text data 5 may be data in any language.
  • FIG. 3 is a diagram showing a display screen of the text mining apparatus 10.
  • the display screen 40 shown in FIG. 3 includes an analysis result screen 41 and an analysis setting screen 42.
  • the analysis result screen 41 displays an analysis result by the text analysis unit 12.
  • the analysis setting screen 42 displays graphical user interface components for setting the analysis mode in the text analysis unit 12 and the characteristics of the screen data generated by the screen generation unit 13.
  • the cluster displayed on the screen is also called a group.
  • the user designates the number of groups (number of clusters) and the maximum number of data in the group (upper limit value of the number of words included in the group) using the instruction input unit 11.
  • the former is m and the latter is n.
  • words included in the text data 5 are classified into m clusters, and each cluster includes one or more words.
  • the analysis result screen 41 displays m groups, and a word is displayed inside each group.
  • the group is displayed using a cloud figure, and the words included in the group are displayed inside the ellipse area.
  • The analysis setting screen 42 displays a first slider and two first buttons (marked “+” or “-”) for setting the number of groups m, a second slider and two second buttons for setting the maximum number of data n in the group, and four boxes and two third buttons (marked with a left-pointing or right-pointing arrow) for setting the analysis target period.
  • The user specifies the number of groups m by operating the mouse 29 to move the knob of the first slider to the left or right, or by pressing a first button.
  • The number of groups m increases when the first button marked “+” is pressed, and decreases when the first button marked “-” is pressed.
  • The initial value of the number of groups m is set to, for example, the square root of the number of word types included in the analysis result by the text analysis unit 12, or an integer close to it. For example, when the analysis result includes 16 types of words, the initial value of the number of groups m is set to 4.
  • the user operates the mouse 29 to move the knob of the second slider to the left or right or press the second button to instruct the maximum number of data n in the group.
  • the maximum number of data n in the group increases or decreases when the second button is pressed.
  • the initial value of the maximum number of data n in the group is set to 5, for example.
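  • The initial values described above can be sketched in a few lines. This is a minimal illustration only; the function name `initial_group_count`, the constant name, and the use of rounding to pick "an integer close to" the square root are assumptions, not details given in the description.

```python
import math

# Example initial value of n given in the description ("set to 5, for example").
DEFAULT_MAX_WORDS_PER_GROUP = 5


def initial_group_count(num_word_types: int) -> int:
    """Initial number of groups m: an integer close to sqrt(number of word types).

    Rounding to the nearest integer and the lower bound of 1 are assumptions.
    """
    return max(1, round(math.sqrt(num_word_types)))
```

  • With 16 word types this yields m = 4, which matches the example in the description.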
  • When the text data 5 is time-series data, the user specifies the analysis target period by operating the keyboard 28 or the mouse 29 to enter the date and time in the four boxes, or by pressing a third button.
  • The analysis target period moves toward the past by a predetermined amount (for example, one month) when the third button with the left-pointing arrow is pressed, and moves in the opposite direction by the predetermined amount when the third button with the right-pointing arrow is pressed.
  • the initial value of the analysis target period is set to a period from the oldest time to the newest time of the text data 5, for example. If the text data 5 is not time-series data, the user cannot specify the analysis target period.
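  • Moving the analysis target period with the arrow buttons might look like the following sketch. It assumes the period is a pair of calendar dates (the description also mentions times), that the step is one month, and that a day that does not exist in the target month is clamped to the last day of that month; `shift_months`, `shift_period`, and `in_period` are hypothetical names.

```python
import calendar
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Move a date by whole months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shift_period(start: date, end: date, months: int):
    """Left arrow: months = -1 (toward the past); right arrow: months = +1."""
    return shift_months(start, months), shift_months(end, months)


def in_period(record_date: date, start: date, end: date) -> bool:
    """True when a dated record falls inside the analysis target period."""
    return start <= record_date <= end
```

  • Only records for which `in_period` is true would then be passed on to word extraction.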
  • the analysis result screen 41 displays 1 or more and m or less groups, and 1 or more and n or less words are displayed in each group.
  • Each group is displayed larger in the screen as the sum of the appearance frequencies of the words included in the corresponding cluster is larger.
  • n words with high appearance frequency are displayed inside the group.
  • a word included in a group and an elliptical area that includes the word are displayed larger in the screen as the appearance frequency of the word is higher.
  • Each group is given a name. As the name of the group, a word having the highest appearance frequency among the words included in the cluster is used.
  • The name of the group is displayed with an underline inside the group. When a word cannot be displayed inside the elliptical area, the symbol “...” is displayed instead of the word.
  • The analysis result screen 41 displays a third slider and two fourth buttons (marked “+” or “-”) for specifying the zoom magnification.
  • the user sets the zoom magnification by operating the mouse 29 and moving the knob of the third slider to the left or right or pressing the fourth button.
  • a group including words is displayed enlarged or reduced according to the set zoom magnification.
  • the initial value of the zoom magnification is set to 100%. All the groups are displayed on the analysis result screen 41 in the initial state.
  • the contents of the analysis result screen 41 change accordingly.
  • When the user instructs word exclusion, synonym registration, or compound word registration on the analysis result screen 41, the contents of the analysis result screen 41 change accordingly.
  • An excluded word list storing words to be excluded, a synonym list storing words to be processed as synonyms, and a compound word list storing words to be processed as compound words are referred to.
  • In the synonym list, a plurality of words having the same meaning (or almost the same meaning) and one word representing these words are stored in association with each other.
  • In the compound word list, a plurality of words that are combined into one compound word and the compound word obtained by connecting these words are stored in association with each other.
  • In the synonym list, for example, “daigakusei (university student)” and “gakusei (student)” are stored in association with “daigakusei”, which represents both.
  • the text mining device 10 may have a plurality of synonym lists and a plurality of compound word lists.
  • FIG. 4 is a flowchart showing the operation of the text mining apparatus 10.
  • FIG. 5 is a flowchart showing details of the screen data generation process (step S111 shown in FIG. 4) of the text mining apparatus 10.
  • The input unit 24 and the CPU 21 that executes step S113 function as the instruction input unit 11.
  • The CPU 21 that executes steps S109 to S110 functions as the text analysis unit 12.
  • The CPU 21 that executes step S111 functions as the screen generation unit 13.
  • the display unit 25 and the CPU 21 that executes step S112 function as the analysis result display unit 14.
  • the operation of the text mining apparatus 10 will be described with reference to FIGS. 4 and 5.
  • the CPU 21 displays the data designation screen 51 shown in FIG. 6 on the display unit 25 (step S101).
  • On the data designation screen 51, a box for specifying a file name and a box for specifying a folder name are displayed.
  • the user designates the text data 5 to be analyzed by designating the file name or folder name on the data designation screen 51.
  • the text data 5 may be stored in the storage unit 23 such as a hard disk, or may be stored in a server or another computer connected using the communication unit 26.
  • FIG. 7 is a diagram illustrating an example of the text data 5.
  • the text data shown in FIG. 7 is data of a report created by a college student, and is time-series data having date information.
  • The text data shown in FIG. 7 contains, in order from the top, “About the relationship between university students and society in this lecture ...”, “Generally, university students graduate before entering the society ...”, “We students have high tuition fees. Aware that you're learning ...”, and “Student life is a valuable time for confidence to grow. Also ...”.
  • the type of text data 5 analyzed by the text mining apparatus 10 is arbitrary.
  • the CPU 21 displays the purpose designation screen 52 shown in FIG. 8 on the display unit 25 (step S103).
  • On the purpose designation screen 52, three radio buttons corresponding to content, feature, and reputation are displayed. The user operates the mouse 29 and presses one of the radio buttons to select the analysis purpose from content, feature, and reputation.
  • the CPU 21 receives the analysis purpose designated using the purpose designation screen 52. Thereby, the analysis purpose is input to the text mining apparatus 10 (step S104).
  • the CPU 21 displays the synonym list selection screen 53 shown in FIG. 9 on the display unit 25 (step S105).
  • the synonym list selection screen 53 displays the names of synonym lists that the text mining apparatus 10 has and the synonyms registered in each synonym list.
  • the user operates the mouse 29 to select one of the synonym lists on the synonym list selection screen 53, thereby specifying the synonym list to be used. Thereby, in the text mining device 10, a synonym list is selected (step S106).
  • the CPU 21 displays the compound word list selection screen 54 shown in FIG. 10 on the display unit 25 (step S107).
  • The compound word list selection screen 54 displays the names of the compound word lists that the text mining apparatus 10 has and the compound words registered in each compound word list.
  • The user operates the mouse 29 to select one of the compound word lists on the compound word list selection screen 54, thereby specifying the compound word list to be used.
  • Thereby, in the text mining device 10, a compound word list is selected (step S108).
  • The CPU 21 extracts, from the text data within the analysis target period in the text data 5 input in step S102, words of a type corresponding to the analysis purpose specified in step S104, taking the excluded word list, the synonym list, and the compound word list into account (step S109).
  • When the analysis purpose is “content”, the CPU 21 extracts nouns, proper nouns, place names, and personal names from the text data 5.
  • When the analysis purpose is “feature”, the CPU 21 extracts nouns, proper nouns, sa-changing nouns, and verbs from the text data 5.
  • When the analysis purpose is “reputation”, the CPU 21 extracts adjectives, adjective verbs, and moving verbs from the text data 5.
  • the text mining apparatus 10 may support analysis purposes other than the above three. Further, the CPU 21 may extract different types of words depending on each analysis purpose.
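  • The relation between analysis purpose and extracted word types can be expressed as a simple lookup. The sketch below assumes the text has already been split into (word, part-of-speech) pairs by some morphological analyzer; the tag strings reuse the wording of the description, and `POS_BY_PURPOSE` and `select_words` are hypothetical names.

```python
# Word types extracted for each analysis purpose, as listed in the description.
POS_BY_PURPOSE = {
    "content": {"noun", "proper noun", "place name", "personal name"},
    "feature": {"noun", "proper noun", "sa-changing noun", "verb"},
    "reputation": {"adjective", "adjective verb", "moving verb"},
}


def select_words(tagged_words, purpose):
    """Keep only the words whose part of speech matches the analysis purpose.

    tagged_words is a list of (word, part_of_speech) pairs; the tagging itself
    is outside the scope of this sketch.
    """
    allowed = POS_BY_PURPOSE[purpose]
    return [word for word, pos in tagged_words if pos in allowed]
```

  • Supporting a further analysis purpose would then amount to adding one more entry to the table.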
  • When the user has instructed an analysis target period, the CPU 21 extracts words only from the part of the text data 5 that falls within that period when executing step S109.
  • When a word W1 is stored in the excluded word list, the CPU 21 ignores all the words W1 included in the text data 5 when executing step S109.
  • When the synonym list stores a word W3 in association with a representative word W2, the CPU 21 processes all the words W3 included in the text data 5 as the word W2 when executing step S109.
  • When the compound word list stores words W4 and W5 in association with a compound word W6, the CPU 21 processes all the consecutive words W4 and W5 included in the text data 5 as the word W6 when executing step S109.
  • Next, the CPU 21 performs a hierarchical cluster analysis on the words extracted in step S109 (step S110).
  • In step S110, for example, the CPU 21 obtains the similarity between two words based on the distance between the two words in the text data 5 (how far apart the two words appear).
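  • The handling of the three lists during word extraction can be illustrated as follows. This is a sketch under stated assumptions: the lists are held as a set and two dictionaries, compound words consist of exactly two consecutive words, and compound merging is applied before synonym replacement and exclusion (the description does not specify the order). The function name `normalize` and the sample entries are hypothetical.

```python
def normalize(tokens, excluded, synonyms, compounds):
    """Apply compound-word, synonym, and excluded-word lists to a token sequence.

    excluded:  set of words W1 to ignore
    synonyms:  dict mapping a word W3 to its representative word W2
    compounds: dict mapping a pair (W4, W5) of consecutive words to the compound W6
    """
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            word = compounds[pair]   # consecutive W4, W5 are treated as W6
            i += 2
        else:
            word = tokens[i]
            i += 1
        word = synonyms.get(word, word)  # W3 is treated as W2
        if word not in excluded:         # W1 is ignored
            result.append(word)
    return result
```

  • For instance, with the hypothetical entries `{("daigaku", "sei"): "daigakusei"}` as the compound word list, `{"gakusei": "daigakusei"}` as the synonym list, and `{"to"}` as the excluded word list, the tokens `["daigaku", "sei", "to", "gakusei"]` reduce to two occurrences of "daigakusei".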
  • the CPU 21 performs hierarchical cluster analysis using a predetermined method (for example, the shortest distance method, the longest distance method, the group average method, the decimal method, the Ward method, etc.) based on the obtained similarity between words.
  • CPU21 calculates
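  • A distance-based hierarchical cluster analysis of this kind can be sketched in plain Python. The sketch makes several assumptions that are not stated in the description: the distance between two words is the smallest gap between their positions in the token sequence, the group average method is used for merging, and instead of building the full tree diagram and cutting it, merging simply stops once m clusters remain (which gives the same clusters as cutting the tree at m clusters). `word_distances` and `cluster_words` are hypothetical names.

```python
from itertools import combinations


def word_distances(tokens):
    """Distance between two words = smallest gap between their positions (assumed)."""
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)
    dist = {}
    for a, b in combinations(sorted(positions), 2):
        gap = min(abs(i - j) for i in positions[a] for j in positions[b])
        dist[a, b] = dist[b, a] = gap
    return dist


def cluster_words(words, dist, m):
    """Agglomerative clustering with the group average method, stopped at m clusters."""
    clusters = [[w] for w in words]

    def linkage(c1, c2):
        # Group average: mean distance over all cross-cluster word pairs.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > m:
        # Find the closest pair of clusters (i < j) and merge j into i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

  • Swapping `linkage` for the minimum or maximum pairwise distance would give the shortest distance method or the longest distance method, respectively.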
  • Next, the CPU 21 generates screen data for displaying the analysis result based on the result of the hierarchical cluster analysis obtained in step S110 (step S111).
  • In step S111, the CPU 21 performs the process shown in FIG. 5.
  • The CPU 21 sets the number of groups to m and the maximum number of data in the group to n (step S201).
  • The CPU 21 sets the number of clusters to m for the result of the hierarchical cluster analysis and obtains m clusters (step S202).
  • The CPU 21 calculates, for each cluster, the total appearance frequency of the words included in the cluster (step S203).
  • the CPU 21 determines the display size of each group based on the total appearance frequency obtained in step S203 (step S204). In step S204, the larger the total appearance frequency of words included in the cluster, the larger the group display size.
  • the CPU 21 selects a word to be displayed from the words included in the cluster (step S205).
  • In step S205, n or fewer words are selected in descending order of appearance frequency from the words included in each cluster.
  • Next, the CPU 21 determines the display size of each word based on the appearance frequency of each word selected in step S205 (step S206).
  • In step S206, the word display size is determined to be larger for words having a higher appearance frequency.
  • Next, the CPU 21 generates screen data for displaying the result of the hierarchical cluster analysis (step S207).
  • the screen data generated in step S207 includes m groups (represented by cloud graphics) having the size determined in step S204. Each group includes n words or less having the size determined in step S206. The word is displayed inside the group on the screen.
  • After executing step S207, the CPU 21 ends the screen data generation process.
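The screen data generation in steps S203 to S207 can be sketched as below, taking the m clusters from step S202 as given. The source only says that a larger frequency gives a larger display size, so the linear sizing, the `base_group` and `base_word` parameters, and the dictionary layout of the output are assumptions for illustration.

```python
def build_screen_data(clusters, freq, n, base_group=100, base_word=10):
    """Sketch of steps S203 to S207: one display group per cluster.

    clusters : list of sets of words (the m clusters from step S202)
    freq     : dict mapping word -> appearance frequency
    n        : maximum number of data (words) in a group
    """
    groups = []
    for cluster in clusters:
        # S203: total appearance frequency of the words in the cluster
        total = sum(freq[w] for w in cluster)
        # S205: pick at most n words in descending order of frequency
        shown = sorted(cluster, key=lambda w: (-freq[w], w))[:n]
        groups.append({
            # S204: larger total frequency -> larger group (assumed linear)
            "size": base_group + total,
            # S206: larger frequency -> larger word (assumed linear)
            "words": [{"text": w, "size": base_word + freq[w]} for w in shown],
        })
    # S207: the list stands in for the screen data passed to the display
    return groups
```

Ties in frequency are broken alphabetically here only to keep the output deterministic.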
  • the CPU 21 causes the display unit 25 to display a screen based on the screen data generated in step S111 (step S112).
  • the CPU 21 receives an instruction from the user (step S113).
  • the CPU 21 proceeds to one of steps S115 to S120 according to the type of instruction received in step S113 (step S114).
  • step S115 when the instruction
  • the CPU 21 sets the number of groups m to a value designated by the user (step S115), and proceeds to step S111. Thereafter, screen data is generated based on the set number of groups m, and a new screen is displayed. Thereby, an analysis result screen including the specified number of groups is displayed.
  • step S116 when the instruction
  • the CPU 21 sets the maximum number of data n in the group to a value designated by the user (step S116), and proceeds to step S111. Thereafter, screen data is generated based on the maximum number of data n in the set group, and a new screen is displayed. Thereby, an analysis result screen in which the number of words included in each group is limited to a specified value or less is displayed.
  • step S117 when the instruction
  • the CPU 21 sets the analysis target period to a period designated by the user (step S117), and proceeds to step S109. Thereafter, hierarchical cluster analysis is performed with reference to the set analysis target period, screen data for displaying a new analysis result is generated, and a new screen is displayed. As a result, the result of the hierarchical cluster analysis for the words included in the text data within the specified analysis target period is displayed on the screen.
  • FIG. 11A is a diagram showing an analysis result screen before setting the analysis target period.
  • FIG. 11B is a diagram illustrating an analysis result screen after setting the analysis target period.
  • the analysis result screen 61 before setting, shown in FIG. 11A, displays the result of the hierarchical cluster analysis for the words included in the text data from 0:00 on January 1, 2014 to 24:00 on December 31, 2015, among the input text data 5.
  • the analysis result screen 62 after setting, shown in FIG. 11B, displays the result of the hierarchical cluster analysis for the words included in the text data from 0:00 on March 1, 2014 to 24:00 on September 30, 2014, among the input text data 5.
  • the display content of the analysis result screen 61 and the display content of the analysis result screen 62 are different. The user can easily recognize the temporal change in the result of the hierarchical cluster analysis by looking at the analysis result screens before and after setting the analysis target period.
  • step S118 when the instruction
  • the CPU 21 adds the designated word to the excluded word list (step S118), and proceeds to step S109.
  • Thereafter, hierarchical cluster analysis is performed with the designated word excluded, screen data for displaying a new analysis result is generated, and a new screen is displayed. As a result, the result of the hierarchical cluster analysis excluding the designated word is displayed on the screen.
  • FIG. 12A is a diagram showing an analysis result screen before word exclusion.
  • FIG. 12B is a diagram showing an analysis result screen after word exclusion. The user operates the mouse 29 to select a word to be excluded, and then instructs word exclusion.
  • On the analysis result screen 63 before word exclusion shown in FIG. 12A, “shakai (society)” is selected, and “word exclusion” is selected from the menu. Thereafter, the result of the hierarchical cluster analysis excluding “shakai” is displayed on the screen.
  • On the analysis result screen after word exclusion shown in FIG. 12B, “shingaku (admission)” is displayed instead of “shakai”. Among the words included in the same cluster as “shakai”, “shingaku” has the next highest appearance frequency after the five words displayed on the analysis result screen 63.
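The replacement of “shakai” by “shingaku” follows from selecting the most frequent words after the exclusion. The sketch below assumes, for simplicity, that cluster membership stays the same after the exclusion (in the described flow the analysis is rerun from step S109, so clusters may change); the function name, the frequencies, and the placeholder words `w1` to `w4` are hypothetical.

```python
def displayed_words(cluster_freq, n, excluded=frozenset()):
    """Return up to n words of one cluster, most frequent first, skipping excluded words."""
    remaining = {w: f for w, f in cluster_freq.items() if w not in excluded}
    return sorted(remaining, key=lambda w: (-remaining[w], w))[:n]

# Hypothetical frequencies for one cluster; only "shakai" and "shingaku" come from the text.
cluster = {"shakai": 9, "w1": 8, "w2": 7, "w3": 6, "w4": 5, "shingaku": 4}
before = displayed_words(cluster, 5)
after = displayed_words(cluster, 5, excluded={"shakai"})
```

With these made-up counts, `before` holds “shakai” plus the four placeholders, and `after` holds the four placeholders plus “shingaku”.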
  • When the instruction received in step S113 is “synonym registration”, the CPU 21 proceeds to step S119. In this case, the CPU 21 adds the instructed word to the in-use synonym list (step S119), and proceeds to step S109. Thereafter, hierarchical cluster analysis is performed in consideration of the instructed synonym, screen data for displaying a new analysis result is generated, and a new screen is displayed. As a result, the result of the hierarchical cluster analysis using the instructed word as a synonym is displayed on the screen.
  • FIG. 13A is a diagram showing an analysis result screen before synonym registration.
  • FIG. 13B is a diagram illustrating an analysis result screen after synonym registration is performed. The user operates the mouse 29 to select a plurality of words to be registered as synonyms, and then instructs the synonym registration.
  • On the analysis result screen 65 before synonym registration shown in FIG. 13A, “daigakusei (university student)” and “gakusei (student)” are selected, and “synonym registration” is selected from the menu. Thereafter, the result of the hierarchical cluster analysis using “daigakusei” and “gakusei” as synonyms is displayed on the screen.
  • On the analysis result screen after synonym registration shown in FIG. 13B, “daigakusei” is displayed in a larger size than on the analysis result screen 65, and “shinku (admission)” is displayed instead of “gakusei”. “Daigakusei” is displayed larger than on the analysis result screen 65 in accordance with the sum of the appearance frequency of “daigakusei” and the appearance frequency of “gakusei”.
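The larger display size after synonym registration comes from adding the appearance frequencies of the registered words. A minimal sketch, assuming a plain word-to-count dictionary and a hypothetical `merge_synonyms` helper (the counts in the usage note are made up):

```python
from collections import Counter

def merge_synonyms(freq, synonyms):
    """Sum appearance frequencies of words registered as synonyms.

    freq     : dict mapping word -> appearance frequency
    synonyms : dict mapping a word to the representative word it is counted as
    """
    merged = Counter()
    for word, count in freq.items():
        merged[synonyms.get(word, word)] += count
    return merged
```

For example, with counts of 4 for “daigakusei” and 3 for “gakusei”, registering “gakusei” as a synonym of “daigakusei” gives “daigakusei” a combined count of 7, which a frequency-based sizing rule would then draw larger.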
  • step S120 when the instruction
  • the CPU 21 adds the instructed word to the compound word list in use (step S120), and proceeds to step S109.
  • Thereafter, hierarchical cluster analysis is performed in consideration of the instructed compound word, screen data for displaying a new analysis result is generated, and a new screen is displayed. As a result, the result of hierarchical cluster analysis using the specified word as a compound word is displayed on the screen.
  • FIG. 14A is a diagram showing an analysis result screen before compound word registration.
  • FIG. 14B is a diagram illustrating an analysis result screen after performing compound word registration. The user operates the mouse 29 to select a plurality of words to be registered as a compound word, and then instructs compound word registration.
  • On the analysis result screen 67 before compound word registration shown in FIG. 14A, “nintai” and “tsuyoi” are selected, and “compound word registration” is selected from the menu. Thereafter, the result of the hierarchical cluster analysis using “nintai” and “tsuyoi” as a compound word is displayed on the screen.
  • “nintaizuyoi” (patient)
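The branching in steps S114 to S120 amounts to updating one setting and then resuming from either step S111 (screen data only) or step S109 (full re-analysis). The sketch below is one possible arrangement under stated assumptions: the `Settings` class, the instruction keys such as `"groups"`, and the default values are hypothetical, while the step numbers follow the text above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Settings:
    m: int = 5                      # number of groups (default is hypothetical)
    n: int = 5                      # maximum number of data in a group (hypothetical)
    period: Optional[tuple] = None  # analysis target period (start, end)
    excluded: set = field(default_factory=set)
    synonyms: list = field(default_factory=list)
    compounds: list = field(default_factory=list)

def handle(settings, kind, value):
    """Apply one user instruction and return the step to resume from."""
    if kind == "groups":            # S115: only the screen data changes
        settings.m = value
        return "S111"
    if kind == "max_data":          # S116: only the screen data changes
        settings.n = value
        return "S111"
    if kind == "period":            # S117
        settings.period = value
    elif kind == "exclude":         # S118
        settings.excluded.add(value)
    elif kind == "synonym":         # S119
        settings.synonyms.append(tuple(value))
    elif kind == "compound":        # S120
        settings.compounds.append(tuple(value))
    else:
        raise ValueError(f"unknown instruction: {kind}")
    return "S109"                   # word extraction and clustering are redone
```

Returning the resume step keeps the distinction visible: changes to m or n reuse the existing cluster analysis, while the other four instructions change which words are analyzed.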
  • the text mining method according to the present embodiment includes a text analysis step of performing hierarchical cluster analysis on words extracted from input text data, a screen generation step of generating screen data based on the analysis result of the text analysis step, and an analysis result display step of displaying a screen based on the screen data.
  • the screen generation step obtains m clusters from the analysis result based on the number m of groups and the maximum number n of data in the group, and generates screen data for displaying on the screen groups each including n or fewer of the words included in a cluster.
  • a group including words included in a cluster is displayed on the screen based on a result of hierarchical cluster analysis performed on words included in text data. Further, the number of words included in the group is limited to n or less. Therefore, the user can intuitively understand the result of the hierarchical cluster analysis when viewing the screen.
  • the words included in the group are selected from the words included in the cluster corresponding to the group in descending order of appearance frequency. For this reason, words having a high appearance frequency among the words included in the cluster are displayed inside the group. Therefore, the user can easily recognize words that are included in each cluster and have a high appearance frequency.
  • the group has a size corresponding to the total appearance frequency of words included in the cluster corresponding to the group in the screen. Therefore, the user can easily recognize a cluster having a large sum of appearance frequencies of words. Further, the words included in the group have a size corresponding to the appearance frequency of the words in the screen. Therefore, the user can easily recognize words having a high appearance frequency.
  • the text mining method includes an instruction input step for inputting an instruction from the user, and either the text analysis step or the screen generation step is executed based on the instruction input in the instruction input step. Therefore, the display mode of the result of the hierarchical cluster analysis can be switched according to the instruction from the user.
  • the instruction input step receives a setting instruction for the number of groups m, and the screen generation step generates screen data based on the number of groups m specified in the instruction input step. Thereby, the number of areas (number of clusters) displayed on the screen can be switched in accordance with an instruction from the user.
  • the instruction input step receives the maximum data number n in the group, and the screen generation step generates screen data based on the maximum data number n in the group specified in the instruction input step. Thereby, the number of words displayed in the area can be switched according to an instruction from the user.
  • the instruction input step receives a setting instruction for the analysis target period, and the text analysis step performs hierarchical cluster analysis on the words included in the part of the text data that falls within the analysis target period specified in the instruction input step. Therefore, the result of the hierarchical cluster analysis performed on the words included in the text data within the analysis target period designated by the user is displayed on the screen, and the user can easily recognize the temporal change in the result of the hierarchical cluster analysis.
  • the instruction input step receives an analysis purpose setting instruction, and the text analysis step extracts a word of a type corresponding to the analysis purpose set in the instruction input step from the text data 5 and performs hierarchical cluster analysis. As a result, the result of the hierarchical cluster analysis can be displayed on the screen by switching the type of word to be analyzed according to the analysis purpose instructed by the user.
  • the instruction input step receives a word exclusion instruction, and the text analysis step excludes the word specified in the instruction input step and performs hierarchical cluster analysis.
  • the instruction input step receives a synonym registration instruction, and the text analysis step regards the plurality of words specified in the instruction input step as the same word and performs hierarchical cluster analysis.
  • the instruction input step receives a compound word registration instruction, and the text analysis step merges a plurality of words specified in the instruction input step into one word and performs hierarchical cluster analysis. Thereby, it is possible to display on the screen the result of performing a hierarchical cluster analysis by merging a plurality of words designated by the user into one word.
  • the screen generation step generates screen data for displaying an analysis result screen including a group and an analysis setting screen for setting the display mode of the analysis result screen. Therefore, an analysis result screen and an analysis setting screen are displayed. Therefore, the user can easily switch the display mode of the result of the hierarchical cluster analysis using the analysis setting screen.
  • the text mining program 31 according to the present embodiment and the text mining apparatus 10 according to the present embodiment have the same configuration as the text mining processing method according to the present embodiment, and have the same effects.
  • In each of these, groups containing the words included in a cluster, limited to at most the maximum number of data in a group, are displayed on the screen. Therefore, the user can intuitively understand the result of the hierarchical cluster analysis when viewing the screen.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

In text analysis steps (S109 to S110), hierarchical cluster analysis is performed on words extracted from input text data. In a screen generation step (S111), m clusters are obtained from the analysis result of the text analysis steps based on the number of groups m and the maximum number n of data in each group, and screen data is generated for displaying, on a screen, groups each including no more than n of the words in the clusters. In an analysis result display step (S112), the screen is displayed based on the generated screen data. The result of the hierarchical cluster analysis is thus displayed on the screen in a way that the user can understand intuitively.
PCT/JP2017/020922 2016-07-25 2017-06-06 Text mining method, text mining program, and text mining apparatus WO2018020842A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780043375.8A CN109478191B (zh) 2016-07-25 2017-06-06 Text mining method, recording medium, and text mining apparatus
KR1020197000933A KR102180487B1 (ko) 2016-07-25 2017-06-06 Text mining method, text mining program, and text mining apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-145065 2016-07-25
JP2016145065A JP6794162B2 (ja) 2016-07-25 2016-07-25 Text mining method, text mining program, and text mining apparatus

Publications (1)

Publication Number Publication Date
WO2018020842A1 (fr) 2018-02-01

Family

ID=61015910

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/020922 WO2018020842A1 (fr) Text mining method, text mining program, and text mining apparatus 2016-07-25 2017-06-06

Country Status (5)

Country Link
JP (1) JP6794162B2 (fr)
KR (1) KR102180487B1 (fr)
CN (1) CN109478191B (fr)
TW (1) TWI686716B (fr)
WO (1) WO2018020842A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309260A (zh) * 2018-03-20 2019-10-08 株式会社斯库林集团 Text mining method, text mining storage medium, and text mining apparatus
WO2021171373A1 (fr) * 2020-02-25 2021-09-02 日本電気株式会社 Item classification support system, method, and program

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11636144B2 (en) 2019-05-17 2023-04-25 Aixs, Inc. Cluster analysis method, cluster analysis system, and cluster analysis program
JPWO2022130547A1 (fr) * 2020-12-16 2022-06-23

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
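JPH0991314A (ja) * 1995-07-14 1997-04-04 Fuji Xerox Co Ltd Information search device
JP2000227917A (ja) * 1999-02-05 2000-08-15 Agency Of Ind Science & Technol Thesaurus browsing system and method, and recording medium storing the processing program therefor
JP2003044491A (ja) * 2001-07-30 2003-02-14 Toshiba Corp Knowledge analysis system, and analysis condition setting method, analysis condition storing method, and re-analysis processing method in the system
JP2005107688A (ja) * 2003-09-29 2005-04-21 Nippon Telegr & Teleph Corp <Ntt> Information display method and system, and information display program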

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6611825B1 (en) * 1999-06-09 2003-08-26 The Boeing Company Method and system for text mining using multidimensional subspaces
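CN1934570B (zh) * 2004-03-18 2012-05-16 日本电气株式会社 Text mining device and method thereof
KR20090069874A (ko) * 2007-12-26 2009-07-01 한국과학기술정보연구원 Method and system for selecting keywords and similarity coefficients for knowledge map analysis, and recording medium storing a computer program for the method
JP5022319B2 (ja) * 2008-08-04 2012-09-12 日本電信電話株式会社 Text mining device, method, program, and recording medium therefor
JP5439261B2 (ja) 2010-04-01 2014-03-12 日本電信電話株式会社 Clustering device, clustering method, and clustering program
JP5545876B2 (ja) 2011-01-17 2014-07-09 日本電信電話株式会社 Query providing device, query providing method, and query providing program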
US9477704B1 (en) * 2012-12-31 2016-10-25 Teradata Us, Inc. Sentiment expression analysis based on keyword hierarchy
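TW201516713A (zh) * 2013-10-16 2015-05-01 Chunghwa Telecom Co Ltd Document classification method based on group feature values
CN104142918B (zh) * 2014-07-31 2017-04-05 天津大学 Short text clustering and hot topic extraction method based on TF-IDF features
CN104504024B (zh) * 2014-12-11 2018-09-07 中国科学院计算技术研究所 Keyword mining method and system based on microblog content
CN105550365A (zh) * 2016-01-15 2016-05-04 中国科学院自动化研究所 Visual analysis system based on a text topic model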

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309260A (zh) * 2018-03-20 2019-10-08 株式会社斯库林集团 Text mining method, text mining storage medium, and text mining apparatus
TWI736860B (zh) * 2018-03-20 2021-08-21 日商斯庫林集團股份有限公司 Text mining method, recording medium storing a text mining program, and text mining apparatus
CN110309260B (zh) * 2018-03-20 2023-07-18 株式会社斯库林集团 Text mining method, text mining storage medium, and text mining apparatus
WO2021171373A1 (fr) * 2020-02-25 2021-09-02 日本電気株式会社 Item classification support system, method, and program
JP7456486B2 (ja) 2020-02-25 2024-03-27 日本電気株式会社 Item classification support system, method, and program

Also Published As

Publication number Publication date
KR102180487B1 (ko) 2020-11-18
CN109478191B (zh) 2022-04-08
JP2018018118A (ja) 2018-02-01
CN109478191A (zh) 2019-03-15
TWI686716B (zh) 2020-03-01
KR20190018480A (ko) 2019-02-22
TW201807597A (zh) 2018-03-01
JP6794162B2 (ja) 2020-12-02

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17833852; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20197000933; Country of ref document: KR; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 17833852; Country of ref document: EP; Kind code of ref document: A1)