WO2019058698A1

WO2019058698A1 - Suggestion generation device, suggestion generation program and suggestion generation method

Info

Publication number: WO2019058698A1
Application number: PCT/JP2018/024841
Authority: WO
Inventors: 明子吉田; 清孝粕渕; 隆夫吉和
Original assignee: 株式会社Ｓｃｒｅｅｎホールディングス
Priority date: 2017-09-20
Filing date: 2018-06-29
Publication date: 2019-03-28
Also published as: JP6967412B2; TW201915785A; TWI703453B; JP2019057017A

Abstract

The present invention reliably presents a word related to an input word with high accuracy. The present invention discloses suggestion generation, wherein topic classification is performed on a morpheme analysis-completed text, and a topic word, which belongs to each topic, is extracted. A feature degree or the like of each topic word is calculated. A belonging topic word, which belongs to each topic, is specified. A topic to be extracted is extracted so that an input word belongs to each topic to be extracted. A score of each candidate word is calculated which indicates the intensity of relevance between the input word and each candidate word of a plurality of candidate words, which belong to the topic to be extracted. The belonging topic is specified so that each candidate word belongs to each belonging topic. A score of each candidate word is calculated from the feature degree or the like of each candidate word which has been calculated for the belonging topic. The plurality of candidate words are presented in order of intensity of relevance that is represented by the score of each candidate word.

Description

SUGGESTION GENERATION DEVICE, SUGGESTION GENERATION PROGRAM, AND SUGGESTION GENERATION METHOD

The present invention relates to a suggestion generating device, a suggestion generating program, and a suggestion generating method for presenting words related to an input word.

When a text is created or a search is performed on a text, a suggestion that presents words associated with an input word may be generated.

A suggestion may be generated by extracting words from the user's search history and displaying the extracted words, or by extracting text including the input word from the text to be searched, extracting words from the extracted text, and displaying the extracted words. The techniques described in Patent Documents 1 and 2 are examples of the former, and the technique described in Patent Document 3 is an example of the latter.

In the technique described in Patent Document 1, the search query history is stored as search query candidates, and among the stored search query candidates, those matching the user attribute are presented (paragraphs 0031 and 0032).

In the technique described in Patent Document 2, a combination of a search query and a re-search query is extracted from a search log database, and a score indicating the degree of association between the search query and the re-search query is calculated for the extracted combination. A predetermined number of re-search queries are extracted as suggestion queries in descending order of score from the re-search queries corresponding to the received search queries (paragraphs 0026, 0030 and 0034). Further, the co-occurrence rate of the search query and the re-search query is calculated, and the combination is excluded when the co-occurrence rate is equal to or more than a predetermined value (paragraphs 0027 and 0029).

In the technique described in Patent Document 3, document data files including a designated keyword are retrieved from among the document data files to be searched, document units including the designated keyword are taken out of the retrieved document data files, and words are extracted from them. Word relation data in which the extracted words are arranged in time order is created, and the word lists of the created word relation data are combined and displayed in order of document creation time (paragraph 0040).

Patent Document 1: JP 2015-106354 A
Patent Document 2: Japanese Unexamined Patent Publication No. 2012-168844
Patent Document 3: Japanese Unexamined Patent Publication No. 9-259133

However, conventional suggestion generation has a problem that it may not be possible to present a word related to the input word.

For example, in the technique described in Patent Document 1, search query candidates are generated from the history of search queries. Therefore, if the user does not know a search query related to a given search query and has not used it in a past search, search query candidates related to the given search query cannot be presented.

Similarly, in the technique described in Patent Document 2, suggestion queries are generated from the search log database. Therefore, if the user does not know a search query related to a given search query and has not used it in a past search, suggestion queries related to the given search query cannot be presented.

Further, in the technique described in Patent Document 3, the word list to be displayed is generated from the group of document data files to be searched, but there is no guarantee that the word list generated in this manner includes words associated with the keyword.

The present invention is made to solve the above problems. The problem to be solved by the present invention is to provide a suggestion generating device, a suggestion generating method and a suggestion generating program for presenting words related to an input word with high accuracy.

In the generation of the suggestion, morphological analysis is performed on the text, the text is divided into a plurality of words, and the morphologically analyzed text is obtained.

Topic classification is performed on the morphologically analyzed text, and at least one topic word belonging to each topic of the plurality of topics is extracted from the plurality of words.

For each topic to which at least one topic word belongs, a score factor for each topic word is calculated. The score factor of each topic word indicates at least one of the feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs, and the in-topic appearance probability of each topic word on the topic to which each topic word belongs.

At least one affiliation topic word belonging to each topic is identified. At least one affiliation topic word includes at least a part of at least one topic word extracted.

At least one extracted topic is extracted from a plurality of topics. The extraction of the at least one extracted topic is performed such that the input word belongs to each extracted topic of the at least one extracted topic.

A score of each candidate word indicating the strength of the degree of association between the input word and each of the plurality of candidate words belonging to the at least one extracted topic is calculated.

In calculation of the score of each candidate word, at least one belonging topic is specified in at least one extracted topic. Identification of at least one affiliation topic is performed such that each candidate word belongs to each affiliation topic of at least one affiliation topic.

The score of each candidate word is calculated from at least one score factor of each candidate word calculated for each of at least one belonging topic.

A plurality of candidate words are presented in the order of the degree of relevance indicated by the score of each candidate word.

According to the present invention, since the presented words are extracted from the text through topic classification, a suggestion generating device, a suggestion generating method and a suggestion generating program that present words related to an input word with high accuracy are provided.

The objects, features, aspects and advantages of the present invention will be more apparent from the following detailed description and the accompanying drawings.

A block diagram illustrating the hardware configuration of the suggestion generating device of the first embodiment.
A block diagram illustrating the functional configuration of the suggestion generating device of the first embodiment.
A diagram explaining the processing on a plurality of topics performed in the suggestion generating device of the first embodiment.
A flowchart illustrating the flow of processing performed by the suggestion generating device of the first embodiment.
A diagram illustrating an example of the transition of data in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the transition of data in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the transition of data in the suggestion generating device of the first embodiment.
A diagram explaining the calculation algorithm of the suggestion score for each user group in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the search log stored in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the user management table stored in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the addition score factor table calculated in the suggestion generating device of the first embodiment.
A diagram illustrating an example of the suggestion word list created in the suggestion generating device of the first embodiment.
A diagram explaining an example of calculation by the first calculation method of the suggestion score of each candidate word in the suggestion generating device of the first embodiment.
A diagram explaining an example of calculation by the second calculation method of the suggestion score of each candidate word in the suggestion generating device of the first embodiment.
A diagram explaining an example of calculation by the third calculation method of the suggestion score of each candidate word in the suggestion generating device of the first embodiment.
A diagram explaining an example of calculation by the fourth calculation method of the suggestion score of each candidate word in the suggestion generating device of the first embodiment.
A diagram explaining another example of the calculation algorithm of the suggestion score of each candidate word for each user group in the suggestion generating device of the first embodiment.
A schematic diagram illustrating an example of the screen displayed in the suggestion generating device of the first embodiment.

1 Hardware Configuration

FIG. 1 is a block diagram illustrating the hardware configuration of the suggestion generating device of the first embodiment.

The suggestion generating apparatus 1000 illustrated in FIG. 1 is a personal computer (PC) on which a suggestion generation program 1020 is installed, and includes a central processing unit (CPU) 1040, a memory 1041, a hard disk drive 1042, and a display 1043. The suggestion generating apparatus 1000 may comprise components other than these components.

In the suggestion generating apparatus 1000, the suggestion generation program 1020 is installed in the hard disk drive 1042. The suggestion generation program 1020 may be installed by writing data read from an external storage medium 1060, such as a compact disc (CD), a digital versatile disc (DVD), or a universal serial bus (USB) memory, to the hard disk drive 1042, or by writing data received via the network 1080 to the hard disk drive 1042. The hard disk drive 1042 may be replaced with another type of auxiliary storage device. For example, the hard disk drive 1042 may be replaced by a solid state drive, a random access memory (RAM) disk, or the like. The hard disk drive 1042, the external storage medium 1060, the solid state drive, the RAM disk, and the like are computer readable recording media on which the suggestion generation program 1020 is recorded.

In the suggestion generating apparatus 1000, the suggestion generation program 1020 installed in the hard disk drive 1042 is loaded into the memory 1041, and the loaded suggestion generation program 1020 is executed by the CPU 1040, whereby the PC functions as the suggestion generating apparatus 1000.

2 Functional Configuration

FIG. 2 is a block diagram illustrating the functional configuration of the suggestion generating device of the first embodiment. FIG. 3 is a diagram for explaining processing on a plurality of topics performed in the suggestion generating device of the first embodiment.

As illustrated in FIG. 2, the suggestion generating apparatus 1000 includes a removal unit 1100, a morphological analysis unit 1101, a topic classification unit 1102, a score factor calculation unit 1103, an identifying unit 1104, a score calculation unit 1105, a presentation unit 1106, and a storage unit 1107, and generates a suggestion 1208 from the text 1200 to be searched or analyzed and the input word 1201. The storage unit 1107 stores a forced extraction term dictionary 1300, an exclusion term dictionary 1301, a search log 1302, and a user management table 1303. The suggestion generating apparatus 1000 may comprise components other than these components. The input word 1201 may be a search term used in a search, or may be a word input for creating a new text. The suggestion 1208 is a presentation of words associated with the input word 1201.

The removal unit 1100, the morphological analysis unit 1101, the topic classification unit 1102, the score factor calculation unit 1103, the identifying unit 1104, the score calculation unit 1105, and the presentation unit 1106 are configured by causing the PC to execute the suggestion generation program 1020. The storage unit 1107 is configured by at least one of the memory 1041 and the hard disk drive 1042.

All or part of the processing performed by the CPU 1040 may be performed by a processing device other than the CPU 1040. For example, all or part of the processing performed by the CPU 1040 may be performed by a graphics processing unit (GPU). All or part of the processing performed by the CPU 1040 may be performed by hardware that does not execute a program.

The removal unit 1100 removes stop words from the pre-removal text 1200, from which stop words have not yet been removed, and obtains the post-removal text 1202, from which the stop words have been removed. If stop words do not need to be removed, for example because the text 1200 to be searched or analyzed does not include stop words, the removal unit 1100 may be omitted.

The morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202 to divide the post-removal text 1202 into a plurality of words, and obtains a morphologically analyzed text 1203 including the plurality of words obtained by the division. The morphological analysis unit 1101 uses the forced extraction term dictionary 1300 in the morphological analysis of the post-removal text 1202. Use of the forced extraction term dictionary 1300 may be omitted.

The topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts at least one topic word 1204 belonging to each topic of a plurality of topics from the plurality of words included in the morphologically analyzed text 1203.

The score factor calculation unit 1103 calculates a score factor 1205 of each topic word with respect to the topic to which each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 belongs. The score factor 1205 of each topic word indicates at least one of the feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and the in-topic appearance probability of each topic word in the topic to which it belongs. The score factor 1205 of each topic word can be a factor included in the suggestion score of a candidate word described later.

The identifying unit 1104 identifies at least one affiliation topic word 1206 belonging to each topic of the plurality of topics 1250, as illustrated in the drawings. The at least one affiliation topic word 1206 belonging to each topic includes at least a part of the at least one topic word 1204 belonging to that topic extracted by the topic classification unit 1102. The identifying unit 1104 uses the search log 1302 and the exclusion term dictionary 1301 in identifying the at least one affiliation topic word 1206 belonging to each topic. As a result, the at least one affiliation topic word 1206 belonging to each topic includes at least a part of the at least one topic word 1204 belonging to that topic, and also includes unextracted words that are not included in the at least one topic word 1204 belonging to that topic. The use of at least one of the search log 1302 and the exclusion term dictionary 1301 may be omitted. When the use of the search log 1302 is omitted, the at least one affiliation topic word 1206 belonging to each topic does not include unextracted words that are not included in the at least one topic word 1204 belonging to that topic. When the use of the exclusion term dictionary 1301 is omitted, the at least one affiliation topic word 1206 belonging to each topic includes all of the at least one topic word 1204 belonging to that topic.

The score calculation unit 1105 extracts at least one to-be-extracted topic 1251 to which the input word 1201 belongs from a plurality of topics 1250 as illustrated in FIG. 3. Extraction of at least one extracted topic 1251 is performed such that the input word 1201 belongs to each extracted topic of the at least one extracted topic 1251. A plurality of words belonging to at least one extracted topic 1251 become a plurality of candidate words 1260 which may be presented in the generation of the suggestion 1208.

The score calculation unit 1105 calculates a suggestion score of each candidate word 1261 indicating the strength of the degree of association between the input word 1201 and each candidate word 1261 of the plurality of candidate words 1260. The score calculation unit 1105 specifies at least one affiliation topic 1252 to which each candidate word 1261 belongs in at least one extracted topic 1251 in the calculation of the suggestion score of each candidate word 1261. Identification of at least one affiliation topic 1252 is performed such that each candidate word 1261 belongs to each affiliation topic of the at least one affiliation topic 1252.

The score calculation unit 1105 calculates a suggestion score of each candidate word 1261 from at least one score factor of each candidate word 1261 calculated for each of at least one belonging topic 1252.

The score calculation unit 1105 creates a suggestion word list 1207 by sorting the plurality of candidate words 1260 in the order of the degree of relevance indicated by the suggestion score of each candidate word 1261 as illustrated in FIG. 2. The score calculation unit 1105 uses the search log 1302 and the user management table 1303 in creating the suggestion word list 1207, and creates a suggestion word list 1207 unique to each user group for each user group.

The presentation unit 1106 generates a suggestion 1208 according to the suggestion word list 1207. In the suggestion 1208, a plurality of candidate words 1260 included in the suggestion word list 1207 are presented in the order of the degree of relevance indicated by the suggestion score of each candidate word 1261.

According to the suggestion generating apparatus 1000, the suggestion 1208 is generated from the text 1200 to be searched or analyzed and the input word 1201. Therefore, as long as the text 1200 exists, the suggestion 1208 is automatically generated and words associated with the input word 1201 are automatically presented, even when a search history such as the search log 1302 does not exist or is insufficient. Further, since the presented words are not simply extracted from the text 1200 but are extracted from the text 1200 through topic classification, a highly accurate suggestion 1208 is generated.

3 Example of Transition of Processing and Data

FIG. 4 is a flowchart illustrating the flow of processing performed by the suggestion generating device of the first embodiment. FIG. 5, FIG. 6 and FIG. 7 are diagrams illustrating examples of the transition of data in the suggestion generating device of the first embodiment.

In step S101 illustrated in FIG. 4, the removal unit 1100 removes stop words from the text 1200 to be searched or analyzed, and obtains the post-removal text 1202. The text 1200 to be searched or analyzed is, for example, text created in the past. The stop words to be removed are words that would be unnecessary noise in the subsequent analysis, such as identification codes that do not represent the specific content of the text 1200. Strings commonly included in various URLs, such as "http: //", are also removed as stop words. In the example illustrated in FIG. 5, the text 1200 includes a text element 1400 "R000003", a text element 1401 "development process customization", a text element 1402 "master data (user, project, product, ...)", a text element 1403 "R000002", a text element 1404 "of process ratio at the time of prediction formula registration ...", and a text element 1405 "you can input the process ratio to the second decimal place ...", and the text elements 1400 and 1403 are removed as stop words.
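The stop-word removal of step S101 can be pictured with a minimal sketch. The regular expressions below (an identification code of the form "R" followed by six digits, and a URL scheme prefix) are assumptions made for illustration, as is the function name `remove_stop_words`; the source does not specify how stop words are detected.

```python
import re

# Hypothetical patterns, assumed for illustration only:
# an identification code such as "R000003" and a URL scheme prefix.
ID_CODE = re.compile(r"R\d{6}")
URL_PREFIX = re.compile(r"https?://")

def remove_stop_words(text_elements):
    """Drop elements that consist only of an ID code; strip URL prefixes from the rest."""
    cleaned = []
    for element in text_elements:
        if ID_CODE.fullmatch(element.strip()):
            continue  # the whole element is noise for the later analysis
        cleaned.append(URL_PREFIX.sub("", element))
    return cleaned

elements = [
    "R000003",
    "development process customization",
    "R000002",
    "see http://example.com/spec",
]
print(remove_stop_words(elements))
```

Elements that are nothing but an identification code are dropped entirely, which mirrors the removal of the text elements 1400 and 1403 in the FIG. 5 example.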

In step S102 subsequent to step S101 illustrated in FIG. 4, the morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202, divides the post-removal text 1202 into a plurality of words, and obtains a morphologically analyzed text 1203 including the plurality of words obtained by the division. In the example illustrated in FIG. 5, the text element 1401 is divided into a plurality of words 1411, "development process" and "customize"; the text element 1402 is divided into a plurality of words such as "master data", "user", "project", and "product"; the text element 1404 is divided into a plurality of words 1414 such as "prediction equation", "registration", "time", "no", "step", "proportion", and "no"; and the text element 1405 is divided into a plurality of words such as "process", "rate", "no", "input", "ha", "decimal point", "second place", "up", "input", and "possible".

Using the forced extraction term dictionary 1300, in which technical terms that are compound words consisting of two or more morphemes are registered, the morphological analysis unit 1101 forcibly extracts the registered technical terms from the post-removal text 1202, and divides the post-removal text 1202 into a plurality of words so that the plurality of words included in the morphologically analyzed text 1203 include the extracted technical terms. As a result, technical terms that are compound words are extracted intact without being split. In the example shown in FIG. 5, the technical term 1416 "master data" and the technical term 1417 "prediction formula" are forcibly extracted.
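The effect of the forced extraction dictionary can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for a real morphological analyzer, the longest-match strategy is one possible way to keep compound terms intact, and the function name `tokenize_with_forced_terms` is hypothetical.

```python
def tokenize_with_forced_terms(text, forced_terms):
    """Split text on whitespace, keeping registered compound terms as single tokens.

    Whitespace splitting is an assumed stand-in for morphological analysis.
    """
    tokens = text.split()
    # Registered terms as token tuples, longest first so the longest match wins.
    term_tuples = sorted(
        (tuple(term.split()) for term in forced_terms if term.split()),
        key=len,
        reverse=True,
    )
    result = []
    i = 0
    while i < len(tokens):
        for term in term_tuples:
            if tuple(tokens[i:i + len(term)]) == term:
                result.append(" ".join(term))  # emit the compound term undivided
                i += len(term)
                break
        else:
            result.append(tokens[i])  # no registered term starts here
            i += 1
    return result

print(tokenize_with_forced_terms(
    "register master data for prediction formula",
    {"master data", "prediction formula"},
))
```

Without the dictionary, "master data" would come out as two separate words; with it, the compound term survives as one word.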

In step S103 following step S102 illustrated in FIG. 4, the topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts, from the plurality of words, at least one topic word 1204 belonging to each topic of the plurality of topics 1250. Topic classification estimates the topics handled in the input text and classifies the sentences constituting the input text into a plurality of topics. A topic indicates a subject, a field, or the like. In the example illustrated in FIG. 6, the following topic words are extracted for the topics to which the respective topic numbers are assigned:

Topic No. "0": a plurality of topic words 1420 such as "application", "version", "development", and "specification"
Topic No. "1": a plurality of topic words 1421 such as "test", "debug", "single", and "management"
Topic No. "2": a plurality of topic words 1422 such as "soft", "Correspondence", "Due", and "Confirmation"
Topic No. "3": a plurality of topic words 1423 such as "Design", "Use Case", "Button", and "arrange"
Topic No. "4": a plurality of topic words 1424 such as "release", "correspondence", "note", and "prepare"
Topic No. "5": a plurality of topic words 1425 such as "inquire", "receive", "answer", and "description"
Topic No. "6": a plurality of topic words 1426 such as "customer", "hearing", "main request", and "sub request"

In step S104 following step S103 illustrated in FIG. 4, the score factor calculation unit 1103 calculates the score factor of each topic word for the topic to which each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 belongs. The score factor of each topic word indicates at least one of the feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and the in-topic appearance probability of each topic word in the topic to which it belongs. In the example illustrated in FIG. 6, the following values are calculated for the topic given the topic ID "corpus1_0_0": a feature degree 1440 of "4.675" and an in-topic appearance probability 1450 of "11.21%" for the topic word 1430 "app"; a feature degree 1441 of "4.435" and an in-topic appearance probability 1451 of "5.00%" for the topic word 1431 "debug"; a feature degree 1442 of "3.599" and an in-topic appearance probability 1452 of "4.30%" for a further topic word; a feature degree 1443 of "3.199" and an in-topic appearance probability 1453 for the topic word 1433 "language"; and a feature degree 1444 of "2.620" and an in-topic appearance probability 1454 of "3.35%" for the topic word "version".

The feature degree of each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 is an index indicating how readily each topic word appears in the topic to which it belongs. It is determined so as to increase as the in-topic appearance probability of each topic word obtained in the topic classification increases, and to decrease as the appearance frequency of each topic word in the text 1200 to be searched or analyzed increases. Desirably, the feature degree of each topic word is obtained by dividing the in-topic appearance probability of each topic word by the frequency of appearance of each topic word in the text, as shown in equation (1). Dividing by the frequency of appearance of each topic word in the text suppresses the tendency to present words that belong to many topics and therefore only weakly characterize each of them.
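A minimal sketch of this division is given below, taking the frequency of appearance in the text as the number of occurrences of a word divided by the total number of words. The function name `feature_degrees` and the data layout are assumptions for illustration, and the in-topic appearance probabilities are supplied as inputs rather than estimated.

```python
from collections import Counter

def feature_degrees(in_topic_probability, words):
    """Feature degree = in-topic appearance probability / appearance frequency in the text.

    in_topic_probability: {topic_word: probability within one topic} (assumed input)
    words: every word of the morphologically analyzed text
    """
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    degrees = {}
    for word, probability in in_topic_probability.items():
        frequency = counts[word] / total  # appearance frequency in the whole text
        if frequency > 0:  # skip words that never occur in the text
            degrees[word] = probability / frequency
    return degrees

words = ["app", "test", "test", "test"]
print(feature_degrees({"app": 0.5, "test": 0.5}, words))
```

In this toy input both words have the same in-topic appearance probability, but "test" is common throughout the text, so its feature degree comes out lower than that of "app".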

The frequency of appearance of each topic word in the text is obtained by dividing the number of appearances of each topic word in the text by the number of words in the entire text, as shown in equation (2).
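Equations (1) and (2) are not reproduced in this text. Based solely on the verbal description above, they can be written as follows; the symbols are assumptions introduced here, with $F(w, t)$ the feature degree of topic word $w$ for topic $t$, $P(w \mid t)$ its in-topic appearance probability, $f(w)$ its frequency of appearance in the text, $n(w)$ its number of appearances in the text, and $N$ the number of words in the entire text.

```latex
\begin{align}
F(w, t) &= \frac{P(w \mid t)}{f(w)} \tag{1} \\
f(w) &= \frac{n(w)}{N} \tag{2}
\end{align}
```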

In step S105 following step S104 shown in FIG. 4, it is determined whether or not there is a search log 1302 in which words used in past searches are recorded. If it is determined that the search log 1302 exists, unextracted words are added in step S106, the addition score factor is calculated in step S107, and exclusion terms are deleted in step S108. On the other hand, when it is determined that the search log 1302 does not exist, steps S106 and S107 are skipped and exclusion terms are deleted in step S108.

In step S106, as illustrated in FIG. 7, the identifying unit 1104 identifies, from the search log 1302, unextracted words that have been used in past searches at least a set number of times but are not included in the at least one topic word 1204 extracted by the topic classification unit 1102, adds the identified unextracted words to the at least one topic word 1204 extracted by the topic classification unit 1102, and obtains the updated at least one topic word 1209. As a result, the at least one affiliation topic word 1206 identified by the identifying unit 1104 includes the unextracted words.
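Step S106 can be sketched as a simple frequency filter over the search log. The function name `add_unextracted_words`, the flat list layout, and the use of "greater than or equal to" for the threshold comparison are assumptions; the source does not state which topic an added word is assigned to, so the sketch leaves that out.

```python
from collections import Counter

def add_unextracted_words(topic_words, search_log_words, min_uses):
    """Append words searched at least min_uses times that topic classification did not extract."""
    counts = Counter(search_log_words)
    extracted = set(topic_words)
    unextracted = [
        word for word, uses in counts.items()
        if uses >= min_uses and word not in extracted
    ]
    return list(topic_words) + unextracted

updated = add_unextracted_words(
    ["app", "test"],
    ["app", "login", "login", "debug"],
    min_uses=2,
)
print(updated)
```

Here "login" is added because it was searched twice and is not yet a topic word, while "debug" is ignored because it falls below the threshold.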

FIG. 8 is a diagram for explaining a calculation algorithm of the suggestion score of each candidate word for each user group in the suggestion generating device of the first embodiment. FIG. 9 is a diagram illustrating an example of a search log stored in the suggestion generating device of the first embodiment. FIG. 10 is a diagram illustrating an example of a user management table stored in the suggestion generating device of the first embodiment. FIG. 11 is a diagram illustrating an example of an addition score factor table calculated in the suggestion generating device of the first embodiment.

In the search log 1302, information identifying the user who performed each search and the word used in each search are recorded in association with each other. In the example illustrated in FIG. 9, for example, a user identifier (ID) 1500 "001", a search word 1501 "application", and a search time 1502 "2016-12-26 16:55:22.916" are recorded in association with one another. The user ID 1500 is information identifying the user who performed each search. The search word 1501 is the word used in each search.

The user management table 1303 stores information identifying a user and information identifying the user group to which the user belongs, in association with each other. In the example illustrated in FIG. 10, for example, a user ID 1510 "001", a name 1511 "XXXX", and a group (department) ID 1512 "G001" are stored in association with one another, and a group (department) ID 1520 "G001" and a name 1521 "user window" are stored in association with each other. The user ID 1510 and the name 1511 are information identifying a user. The group (department) ID 1520 and the name 1521 are information identifying the user group to which the user belongs.

By referring to the search log 1302 and the user management table 1303, it is possible to identify the used word used by the user who belongs to each user group in the past search.
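This lookup amounts to joining the search log with the user management table on the user ID. A minimal sketch follows; the in-memory layouts (a list of `(user_id, search_word)` pairs and a `{user_id: group_id}` mapping) and the function name `used_words_by_group` are assumptions, not the stored formats of the search log 1302 or the user management table 1303.

```python
from collections import defaultdict

def used_words_by_group(search_log, user_table):
    """Collect, per user group, the words its members used in past searches.

    search_log: iterable of (user_id, search_word) pairs (assumed layout)
    user_table: {user_id: group_id} (assumed layout)
    """
    words = defaultdict(list)
    for user_id, word in search_log:
        group_id = user_table.get(user_id)
        if group_id is not None:  # ignore log entries from unknown users
            words[group_id].append(word)
    return dict(words)

log = [("001", "application"), ("002", "test"), ("001", "debug"), ("999", "x")]
table = {"001": "G001", "002": "G002"}
print(used_words_by_group(log, table))
```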

In step S107 shown in FIG. 4, the score factor calculation unit 1103 identifies, for each user group, the used words that users belonging to that user group used in past searches from the search log 1302 and the user management table 1303, and calculates the addition score factor 1530 of the topic to which the identified used words belong. In the example illustrated in FIG. 11, for example, for the user group to which the group ID 1540 "G001" is assigned, the addition score factor 1542 of "10" is calculated for the topic to which the topic ID 1541 "corpus1_0_0" is assigned.

In addition, as shown in FIG. 8, the score factor calculation unit 1103 calculates, for each user group, the score factor 1205 of each topic word by adding the addition score factor 1530 of the topic to which each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 belongs to the pre-addition score factor 1531 of each topic word calculated in step S104. The score factor 1205 of each topic word also indicates at least one of the feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and the in-topic appearance probability of each topic word in the topic to which it belongs, and is a score factor specific to each user group. The score factor 1205 of each topic word specific to each user group makes it possible to generate a suggestion 1208 suitable for each user group. The score factor 1205 of each topic word calculated in step S107 is used to calculate the suggestion score 1532 of each candidate word 1261. Step S107 may be omitted, and the score factor of each topic word calculated in step S104 may be used to calculate the suggestion score 1532 of each candidate word 1261.
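The addition for one user group can be sketched as below. The function name `apply_addition`, the keying of score factors by `(topic_id, word)`, and the topic ID "corpus1_0_1" are assumptions for illustration; how the addition score factor itself is derived from the used words is not detailed here, so it is taken as a given input.

```python
def apply_addition(pre_addition, addition_by_topic):
    """Add a group's per-topic addition score factor to each pre-addition score factor.

    pre_addition: {(topic_id, word): score factor from step S104} (assumed layout)
    addition_by_topic: {topic_id: addition score factor for one user group}
    """
    return {
        (topic_id, word): factor + addition_by_topic.get(topic_id, 0.0)
        for (topic_id, word), factor in pre_addition.items()
    }

pre = {("corpus1_0_0", "app"): 4.675, ("corpus1_0_1", "test"): 2.0}
g001_addition = {"corpus1_0_0": 10}  # value taken from the FIG. 11 example
print(apply_addition(pre, g001_addition))
```

Topics without an entry for the group are left unchanged, which corresponds to a group that never searched for words of that topic.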

In step S108 illustrated in FIG. 4, as illustrated in FIG. 7, the identification unit 1104 uses the exclusion term dictionary 1301, in which exclusion terms unnecessary for search or analysis are registered, to delete the exclusion terms registered in the exclusion term dictionary 1301 from the at least one topic word 1209 and thereby obtain at least one affiliation topic word 1206. As a result, the at least one affiliation topic word 1206 specified by the identification unit 1104 does not include any exclusion term.
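A minimal sketch of the filtering in step S108, assuming the topic words are held as a mapping from topic ID to a word list; the function name and the sample words are hypothetical.

```python
def remove_exclusion_terms(topic_words, exclusion_terms):
    """Drop every word registered in the exclusion term dictionary."""
    excluded = set(exclusion_terms)
    return {
        topic_id: [word for word in words if word not in excluded]
        for topic_id, words in topic_words.items()
    }


# Illustrative data only.
topic_words = {"k": ["app", "version", "thing"], "m": ["version", "update"]}
affiliation_topic_words = remove_exclusion_terms(topic_words, ["thing"])
```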

In step S109 subsequent to step S108 illustrated in FIG. 4, the score calculation unit 1105 extracts, from the plurality of topics 1250, at least one extracted topic 1251 to which the input word 1201 belongs, as illustrated in FIG. 3. The at least one extracted topic 1251 is extracted such that the input word 1201 belongs to each extracted topic of the at least one extracted topic 1251.

In addition, the score calculation unit 1105 creates a suggestion candidate list 1210 including a plurality of candidate words 1260 belonging to the at least one extracted topic 1251, as illustrated in FIG. 7.

In step S110 following step S109 illustrated in FIG. 4, the score calculation unit 1105 calculates a suggestion score 1532 of each candidate word 1261, which indicates the strength of the degree of association between the input word 1201 and each candidate word 1261 of the plurality of candidate words 1260 included in the suggestion candidate list 1210. In calculating the suggestion score 1532 of each candidate word 1261, the score calculation unit 1105 specifies, in the at least one extracted topic 1251, at least one affiliation topic 1252 to which each candidate word 1261 belongs. The at least one affiliation topic 1252 is specified such that each candidate word 1261 belongs to each affiliation topic of the at least one affiliation topic 1252.

In addition, the score calculation unit 1105 calculates a suggestion score 1532 of each candidate word 1261 from at least one score factor 1205 of each candidate word 1261 calculated for each of at least one belonging topic 1252.

In addition, as illustrated in FIG. 7, the score calculation unit 1105 sorts the plurality of candidate words 1260 included in the suggestion candidate list 1210 in the order of the strength of the degree of association indicated by the suggestion score 1532 of each candidate word 1261, and thereby creates a suggestion word list 1207.
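Steps S109 and S110 can be sketched end to end as follows. This is an illustration under assumptions: the function name, the data layout, the exclusion of the input word from its own candidates, and the tie-break by word are not stated in the text, and the scoring rule is passed in as a parameter because several calculation methods are described.

```python
def build_suggestion_word_list(input_word, topic_factors, score_fn):
    """Return [(candidate word, suggestion score)], strongest association first."""
    # Step S109: extracted topics are the topics to which the input word belongs.
    extracted = [t for t, words in topic_factors.items() if input_word in words]
    # Suggestion candidate list: words belonging to any extracted topic.
    # Assumption: the input word itself is not offered as a candidate.
    candidates = {w for t in extracted for w in topic_factors[t] if w != input_word}
    scored = []
    for word in candidates:
        # Step S110: affiliation topics are extracted topics containing the candidate.
        affiliation = [t for t in extracted if word in topic_factors[t]]
        scored.append((word, score_fn(input_word, word, affiliation)))
    # Sort by descending score; ties are broken by word for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Illustrative score factors per topic; only the k and m values for
# "app" and "version" come from the FIG. 13 example.
topic_factors = {
    "k": {"app": 31.2, "version": 15.4},
    "l": {"app": 2.0, "install": 5.0},
    "m": {"app": 0.3, "version": 87.0},
    "n": {"printer": 9.0},
}


def max_product(keyword, word, topics):
    """Maximum over affiliation topics of input factor times candidate factor."""
    return max(topic_factors[t][keyword] * topic_factors[t][word] for t in topics)


ranked = build_suggestion_word_list("app", topic_factors, max_product)
```

Here “printer” never becomes a candidate because topic n does not contain the input word.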

Also, the score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261 from the at least one score factor 1205 of each candidate word 1261 calculated for the user group to which the user who has input the input word 1201 belongs, and creates a suggestion word list 1207 specific to the user group to which the user belongs.

FIG. 12 is a diagram illustrating an example of a suggestion word list created in the suggestion generating device of the first embodiment.

In the suggestion word list 1207, information specifying topics, candidate words and suggestion scores are stored in association with each other. In the example illustrated in FIG. 12, for example, a topic ID 1550 of "corpus 0_1_1", a topic word 1551 of "app", and a suggestion score 1552 of "4.675" are stored in association with each other. The topic ID 1550 is information for specifying a topic. The topic word 1551 is a candidate word.

In step S111 following step S110 illustrated in FIG. 4, the presentation unit 1106 generates a suggestion 1208 according to the suggestion word list 1207 as illustrated in FIG. 7. In the suggestion 1208, a plurality of candidate words 1260 included in the suggestion word list 1207 are presented in the order of the degree of relevance indicated by the suggestion score 1532 of each candidate word 1261.

4 First Calculation Method of Suggestion Score

FIG. 13 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the first calculation method.

In the first calculation method, as illustrated in FIG. 3, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 such that the input word 1201 belongs to each extracted topic. In the calculation example shown in FIG. 13, at least one extracted topic 1610, consisting of topics k, l and m, is extracted such that the input word 1600 of “app” belongs to each extracted topic.

Also, as illustrated in FIG. 3, the score calculation unit 1105 specifies at least one affiliation topic 1252 in the at least one extracted topic 1251 such that the candidate word 1261 belongs to each affiliation topic. In the calculation example shown in FIG. 13, at least one affiliation topic 1611, consisting of topics k and m, is specified such that the candidate word 1601 of “version” belongs to each affiliation topic.

In addition, for each belonging topic of the at least one belonging topic 1252, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 calculated for that belonging topic and the score factor 1205 of the candidate word 1261 calculated for that belonging topic. In the calculation example illustrated in FIG. 13, for topic k, the product 1622 of “31.2 × 15.4 = 480.48” is calculated from the feature degree 1620 of “31.2” of the input word 1600 “app” calculated for topic k and the feature degree 1621 of “15.4” of the candidate word 1601 “version” calculated for topic k. For topic m, the product 1625 of “0.3 × 87.0 = 26.1” is calculated from the feature degree 1623 of “0.3” of the input word 1600 “app” calculated for topic m and the feature degree 1624 of “87.0” of the candidate word 1601 “version” calculated for topic m.

In addition, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the degree of association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one product calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 13, the maximum value 1626 of “480.48” among the product 1622 of “31.2 × 15.4 = 480.48” calculated for topic k and the product 1625 of “0.3 × 87.0 = 26.1” calculated for topic m is used as the suggestion score 1627 of the candidate word 1601. Instead of a suggestion score 1627 that matches the maximum value 1626, a suggestion score 1627 that includes the maximum value 1626 as a factor may be calculated. For example, a suggestion score 1627 equal to a constant multiple of the maximum value 1626 may be calculated.
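The FIG. 13 arithmetic can be reproduced with a short sketch. The function name and the dictionary layout are assumptions, the feature degree of “app” for topic l is a made-up value (the text gives none), and returning 0.0 when no topic is shared is likewise an assumption.

```python
def first_method_score(input_factors, candidate_factors):
    """Maximum, over belonging topics, of input factor times candidate factor."""
    # Belonging topics: extracted topics (those containing the input word)
    # that also contain the candidate word.
    belonging = input_factors.keys() & candidate_factors.keys()
    return max(
        (input_factors[t] * candidate_factors[t] for t in belonging),
        default=0.0,  # assumption: no shared topic means no association
    )


# Feature degrees from the FIG. 13 example; the value for topic l is hypothetical.
app_feature = {"k": 31.2, "l": 4.0, "m": 0.3}
version_feature = {"k": 15.4, "m": 87.0}
score = first_method_score(app_feature, version_feature)
print(round(score, 2))  # 480.48
```

Topic l drops out automatically because “version” has no score factor for it, which mirrors the narrowing from extracted topics k, l and m to affiliation topics k and m.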

In the first calculation method, generally speaking, the suggestion score $\mathrm{Score}(word)$ of the candidate word $word$ is calculated by Equation (3) using the at least one belonging topic $T(keyword, word)$, the feature degree $\mathrm{feature}_{keyword,t}$ of the input word $keyword$ calculated for topic $t$, and the feature degree $\mathrm{feature}_{word,t}$ of the candidate word $word$ calculated for topic $t$.
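Equation (3) itself does not appear in this text. Assuming it simply takes the maximum of the per-topic products described above, with no additional constant factor, it could be written as:

```latex
\mathrm{Score}(word) = \max_{t \in T(keyword,\,word)}
  \left( \mathrm{feature}_{keyword,t} \times \mathrm{feature}_{word,t} \right)
```

With the FIG. 13 values this gives $\max(31.2 \times 15.4,\ 0.3 \times 87.0) = \max(480.48,\ 26.1) = 480.48$.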

According to the first calculation method, a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, is hardly reflected in the suggestion score 1532 of the candidate word 1261.

5 Second Calculation Method of Suggestion Score

FIG. 14 is a diagram for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the second calculation method.

In the second calculation method, as in the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250, specifies at least one affiliation topic 1252 in the at least one extracted topic 1251, and, for each affiliation topic, calculates the product of the score factor 1205 of the input word 1201 calculated for that affiliation topic and the score factor 1205 of the candidate word 1261 calculated for that affiliation topic.

In the second calculation method, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the degree of association between the input word 1201 and the candidate word 1261, from the product of the at least one product calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 14, the product 1628 of “480.48 × 26.1 = 12540.528” of the product 1622 of “31.2 × 15.4 = 480.48” calculated for topic k and the product 1625 of “0.3 × 87.0 = 26.1” calculated for topic m is used as the suggestion score 1629 of the candidate word 1601. Instead of a suggestion score 1629 that matches the product 1628, a suggestion score 1629 that includes the product 1628 as a factor may be calculated. For example, a suggestion score 1629 equal to a constant multiple of the product 1628 may be calculated.
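The FIG. 14 arithmetic can be checked with the sketch below. As before, the function name and data layout are assumptions, the value for topic l is hypothetical, and the 0.0 returned when no topic is shared is an assumption. `math.prod` requires Python 3.8 or later.

```python
import math


def second_method_score(input_factors, candidate_factors):
    """Product, over belonging topics, of input factor times candidate factor."""
    belonging = input_factors.keys() & candidate_factors.keys()
    if not belonging:
        return 0.0  # assumption: no shared topic means no association
    return math.prod(input_factors[t] * candidate_factors[t] for t in belonging)


# Feature degrees from the FIG. 14 example; the value for topic l is hypothetical.
app_feature = {"k": 31.2, "l": 4.0, "m": 0.3}
version_feature = {"k": 15.4, "m": 87.0}
score = second_method_score(app_feature, version_feature)
print(round(score, 3))  # 12540.528
```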

In the second calculation method, generally speaking, the suggestion score $\mathrm{Score}(word)$ of the candidate word $word$ is calculated by Equation (4) using the at least one belonging topic $T(keyword, word)$, the feature degree $\mathrm{feature}_{keyword,t}$ of the input word $keyword$ calculated for topic $t$, and the feature degree $\mathrm{feature}_{word,t}$ of the candidate word $word$ calculated for topic $t$.
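Equation (4) is likewise missing from this text. Under the assumption that it multiplies the per-topic products without any further factor, one consistent form is:

```latex
\mathrm{Score}(word) = \prod_{t \in T(keyword,\,word)}
  \left( \mathrm{feature}_{keyword,t} \times \mathrm{feature}_{word,t} \right)
```

For the FIG. 14 values this evaluates to $480.48 \times 26.1 = 12540.528$.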

According to the second calculation method, both a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, are reflected in the suggestion score 1532 of the candidate word 1261.

6 Third Calculation Method of Suggestion Score

FIG. 15 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the third calculation method.

In the third calculation method, as in the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 and specifies at least one affiliation topic 1252 in the at least one extracted topic 1251.

In the third calculation method, for each belonging topic, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 calculated for that belonging topic and the score factor 1205 of the candidate word 1261 calculated for that belonging topic. In the calculation example shown in FIG. 15, for topic k, the product 1631 of “31.2 × 0.025 = 0.78” is calculated from the feature degree 1620 of “31.2” of the input word 1600 “app” calculated for topic k and the in-topic appearance probability 1630 of “0.025” of the candidate word 1601 “version” calculated for topic k. For topic m, the product 1633 of “0.3 × 0.350 = 0.105” is calculated from the feature degree 1623 of “0.3” of the input word 1600 “app” calculated for topic m and the in-topic appearance probability 1632 of “0.350” of the candidate word 1601 “version” calculated for topic m.

In addition, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the degree of association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one product calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 15, the maximum value 1634 of “0.78” among the product 1631 of “31.2 × 0.025 = 0.78” calculated for topic k and the product 1633 of “0.3 × 0.350 = 0.105” calculated for topic m is used as the suggestion score 1635 of the candidate word 1601. Instead of a suggestion score 1635 that matches the maximum value 1634, a suggestion score 1635 that includes the maximum value 1634 as a factor may be calculated. For example, a suggestion score 1635 equal to a constant multiple of the maximum value 1634 may be calculated.
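A sketch of the FIG. 15 calculation, in which the input word contributes its feature degree and the candidate word contributes its in-topic appearance probability. The function name, the data layout, the topic l value, and the 0.0 default are assumptions.

```python
def third_method_score(input_feature, candidate_probability):
    """Maximum, over belonging topics, of input feature degree times
    candidate in-topic appearance probability."""
    belonging = input_feature.keys() & candidate_probability.keys()
    return max(
        (input_feature[t] * candidate_probability[t] for t in belonging),
        default=0.0,  # assumption: no shared topic means no association
    )


# Values from the FIG. 15 example; the value for topic l is hypothetical.
app_feature = {"k": 31.2, "l": 4.0, "m": 0.3}
version_probability = {"k": 0.025, "m": 0.350}
score = third_method_score(app_feature, version_probability)
print(round(score, 2))  # 0.78
```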

In the third calculation method, generally speaking, the suggestion score $\mathrm{Score}(word)$ of the candidate word $word$ is calculated by Equation (5) using the at least one belonging topic $T(keyword, word)$, the feature degree $\mathrm{feature}_{keyword,t}$ of the input word $keyword$ calculated for topic $t$, and the in-topic appearance probability $\mathrm{probability}_{word,t}$ of the candidate word $word$ calculated for topic $t$.
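The text omits Equation (5). If it is taken to be the plain maximum of the per-topic products, a form consistent with the description would be:

```latex
\mathrm{Score}(word) = \max_{t \in T(keyword,\,word)}
  \left( \mathrm{feature}_{keyword,t} \times \mathrm{probability}_{word,t} \right)
```

The FIG. 15 values give $\max(31.2 \times 0.025,\ 0.3 \times 0.350) = \max(0.78,\ 0.105) = 0.78$.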

According to the third calculation method, a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a large in-topic appearance probability, which indicates that the word appears with high probability in the topic to which it belongs, are readily reflected in the suggestion score 1532 of the candidate word 1261. Conversely, a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, and a small in-topic appearance probability, which indicates that the word appears with low probability in the topic to which it belongs, are hardly reflected in the suggestion score 1532 of the candidate word 1261.

7 Fourth Calculation Method of Suggestion Score

FIG. 16 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the fourth calculation method.

In the fourth calculation method, as in the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 and specifies at least one affiliation topic 1252 in the at least one extracted topic 1251.

In the fourth calculation method, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the degree of association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one score factor 1205 of the candidate word 1261 calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 16, the maximum value 1638 of “0.350” among the in-topic appearance probability 1636 of “0.025” of the candidate word 1601 “version” calculated for topic k and the in-topic appearance probability 1637 of “0.350” of the candidate word 1601 “version” calculated for topic m is used as the suggestion score 1639 of the candidate word 1601. Instead of a suggestion score 1639 that matches the maximum value 1638, a suggestion score 1639 that includes the maximum value 1638 as a factor may be calculated. For example, a suggestion score 1639 equal to a constant multiple of the maximum value 1638 may be calculated.
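The FIG. 16 calculation uses only the candidate word's score factors, as in the following sketch. The function name, the data layout, and the 0.0 default are assumptions; the extracted topics k, l and m are those of the running example.

```python
def fourth_method_score(extracted_topics, candidate_probability):
    """Maximum in-topic appearance probability of the candidate word
    over its belonging topics."""
    # Belonging topics: extracted topics that also contain the candidate word.
    belonging = candidate_probability.keys() & set(extracted_topics)
    return max(
        (candidate_probability[t] for t in belonging),
        default=0.0,  # assumption: no belonging topic means no association
    )


# Values from the FIG. 16 example.
version_probability = {"k": 0.025, "m": 0.350}
score = fourth_method_score({"k", "l", "m"}, version_probability)
print(score)  # 0.35
```

Because the input word's own score factor is not used, the input word influences the result only through the choice of extracted topics.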

In the fourth calculation method, generally speaking, the suggestion score $\mathrm{Score}(word)$ of the candidate word $word$ is calculated by Equation (6) using the at least one belonging topic $T(keyword, word)$ and the in-topic appearance probability $\mathrm{probability}_{word,t}$ of the candidate word $word$ calculated for topic $t$.
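Equation (6) is not shown in this text either. On the assumption that it is a plain maximum over the belonging topics, it could read:

```latex
\mathrm{Score}(word) = \max_{t \in T(keyword,\,word)} \mathrm{probability}_{word,t}
```

For the FIG. 16 values, $\max(0.025,\ 0.350) = 0.350$.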

According to the fourth calculation method, a large in-topic appearance probability, which indicates that a word appears with high probability in the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small in-topic appearance probability, which indicates that a word appears with low probability in the topic to which it belongs, is hardly reflected in the suggestion score 1532 of the candidate word 1261.

8 Another Example of Calculation of Suggestion Score for Each User Group

FIG. 17 is a view for explaining another example of a calculation algorithm of the suggestion score of each candidate word for each user group in the suggestion generating device of the first embodiment.

In the other example, the score calculation unit 1105 calculates a pre-addition suggestion score 1700 indicating the strength of the degree of association between the input word 1201 and each candidate word 1261 from the score factor 1205 of each topic word.

Also, the score calculation unit 1105 identifies, for each user group, the used words that users belonging to that user group used in past searches from the search log 1302 and the user management table 1303, calculates the addition score 1701 of the used words, and calculates the suggestion score 1532 of each candidate word 1261 by adding the addition score 1701 of each candidate word 1261 to the pre-addition suggestion score 1700 of each candidate word 1261.
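This alternative adds the group-specific amount at the level of the suggestion score rather than the score factor. A minimal sketch follows; the function name and the rule for the addition score (a fixed hypothetical bonus per past use) are assumptions, since the text does not specify how the addition score 1701 is computed.

```python
from collections import Counter

# Hypothetical bonus per past use; the source does not define this value.
BONUS_PER_USE = 1.0


def group_suggestion_scores(pre_addition_scores, group_used_words):
    """Add a per-word addition score, based on the group's past searches,
    to each candidate word's pre-addition suggestion score."""
    uses = Counter(group_used_words)
    return {
        word: score + BONUS_PER_USE * uses[word]
        for word, score in pre_addition_scores.items()
    }


# Illustrative data: the group searched "install" twice and "printer" once.
pre_addition_scores = {"version": 480.48, "install": 10.0}
scores = group_suggestion_scores(
    pre_addition_scores, ["install", "install", "printer"]
)
```

Words the group searched that are not candidates, such as “printer” here, have no effect on the result.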

9 Example of Screen

FIG. 18 is a schematic view illustrating an example of a screen displayed in the suggestion generating device of the first embodiment.

The screen 1800 illustrated in FIG. 18 is displayed on the display 1043.

The screen 1800 includes a text box 1820 for receiving an input of an input word 1201 used for a search, a button 1821 for receiving an instruction to start a search, and an area 1822 for displaying a suggestion 1208. Each of text box 1820 and button 1821 may be replaced with another type of graphical user interface (GUI) component.

In the example shown in FIG. 18, a plurality of candidate words 1830 are displayed simultaneously in the area 1822, arranged in an order that matches the order of the strength of the degree of association indicated by the suggestion score of each candidate word 1831. Alternatively, only one candidate word may be displayed at a time, and the displayed candidate word may be switched over time in an order that matches the order of the strength of the degree of association indicated by the suggestion score of each candidate word 1831.

Although the present invention has been described in detail, the above description is an exemplification in all aspects, and the present invention is not limited thereto. It is understood that countless variations not illustrated are conceivable without departing from the scope of the present invention.

1000 Suggestion generation device
1020 Suggestion generation program
1100 Removal unit
1101 Morphological analysis unit
1102 Topic classification unit
1103 Score factor calculation unit
1104 Identification unit
1105 Score calculation unit
1106 Presentation unit
1107 Storage unit
1200 Text to be searched or analyzed (text before removal)
1201 Input word
1202 Removed text
1203 Morphologically analyzed text
1204 At least one topic word
1205 Score factor of each topic word
1206 At least one affiliation topic word
1207 Suggestion word list
1208 Suggestion

Claims

A suggestion generation device comprising:
a morphological analysis unit that performs morphological analysis on a text, divides the text into a plurality of words, and obtains a morphologically analyzed text;
a topic classification unit that performs topic classification on the morphologically analyzed text and extracts, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
a score factor calculation unit that calculates, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs and an in-topic appearance probability of each topic word in the topic to which each topic word belongs;
an identification unit that identifies at least one affiliation topic word that belongs to each topic and that includes at least a portion of the at least one topic word;
a score calculation unit that extracts at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, calculates a score of each candidate word indicating the strength of the degree of association between the input word and each candidate word of a plurality of candidate words belonging to the at least one extracted topic, and, in the calculation, specifies at least one affiliation topic in the at least one extracted topic such that each candidate word belongs to each affiliation topic and calculates the score of each candidate word from at least one score factor of each candidate word calculated for each of the at least one affiliation topic; and
a presentation unit that presents the plurality of candidate words in the order of the strength of the degree of association indicated by the score of each candidate word.
The suggestion generation device according to claim 1, further comprising: a removal unit that removes a stop word from the pre-removal text and obtains the text.
It further comprises a storage unit for storing a forcedly extracted word dictionary in which compound words are registered,
The suggestion generation device according to claim 1, wherein the morphological analysis unit divides the text so that the plurality of words include the compound word.
It further comprises a storage unit for storing a search log in which words used in past searches are recorded,
The score factor calculation unit
calculates, for each topic word, a pre-addition score factor indicating at least one of a feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs and an in-topic appearance probability of each topic word in the topic to which each topic word belongs, and,
for each user group, identifies from the search log the used words that users belonging to that user group used in past searches, calculates the addition score factor of the topic to which the used words belong, and calculates the score factor of each topic word by adding the addition score factor of the topic to which each topic word belongs to the pre-addition score factor of each topic word,
The score calculation unit
The suggestion generation device according to any one of claims 1 to 3, wherein the score of each candidate word is calculated from at least one score factor of each candidate word calculated for a user group to which the user who has input the input word belongs.
The score factor of each topic word indicates a feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs,
The suggestion generation device according to any one of claims 1 to 4, wherein the feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs is obtained by dividing the in-topic appearance probability of each topic word in the topic to which each topic word belongs by the appearance frequency.
It further comprises a storage unit for storing a search log in which words used in past searches are recorded,
The suggestion generation device according to any one of claims 1 to 5, wherein the identification unit identifies, from the search log, unextracted words that have been used in past searches more than a set number of times but are not included in the at least one topic word, and specifies the at least one affiliation topic word such that the at least one affiliation topic word includes the unextracted words.
It further comprises a storage unit for storing an exclusion term dictionary in which exclusion terms are registered,
The suggestion generation device according to any one of claims 1 to 6, wherein the identification unit identifies the at least one affiliation topic word such that the at least one affiliation topic word does not include the exclusion terms.
The score calculation unit
Calculating a product of the score factor of the input word calculated for each of the affiliation topics and the score factor of each of the candidate words calculated for each of the affiliation topics for each of the affiliation topics;
The suggestion generating device according to any one of claims 1 to 7, wherein the score of each candidate word is calculated from the maximum value of at least one product calculated for each of the at least one belonging topic.
The score calculation unit
Calculating a product of the score factor of the input word calculated for each of the affiliation topics and the score factor of each of the candidate words calculated for each of the affiliation topics for each of the affiliation topics;
The suggestion generation device according to any one of claims 1 to 7, wherein the score of each candidate word is calculated from the product of at least one product respectively calculated for the at least one belonging topic.
The score factor of the input word calculated for each of the belonging topics indicates a feature degree indicating the degree to which the input word characterizes each of the belonging topics,
10. The suggestion generation device according to claim 8, wherein the score factor of each candidate word calculated for each affiliation topic indicates a feature degree indicating the degree to which each candidate word characterizes each affiliation topic.
The score factor of the input word calculated for each of the belonging topics indicates a feature degree indicating the degree to which the input word characterizes each of the belonging topics,
10. The suggestion generation device according to claim 8, wherein the score factor of each candidate word calculated for each of the belonging topics indicates the in-topic appearance probability of each of the candidate words in each of the belonging topics.
The score calculation unit
The suggestion generation device according to any one of claims 1 to 7, wherein the score of each candidate word is calculated from the maximum value of at least one score factor of each candidate word calculated for each of the at least one belonging topic.
13. The suggestion generation device according to claim 12, wherein the score factor of each candidate word calculated for each affiliation topic is an in-topic appearance probability of each candidate word in each affiliation topic.
It further comprises a storage unit for storing a search log in which words used in past searches are recorded,
The score calculation unit
Calculating a pre-addition score of each candidate word indicating the degree of association between the input word and each candidate word;
The suggestion generation device according to any one of claims 1 to 13, wherein, for each user group, the used words that users belonging to that user group used in past searches are identified from the search log, an addition score of the used words is calculated, and the score of each candidate word is calculated by adding the addition score of each candidate word to the pre-addition score of each candidate word.
A suggestion generation program that causes a computer to execute:
a) performing morphological analysis on a text to divide the text into a plurality of words and obtain a morphologically analyzed text;
b) performing topic classification on the morphologically analyzed text to extract, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
c) calculating, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs and an in-topic appearance probability of each topic word in the topic to which each topic word belongs;
d) identifying at least one affiliation topic word that belongs to each topic and that includes at least a part of the at least one topic word;
e) extracting at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, calculating a score of each candidate word indicating the strength of the degree of association between the input word and each candidate word of a plurality of candidate words belonging to the at least one extracted topic, and, in the calculation, specifying at least one affiliation topic in the at least one extracted topic such that each candidate word belongs to each affiliation topic and calculating the score of each candidate word from at least one score factor of each candidate word calculated for each of the at least one affiliation topic; and
f) presenting the plurality of candidate words in the order of the strength of the degree of association indicated by the score of each candidate word.
A suggestion generation method comprising:
a) performing morphological analysis on a text to divide the text into a plurality of words and obtain a morphologically analyzed text;
b) performing topic classification on the morphologically analyzed text to extract, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
c) calculating, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree indicating the degree to which each topic word characterizes the topic to which each topic word belongs and an in-topic appearance probability of each topic word in the topic to which each topic word belongs;
d) identifying at least one affiliation topic word that belongs to each topic and that includes at least a part of the at least one topic word;
e) extracting at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, calculating a score of each candidate word indicating the strength of the degree of association between the input word and each candidate word of a plurality of candidate words belonging to the at least one extracted topic, and, in the calculation, specifying at least one affiliation topic in the at least one extracted topic such that each candidate word belongs to each affiliation topic and calculating the score of each candidate word from at least one score factor of each candidate word calculated for each of the at least one affiliation topic; and
f) presenting the plurality of candidate words in the order of the strength of the degree of association indicated by the score of each candidate word.