WO2019058698A1 - Suggestion generation device, suggestion generation program and suggestion generation method - Google Patents
- Publication number
- WO2019058698A1 (application PCT/JP2018/024841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- topic
- word
- score
- candidate
- calculated
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
Definitions
- The present invention relates to a suggestion generating device, a suggestion generating program, and a suggestion generating method for presenting words related to an input word.
- A suggestion may be generated by extracting words from the user's search history and displaying the extracted words. Alternatively, text including the input word may be extracted from the text to be searched, words may be extracted from the extracted text, and the extracted words may be displayed.
- The techniques described in Patent Documents 1 and 2 are examples of the former, and the technique described in Patent Document 3 is an example of the latter.
- In one history-based technique, the search query history is stored as search query candidates, and among the stored candidates, those matching the user attribute are presented (paragraphs 0031 and 0032).
- In another history-based technique, combinations of a search query and a re-search query are extracted from a search log database, and a score indicating the degree of association between the search query and the re-search query is calculated for each extracted combination.
- A predetermined number of re-search queries are then extracted as suggestion queries, in descending order of score, from the re-search queries corresponding to the received search query (paragraphs 0026, 0030 and 0034). Further, the co-occurrence rate of the search query and the re-search query is calculated, and the combination is excluded when the co-occurrence rate is equal to or more than a predetermined value (paragraphs 0027 and 0029).
- In the text-based technique, document data files including a designated keyword are retrieved from among the document data files to be searched, and document units including the designated keyword are taken out of the retrieved files. Words are extracted from those document units, word relation data in which the extracted words are arranged in time order is created, and the word lists of the created word relation data are combined and displayed in order of document creation time (paragraph 0040).
- In the history-based techniques, search query candidates are generated from the history of search queries. Therefore, if users did not know a search query associated with a given search query and never used it in a past search, that associated query cannot be presented as a candidate.
- In the text-based technique, the word list to be displayed is generated from the group of document data files to be searched, but a word list generated in this manner does not necessarily include words associated with the keyword.
- The present invention is made to solve the above problems.
- The problem to be solved by the present invention is to provide a suggestion generating device, a suggestion generating method and a suggestion generating program that present words related to an input word with high accuracy.
- Morphological analysis is performed on the text, the text is divided into a plurality of words, and morphologically analyzed text is obtained.
- Topic classification is performed on the morphologically analyzed text, and at least one topic word belonging to each topic of a plurality of topics is extracted from the plurality of words.
- A score factor is calculated for each topic word.
- The score factor of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic appearance probability of the topic word in the topic to which it belongs.
- At least one affiliation topic word belonging to each topic is identified.
- The at least one affiliation topic word includes at least a part of the at least one extracted topic word.
- At least one extracted topic is extracted from the plurality of topics.
- The extraction is performed such that the input word belongs to each extracted topic of the at least one extracted topic.
- A score is calculated for each of a plurality of candidate words belonging to the at least one extracted topic. The score indicates the strength of the association between the input word and that candidate word.
- At least one affiliation topic is identified among the at least one extracted topic.
- The identification is performed such that each candidate word belongs to each affiliation topic of the at least one affiliation topic.
- The score of each candidate word is calculated from at least one score factor of that candidate word, calculated for each of the at least one affiliation topic.
- The plurality of candidate words are presented in order of the degree of association indicated by the score of each candidate word.
- According to the present invention, because the presented words are extracted from the text through topic classification, a suggestion generating device, a suggestion generating method and a suggestion generating program that present words related to an input word with high accuracy are provided.
- FIG. 1 is a block diagram illustrating the hardware configuration of the suggestion generating device of the first embodiment.
- The suggestion generating device 1000 illustrated in FIG. 1 is a personal computer (PC) on which a suggestion generating program 1020 is installed, and includes a central processing unit (CPU) 1040, a memory 1041, a hard disk drive 1042, and a display 1043.
- The suggestion generating device 1000 may include components other than these.
- The suggestion generating program 1020 is installed on the hard disk drive 1042. Installation may be performed by writing data read from an external storage medium 1060, such as a compact disc (CD), a digital versatile disc (DVD), or a universal serial bus (USB) memory, to the hard disk drive 1042, or by writing data received via the network 1080 to the hard disk drive 1042.
- The hard disk drive 1042 may be replaced with another type of auxiliary storage device.
- For example, the hard disk drive 1042 may be replaced by a solid state drive, a random access memory (RAM) disk, or the like.
- The hard disk drive 1042, the external storage medium 1060, the solid state drive, the RAM disk, and the like are computer-readable recording media on which the suggestion generating program 1020 is recorded.
- The suggestion generating program 1020 installed on the hard disk drive 1042 is loaded into the memory 1041 and executed by the CPU 1040, whereby the PC functions as the suggestion generating device 1000.
- FIG. 2 is a block diagram illustrating the functional configuration of the suggestion generating device of the first embodiment.
- FIG. 3 is a diagram for explaining processing on a plurality of topics performed in the suggestion generating device of the first embodiment.
- The suggestion generating device 1000 includes a removal unit 1100, a morphological analysis unit 1101, a topic classification unit 1102, a score factor calculation unit 1103, an identification unit 1104, a score calculation unit 1105, a presentation unit 1106, and a storage unit 1107.
- The suggestion generating device 1000 generates a suggestion 1208 from the text 1200 to be searched or analyzed and the input word 1201.
- The storage unit 1107 stores a forced extraction term dictionary 1300, an exclusion term dictionary 1301, a search log 1302, and a user management table 1303.
- The suggestion generating device 1000 may include components other than these.
- The input word 1201 may be a search term used in a search, or may be a word input for creating a new text.
- The suggestion 1208 is a presentation of words associated with the input word 1201.
- The removal unit 1100, the morphological analysis unit 1101, the topic classification unit 1102, the score factor calculation unit 1103, the identification unit 1104, the score calculation unit 1105, and the presentation unit 1106 are implemented by causing the PC to execute the suggestion generating program 1020.
- The storage unit 1107 is implemented by at least one of the memory 1041 and the hard disk drive 1042.
- All or part of the processing performed by the CPU 1040 may be performed by a processing device other than the CPU 1040.
- For example, all or part of the processing performed by the CPU 1040 may be performed by a graphics processing unit (GPU).
- All or part of the processing performed by the CPU 1040 may be performed by hardware that does not execute a program.
- The removal unit 1100 removes stop words from the pre-removal text 1200 and obtains the post-removal text 1202, from which the stop words have been removed. If stop words need not be removed, for example when the text 1200 to be searched or analyzed contains none, the removal unit 1100 may be omitted.
- The morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202 to divide it into a plurality of words, and obtains a morphologically analyzed text 1203 including the plurality of words obtained by the division.
- The morphological analysis unit 1101 uses the forced extraction term dictionary 1300 in the morphological analysis of the post-removal text 1202. Use of the forced extraction term dictionary 1300 may be omitted.
- The topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts at least one topic word 1204 belonging to each topic of a plurality of topics from the plurality of words included in the morphologically analyzed text 1203.
- The score factor calculation unit 1103 calculates a score factor 1205 of each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102, with respect to the topic to which that topic word belongs.
- The score factor 1205 of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic appearance probability of the topic word in the topic to which it belongs.
- The score factor 1205 of each topic word can serve as a factor of the suggestion score of a candidate word described later.
- The identification unit 1104 identifies at least one affiliation topic word 1206 belonging to each topic of the plurality of topics 1250, as illustrated in FIG. The at least one affiliation topic word 1206 belonging to each topic includes at least a part of the at least one topic word 1204 belonging to that topic extracted by the topic classification unit 1102.
- In identifying the at least one affiliation topic word 1206 belonging to each topic, the identification unit 1104 uses the search log 1302 and the exclusion term dictionary 1301, as illustrated in FIG. As a result, the at least one affiliation topic word 1206 belonging to each topic includes at least a part of the at least one topic word 1204 belonging to that topic, together with unextracted words that are not included in the at least one topic word 1204 belonging to that topic.
- The use of at least one of the search log 1302 and the exclusion term dictionary 1301 may be omitted.
- When use of the search log 1302 is omitted, the at least one affiliation topic word 1206 belonging to each topic does not include unextracted words that are not included in the at least one topic word 1204 belonging to that topic.
- When use of the exclusion term dictionary 1301 is omitted, the at least one affiliation topic word 1206 belonging to each topic includes all of the at least one topic word 1204 belonging to that topic.
- The score calculation unit 1105 extracts, from the plurality of topics 1250, at least one extracted topic 1251 to which the input word 1201 belongs, as illustrated in FIG. 3. The extraction is performed such that the input word 1201 belongs to each extracted topic of the at least one extracted topic 1251. The words belonging to the at least one extracted topic 1251 become a plurality of candidate words 1260 that may be presented when the suggestion 1208 is generated.
- The score calculation unit 1105 calculates a suggestion score of each candidate word 1261 of the plurality of candidate words 1260, indicating the strength of the association between the input word 1201 and that candidate word 1261.
- In calculating the suggestion score of each candidate word 1261, the score calculation unit 1105 identifies, among the at least one extracted topic 1251, at least one affiliation topic 1252 to which the candidate word 1261 belongs. The identification is performed such that the candidate word 1261 belongs to each affiliation topic of the at least one affiliation topic 1252.
- The score calculation unit 1105 calculates the suggestion score of each candidate word 1261 from at least one score factor of that candidate word 1261, calculated for each of the at least one affiliation topic 1252.
- The score calculation unit 1105 creates a suggestion word list 1207 by sorting the plurality of candidate words 1260 in order of the degree of association indicated by the suggestion score of each candidate word 1261, as illustrated in FIG. 2.
- In creating the suggestion word list 1207, the score calculation unit 1105 uses the search log 1302 and the user management table 1303, and creates, for each user group, a suggestion word list 1207 specific to that user group.
- The presentation unit 1106 generates a suggestion 1208 according to the suggestion word list 1207.
- In the suggestion 1208, the plurality of candidate words 1260 included in the suggestion word list 1207 are presented in order of the degree of association indicated by the suggestion score of each candidate word 1261.
- According to the suggestion generating device 1000, the suggestion 1208 is generated from the text 1200 to be searched or analyzed and the input word 1201. Therefore, as long as the text 1200 exists, the suggestion 1208 is generated automatically, and words associated with the input word 1201 are presented automatically, even when a search history such as the search log 1302 does not exist or is insufficient. Further, because the presented words are not simply extracted from the text 1200 but are extracted from it through topic classification, the suggestion 1208 is generated with high accuracy.
- FIG. 4 is a flowchart illustrating the flow of processing performed by the suggestion generating device of the first embodiment.
- FIG. 5, FIG. 6 and FIG. 7 are diagrams illustrating an example of the transition of data during suggestion generation.
- In step S101 illustrated in FIG. 4, the removal unit 1100 removes stop words from the text 1200 to be searched or analyzed, and obtains the post-removal text 1202.
- The text 1200 to be searched or analyzed is, for example, text created in the past.
- The stop words to be removed are words that would be unnecessary noise in the subsequent analysis.
- The words removed as stop words are, for example, identification codes that do not represent the specific content of the text 1200. Strings commonly included in various URLs, such as "http://", are also removed as stop words.
- In the illustrated example, the text 1200 includes, among others, a text element 1404 "process ratio at the time of prediction formula registration ..." and a text element 1405 "you can input the process ratio to the second decimal place ...", and text elements 1400 and 1403 have been removed as stop words.
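The stop-word removal of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the function name `remove_stop_words` and the identification-code pattern are hypothetical, and only the URL prefix "http://" comes from the description.

```python
import re

# Patterns treated as stop words in this sketch.
# The URL-prefix pattern follows the description; the ID pattern
# (e.g. "REQ-1024") is a hypothetical stand-in for "identification codes".
STOP_PATTERNS = [
    re.compile(r"https?://"),
    re.compile(r"\b[A-Z]{2,}-\d+\b"),
]


def remove_stop_words(text: str) -> str:
    """Return text with every stop-word pattern removed and whitespace normalized."""
    for pattern in STOP_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = remove_stop_words("see http://example.com REQ-1024 for details")
```

Keeping the patterns in a list makes it easy to add further noise patterns without touching the removal logic.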
- In step S102 subsequent to step S101 illustrated in FIG. 4, the morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202, divides the post-removal text 1202 into a plurality of words, and obtains a morphologically analyzed text 1203 including the plurality of words obtained by the division.
- In the illustrated example, the text element 1401 is divided into a plurality of words 1411, "development process" and "customize"; the text element 1402 is divided into a plurality of words such as "master data", "user", "project" and "product"; and the text element 1404 is divided into a plurality of words 1414 such as "prediction equation", "registration", "time", "no", "step", "proportion" and "no".
- The text element 1405 is divided into a plurality of words such as "process", "rate", "no", "input", "ha", "decimal point", "second place", "up", "input" and "possible".
- The morphological analysis unit 1101 uses the forced extraction term dictionary 1300, in which technical terms that are compound words consisting of two or more morphemes are registered, to forcibly extract the registered technical terms from the post-removal text 1202. It divides the post-removal text 1202 into a plurality of words so that the plurality of words included in the morphologically analyzed text 1203 include the extracted technical terms.
- As a result, technical terms that are compound words are extracted correctly without being divided.
- In the illustrated example, the technical term 1416 "master data" and the technical term 1417 "prediction formula" are forcibly extracted.
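The effect of the forced extraction term dictionary can be sketched as below. This is a simplified illustration, assuming whitespace-separated text in place of a real morphological analyzer; the function name `tokenize_with_forced_terms` and the sample sentence are hypothetical.

```python
def tokenize_with_forced_terms(text, forced_terms):
    """Split text on whitespace while keeping registered compound terms whole.

    forced_terms plays the role of the forced extraction term dictionary;
    whitespace splitting stands in for morphological analysis.
    """
    tokens = text.split()
    # Try longer compound terms first so the longest match wins.
    term_seqs = sorted(
        (tuple(t.split()) for t in forced_terms if t.split()),
        key=len,
        reverse=True,
    )
    words, i = [], 0
    while i < len(tokens):
        for seq in term_seqs:
            if tuple(tokens[i:i + len(seq)]) == seq:
                words.append(" ".join(seq))  # emit the compound term undivided
                i += len(seq)
                break
        else:
            words.append(tokens[i])  # no compound term starts here
            i += 1
    return words


words = tokenize_with_forced_terms(
    "register master data for each project",
    {"master data", "prediction formula"},
)
```

Without the dictionary, "master data" would be split into "master" and "data"; with it, the compound term survives as one word.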
- In step S103, the topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts, from the plurality of words included in it, at least one topic word 1204 belonging to each topic of a plurality of topics 1250.
- Topic classification estimates the topics handled in the input text and classifies the sentences constituting the input text into a plurality of topics.
- A topic here means a subject, a field, or the like.
- In the illustrated example, a plurality of topic words 1420 such as "application", "version", "development" and "specification" belonging to one numbered topic are extracted; a plurality of topic words 1425 such as "inquire", "receive", "answer" and "description" belonging to the topic assigned topic No. "5" are extracted; and a plurality of topic words 1426, "customer", "hearing", "main request" and "sub request", belonging to the topic assigned topic No. "6" are extracted.
- In step S104, the score factor calculation unit 1103 calculates, for each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102, a score factor with respect to the topic to which that topic word belongs.
- The score factor of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic appearance probability of the topic word in the topic to which it belongs.
- In the illustrated example, the feature degree 1440 of "4.675" and the in-topic appearance probability 1450 of "11.21%" are calculated for the topic word 1430 "app"; the feature degree 1441 of "4.435" and the in-topic appearance probability 1451 of "5.00%" are calculated for the topic word 1431 "debug"; a feature degree 1442 of "3.599" and an in-topic appearance probability 1452 of "4.30%" are calculated; and the feature degree 1443 of "3.199" and the in-topic appearance probability 1453 are calculated for the topic word 1433 "language". Values are likewise calculated for the topic word "version".
- The in-topic appearance probability of each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 is an index indicating how readily the topic word appears in the topic to which it belongs, and is obtained in the topic classification. The feature degree of each topic word is determined so that it increases as the in-topic appearance probability increases and as the appearance frequency of the topic word in the text 1200 to be searched or analyzed decreases.
- The feature degree of each topic word is obtained by dividing the in-topic appearance probability of the topic word by the appearance frequency of the topic word in the text, as shown in equation (1). Dividing by the appearance frequency in the text makes it less likely that words which belong to many topics and only weakly characterize each of them will be presented.
- The appearance frequency of each topic word in the text is obtained by dividing the number of appearances of the topic word in the text by the number of words in the entire text, as shown in equation (2).
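The two relations described as equations (1) and (2) can be illustrated with a short sketch. The function names and the toy corpus are hypothetical; the in-topic appearance probability is assumed to be supplied by whatever topic classifier is used, which the description does not specify.

```python
from collections import Counter


def appearance_frequencies(words):
    """Equation (2) as described: occurrences of a word / total number of words."""
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def feature_degree(in_topic_probability, appearance_frequency):
    """Equation (1) as described: in-topic appearance probability / appearance frequency."""
    return in_topic_probability / appearance_frequency


# Toy corpus of 10 words: "app" appears twice, so its frequency is 0.2.
corpus = ["app"] * 2 + ["debug"] + ["the"] * 7
freq = appearance_frequencies(corpus)

# Assumed in-topic appearance probability of 0.5 for "app" (illustrative value).
app_feature = feature_degree(0.5, freq["app"])
# A word that is common across the whole text scores lower for the same probability.
the_feature = feature_degree(0.5, freq["the"])
```

The comparison between `app_feature` and `the_feature` shows the intended effect: a word spread across the whole text is penalized relative to a word concentrated in one topic.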
- In step S105 following step S104 shown in FIG. 4, it is determined whether a search log 1302 recording the words used in past searches exists. If it is determined that the search log 1302 exists, unextracted words are added in step S106, the addition score factor is calculated in step S107, and exclusion terms are deleted in step S108. If it is determined that the search log 1302 does not exist, the process proceeds to the deletion of exclusion terms in step S108 illustrated in FIG. 4.
- In step S106, the identification unit 1104 identifies, from the search log 1302, unextracted words that have been used in past searches more than a set number of times but are not included in the at least one topic word 1204 extracted by the topic classification unit 1102.
- The identified unextracted words are added to the at least one topic word 1204 extracted by the topic classification unit 1102, and an updated at least one topic word 1209 is obtained.
- As a result, the at least one affiliation topic word 1206 identified by the identification unit 1104 includes the unextracted words.
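Step S106 can be sketched as a frequency filter over the search log. This is an illustration only: the function name `find_unextracted_words` and the sample data are hypothetical, and the description does not say to which topic an unextracted word is assigned, so the sketch stops at identifying the words.

```python
from collections import Counter


def find_unextracted_words(searched_words, topic_words, set_count):
    """Return words searched more than set_count times that are not topic words."""
    usage = Counter(searched_words)
    return {
        word
        for word, count in usage.items()
        if count > set_count and word not in topic_words
    }


searched = ["app", "invoice", "invoice", "invoice", "budget", "app"]
extracted_topic_words = {"app", "debug", "version"}
unextracted = find_unextracted_words(searched, extracted_topic_words, set_count=2)
```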
- FIG. 8 is a diagram for explaining a calculation algorithm of the suggestion score of each candidate word for each user group in the suggestion generating device of the first embodiment.
- FIG. 9 is a diagram illustrating an example of a search log stored in the suggestion generating device of the first embodiment.
- FIG. 10 is a diagram illustrating an example of a user management table stored in the suggestion generating device of the first embodiment.
- FIG. 11 is a diagram illustrating an example of an addition score factor table calculated in the suggestion generating device of the first embodiment.
- In the search log 1302, information specifying the user who made each search and the words used in each search are recorded in association with each other.
- In the illustrated example, a user identifier (ID) 1500 of "001", a search word 1501 of "application", and a search time 1502 of "2016-12-26 16:55:22.916" are recorded in association with one another.
- The user ID 1500 is information for identifying the user who performed each search.
- The search word 1501 is a word used in each search.
- The user management table 1303 stores information identifying a user and information identifying the user group to which the user belongs, in association with each other.
- In the illustrated example, a user ID 1510 of "001", a name 1511 of "XXXX", and a group (department) ID 1512 of "G001" are stored in association with one another, and a group (department) ID 1520 of "G001" and a name 1521 of "user window" are stored in association with each other.
- The user ID 1510 and the name 1511 are information for identifying a user.
- The group (department) ID 1520 and the name 1521 are information for identifying the user group to which the user belongs.
- By using the search log 1302 and the user management table 1303, it is possible to identify, for each user group, the words used in past searches by users belonging to that user group.
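Combining the search log with the user management table amounts to a join on the user ID. The sketch below assumes simple in-memory structures; the function name `words_used_by_group`, the tuple layout and the second log record are hypothetical, while the first record reuses the example values given above.

```python
from collections import defaultdict


def words_used_by_group(search_log, user_to_group):
    """Group the searched words by the user group of the searching user.

    search_log: iterable of (user_id, search_word, search_time) records.
    user_to_group: mapping of user_id -> group (department) ID.
    """
    used = defaultdict(list)
    for user_id, search_word, _search_time in search_log:
        group_id = user_to_group.get(user_id)
        if group_id is not None:  # skip users missing from the table
            used[group_id].append(search_word)
    return dict(used)


search_log = [
    ("001", "application", "2016-12-26 16:55:22.916"),
    ("002", "version", "2016-12-27 09:10:00.000"),  # hypothetical record
    ("999", "orphan", "2016-12-27 09:11:00.000"),   # user not in the table
]
user_to_group = {"001": "G001", "002": "G001"}
group_words = words_used_by_group(search_log, user_to_group)
```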
- In step S107 shown in FIG. 4, the score factor calculation unit 1103 identifies, for each user group, the used words that users belonging to that user group used in past searches, from the search log 1302 and the user management table 1303, and calculates the addition score factor 1530 of the topic to which each identified used word belongs.
- In the illustrated example, an addition score factor 1542 of "10" is calculated for the topic to which the topic ID 1541 of "corpus1_0_0" is assigned.
- The score factor calculation unit 1103 calculates, for each user group, the score factor 1205 of each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102, by adding the addition score factor 1530 of the topic to which the topic word belongs to the pre-addition score factor 1531 of that topic word calculated in step S104.
- The score factor 1205 of each topic word calculated in this way also indicates at least one of the feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and the in-topic appearance probability of the topic word in the topic to which it belongs, and is a score factor specific to each user group.
- The score factor 1205 of each topic word specific to each user group makes it possible to generate a suggestion 1208 suitable for each user group.
- The score factor 1205 of each topic word calculated in step S107 is used to calculate the suggestion score 1532 of each candidate word 1261.
- Step S107 may be omitted, and the score factor of each topic word calculated in step S104 may be used to calculate the suggestion score 1532 of each candidate word 1261.
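The per-group adjustment of step S107 can be sketched as follows. The description states that an addition score factor is added to the pre-addition score factor but does not state how the addition score factor itself is derived, so this sketch assumes it is a weight multiplied by the number of past-search uses of words in the topic. That rule, the function names and the sample values other than 4.675 and 4.435 are assumptions.

```python
def addition_score_factors(used_words, topic_words_by_topic, weight=1.0):
    """Assumed rule: addition factor = weight * number of group searches hitting the topic."""
    factors = {}
    for topic_id, topic_words in topic_words_by_topic.items():
        hits = sum(1 for word in used_words if word in topic_words)
        factors[topic_id] = weight * hits
    return factors


def group_score_factors(pre_addition, addition):
    """Add each topic's addition factor to the pre-addition factor of its topic words.

    pre_addition: mapping of (topic_id, word) -> score factor from step S104.
    addition: mapping of topic_id -> addition score factor for one user group.
    """
    return {
        (topic_id, word): factor + addition.get(topic_id, 0.0)
        for (topic_id, word), factor in pre_addition.items()
    }


topic_words_by_topic = {"t0": {"app", "debug"}, "t6": {"customer"}}
used_by_group = ["app", "app", "customer"]
addition = addition_score_factors(used_by_group, topic_words_by_topic)
pre_addition = {("t0", "app"): 4.675, ("t0", "debug"): 4.435}
adjusted = group_score_factors(pre_addition, addition)
```

Because the addition factor is attached to the topic rather than to a single word, every topic word in a topic the group searches often is boosted together.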
- In step S108 illustrated in FIG. 4, the identification unit 1104 uses the exclusion term dictionary 1301, in which exclusion terms unnecessary for search or analysis are registered, to delete the registered exclusion terms from the at least one topic word 1209, as illustrated in FIG. 7, and obtains at least one affiliation topic word 1206.
- As a result, the at least one affiliation topic word 1206 identified by the identification unit 1104 does not include the exclusion terms.
- Next, the score calculation unit 1105 extracts, from the plurality of topics 1250, at least one extracted topic 1251 to which the input word 1201 belongs, as illustrated in FIG. 3. The extraction is performed such that the input word 1201 belongs to each extracted topic of the at least one extracted topic 1251.
- The score calculation unit 1105 creates a suggestion candidate list 1210 including a plurality of candidate words 1260 belonging to the at least one extracted topic 1251, as illustrated in FIG. 7.
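Topic extraction and candidate-list creation can be sketched as two small set operations. The function names and sample topics are hypothetical, and leaving the input word itself out of the candidates is an assumption made here for illustration.

```python
def extract_topics(input_word, topic_words_by_topic):
    """Return the IDs of the topics to which the input word belongs."""
    return [
        topic_id
        for topic_id, topic_words in topic_words_by_topic.items()
        if input_word in topic_words
    ]


def build_candidate_list(input_word, topic_words_by_topic):
    """Collect every word belonging to an extracted topic as a candidate word."""
    candidates = set()
    for topic_id in extract_topics(input_word, topic_words_by_topic):
        candidates.update(topic_words_by_topic[topic_id])
    candidates.discard(input_word)  # assumption: do not suggest the input word itself
    return candidates


topics = {
    "k": {"application", "version", "development"},
    "l": {"application", "debug"},
    "n": {"customer", "hearing"},
}
extracted = extract_topics("application", topics)
candidates = build_candidate_list("application", topics)
```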
- In step S110, the score calculation unit 1105 calculates a suggestion score 1532 of each candidate word 1261 of the plurality of candidate words 1260 included in the suggestion candidate list 1210, indicating the degree of association between the input word 1201 and that candidate word 1261.
- In calculating the suggestion score 1532 of each candidate word 1261, the score calculation unit 1105 identifies, among the at least one extracted topic 1251, at least one affiliation topic 1252 to which the candidate word 1261 belongs. The identification is performed such that the candidate word 1261 belongs to each affiliation topic of the at least one affiliation topic 1252.
- The score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261 from at least one score factor 1205 of that candidate word 1261, calculated for each of the at least one affiliation topic 1252.
- The score calculation unit 1105 sorts the plurality of candidate words 1260 included in the suggestion candidate list 1210 in order of the degree of association indicated by the suggestion score 1532 of each candidate word 1261, and thereby creates a suggestion word list 1207.
- The score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261 from at least one score factor 1205 of that candidate word 1261 calculated for the user group to which the user who input the input word 1201 belongs, and creates a suggestion word list 1207 specific to that user group.
- FIG. 12 is a diagram illustrating an example of a suggestion word list created in the suggestion generating device of the first embodiment.
- In the suggestion word list 1207, information specifying topics, candidate words and suggestion scores are stored in association with one another.
- In the illustrated example, a topic ID 1550 of "corpus 0_1_1", a topic word 1551 of "app", and a suggestion score 1552 of "4.675" are stored in association with one another.
- The topic ID 1550 is information for specifying a topic.
- The topic word 1551 is a candidate word.
- In step S111 following step S110 illustrated in FIG. 4, the presentation unit 1106 generates a suggestion 1208 according to the suggestion word list 1207, as illustrated in FIG. 7.
- In the suggestion 1208, the plurality of candidate words 1260 included in the suggestion word list 1207 are presented in order of the degree of association indicated by the suggestion score 1532 of each candidate word 1261.
- FIG. 13 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the first calculation method.
- In the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 such that the input word 1201 belongs to each extracted topic.
- In the illustrated calculation example, at least one extracted topic 1610, consisting of topics k, l and m, is extracted such that the input word 1600 "application" belongs to each extracted topic.
- The score calculation unit 1105 identifies at least one affiliation topic 1252 among the at least one extracted topic 1251 such that the candidate word 1261 belongs to each affiliation topic.
- In the illustrated calculation example, at least one affiliation topic 1611, consisting of topics k and m, is identified such that the candidate word 1601 "version" belongs to each affiliation topic.
- For each affiliation topic, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 calculated for that affiliation topic and the score factor 1205 of the candidate word 1261 calculated for that affiliation topic.
- The score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, indicating the strength of the association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one product calculated for the at least one affiliation topic 1252.
- Instead of a suggestion score 1627 of the candidate word 1601 that equals the maximum value 1626, a suggestion score 1627 that includes the maximum value 1626 as a factor may be calculated. For example, a suggestion score 1627 of the candidate word 1601 that equals a constant multiple of the maximum value 1626 may be calculated.
- The suggestion score Score(word) of the candidate word word is calculated by equation (3), using the feature degree feature_{keyword,t} of the input word keyword and the feature degree feature_{word,t} of the candidate word word, each calculated for a topic t of the at least one affiliation topic T(keyword, word): Score(word) = \max_{t \in T(keyword, word)} ( feature_{keyword,t} \times feature_{word,t} ).
- According to the first calculation method, a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, is hardly reflected in it.
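The first calculation method, followed by the sorting that produces the suggestion word list, can be sketched as below. The topic names k, l and m and the words "application" and "version" follow the calculation example; the feature-degree values, the extra topic n, and the function names are hypothetical.

```python
def suggestion_score_max(keyword, word, feature):
    """First method: maximum over shared topics of feature[keyword] * feature[word].

    feature maps topic -> {word: feature degree}. Returns None when the two
    words share no topic.
    """
    products = [
        degrees[keyword] * degrees[word]
        for degrees in feature.values()
        if keyword in degrees and word in degrees
    ]
    return max(products) if products else None


def rank_candidates(keyword, feature):
    """Score every word sharing a topic with keyword and sort by descending score."""
    scores = {}
    for degrees in feature.values():
        if keyword not in degrees:
            continue  # not an extracted topic
        for word in degrees:
            if word != keyword and word not in scores:
                scores[word] = suggestion_score_max(keyword, word, feature)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


feature = {
    "k": {"application": 2.0, "version": 3.0},
    "l": {"application": 4.0, "debug": 1.0},
    "m": {"application": 1.5, "version": 5.0},
    "n": {"customer": 2.5},
}
score = suggestion_score_max("application", "version", feature)
ranking = rank_candidates("application", feature)
```

Here "version" shares topics k and m with the input word, giving products 6.0 and 7.5, and the larger one becomes its score.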
- FIG. 14 is a diagram for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the second calculation method.
- The score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 as illustrated in FIG. It identifies at least one affiliation topic 1252 among the at least one extracted topic 1251 and, for each affiliation topic, calculates the product of the score factor 1205 of the input word 1201 and the score factor 1205 of the candidate word 1261, each calculated for that affiliation topic.
- The score calculation unit 1105 calculates, from the product of the at least one product calculated for the at least one affiliation topic 1252, a suggestion score 1532 of the candidate word 1261 that indicates the strength of association between the input word 1201 and the candidate word 1261.
- Instead of a suggestion score 1629 of the candidate word 1601 equal to the product 1628, a suggestion score 1629 that includes the product 1628 as a factor may be calculated. For example, a suggestion score 1629 equal to a constant multiple of the product 1628 may be calculated.
- The suggestion score Score(word) of the candidate word is calculated by Equation (4) using the at least one affiliation topic T(keyword, word), the feature degree $feature_{keyword,t}$ of the input word keyword calculated for topic t, and the feature degree $feature_{word,t}$ of the candidate word word calculated for topic t.
- Both a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, are reflected in the suggestion score 1532 of the candidate word 1261.
- FIG. 15 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the third calculation method.
- The score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 as illustrated in FIG. At least one affiliation topic 1252 is identified among the at least one extracted topic 1251.
- For each affiliation topic, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 and the score factor 1205 of the candidate word 1261, each calculated for that affiliation topic. In the calculation example shown in FIG.
- The score calculation unit 1105 calculates, from the maximum value of the at least one product calculated for the at least one affiliation topic 1252, a suggestion score 1532 of the candidate word 1261 that indicates the strength of association between the input word 1201 and the candidate word 1261.
- The maximum value 1634 is set as the suggestion score 1635 of the candidate word 1601.
- A suggestion score 1635 of the candidate word 1601 that includes the maximum value 1634 as a factor may be calculated. For example, a suggestion score 1635 equal to a constant multiple of the maximum value 1634 may be calculated.
- The suggestion score Score(word) of the candidate word is calculated by Equation (5) using the at least one affiliation topic T(keyword, word), the feature degree $feature_{keyword,t}$ of the input word keyword calculated for topic t, and the in-topic occurrence probability $probability_{word,t}$ of the candidate word word calculated for topic t.
- A large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a large in-topic occurrence probability, which indicates that a word occurs with high probability in the topic to which it belongs, are readily reflected in the suggestion score 1532 of the candidate word 1261. A small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, and a small in-topic occurrence probability, which indicates that a word occurs with low probability in the topic to which it belongs, are hardly reflected in the suggestion score 1532 of the candidate word 1261.
- FIG. 16 is a view for explaining a calculation example of the suggestion score of a candidate word in the suggestion generating device of the first embodiment according to the fourth calculation method.
- The score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 as illustrated in FIG. At least one affiliation topic 1252 is identified among the at least one extracted topic 1251.
- The score calculation unit 1105 calculates, from the maximum value of the at least one score factor 1205 of the candidate word 1261 calculated for the at least one affiliation topic 1252, a suggestion score 1532 of the candidate word 1261 that indicates the strength of association between the input word 1201 and the candidate word 1261.
- Of the in-topic occurrence probability 1636 of “0.025” of the candidate word 1601 “version” calculated for topic k and the in-topic occurrence probability 1637 of “0.350” of the candidate word 1601 “version” calculated for topic m, the maximum value 1638 of “0.350” is set as the suggestion score 1639 of the candidate word 1601.
- Instead of a suggestion score 1639 of the candidate word 1601 equal to the maximum value 1638, a suggestion score 1639 that includes the maximum value 1638 as a factor may be calculated. For example, a suggestion score 1639 equal to a constant multiple of the maximum value 1638 may be calculated.
- The suggestion score Score(word) of the candidate word is calculated by Equation (6) using the at least one affiliation topic T(keyword, word) and the in-topic occurrence probability $probability_{word,t}$ of the candidate word word calculated for topic t.
- A large in-topic occurrence probability, which indicates that a word occurs with high probability in the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small in-topic occurrence probability, which indicates that a word occurs with low probability in the topic to which it belongs, is hardly reflected in the suggestion score 1532 of the candidate word 1261.
- FIG. 17 is a view for explaining another example of a calculation algorithm of the suggestion score of each candidate word for each user group in the suggestion generating device of the first embodiment.
- The score calculation unit 1105 calculates, from the score factor 1205 of each topic word, a pre-addition suggestion score 1700 that indicates the strength of association between the input word 1201 and each candidate word 1261.
- For each user group, the score calculation unit 1105 identifies, from the search log 1302 and the user management table 1303, the used words that users belonging to the user group used in past searches, and gives an addition score to the used words.
- The suggestion score 1532 of each candidate word 1261 is calculated by adding the addition score 1701 of the candidate word 1261 to the pre-addition suggestion score 1700 of the candidate word 1261.
- FIG. 18 is a schematic view illustrating an example of a screen displayed in the suggestion generating device of the first embodiment.
- the screen 1800 illustrated in FIG. 18 is displayed on the display 1043.
- the screen 1800 includes a text box 1820 for receiving an input of an input word 1201 used for a search, a button 1821 for receiving an instruction to start a search, and an area 1822 for displaying a suggestion 1208.
- Each of the text box 1820 and the button 1821 may be replaced with another type of graphical user interface (GUI) component.
- The plurality of candidate words 1830 are displayed simultaneously in the area 1822 and are arranged in an order that matches the order of the strength of association indicated by the suggestion score of each candidate word 1831. Alternatively, only one candidate word may be displayed at a time, and the displayed candidate word may be switched over time in an order that matches the order of the strength of association indicated by the suggestion score of each candidate word 1831.
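The four calculation methods described in the excerpts above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function names, and the feature-degree values are assumptions made for the example, and only the in-topic occurrence probabilities 0.025 and 0.350 and the topic memberships of "application" and "version" come from the text.

```python
from math import prod

# Hypothetical layout: topic_words[t] is the set of words belonging to topic t;
# feature[t][w] and probability[t][w] are score factors of word w for topic t.

def affiliation_topics(keyword, word, topic_words):
    """Topics T(keyword, word) that contain both the input word and the candidate."""
    return [t for t, words in topic_words.items() if keyword in words and word in words]

def score_method1(keyword, word, topic_words, feature):
    """First method: maximum over topics of feature(keyword) * feature(word)."""
    ts = affiliation_topics(keyword, word, topic_words)
    return max((feature[t][keyword] * feature[t][word] for t in ts), default=0.0)

def score_method2(keyword, word, topic_words, feature):
    """Second method: product over topics of feature(keyword) * feature(word)."""
    ts = affiliation_topics(keyword, word, topic_words)
    return prod(feature[t][keyword] * feature[t][word] for t in ts) if ts else 0.0

def score_method3(keyword, word, topic_words, feature, probability):
    """Third method: maximum over topics of feature(keyword) * probability(word)."""
    ts = affiliation_topics(keyword, word, topic_words)
    return max((feature[t][keyword] * probability[t][word] for t in ts), default=0.0)

def score_method4(keyword, word, topic_words, probability):
    """Fourth method: maximum over topics of probability(word)."""
    ts = affiliation_topics(keyword, word, topic_words)
    return max((probability[t][word] for t in ts), default=0.0)

# "application" belongs to topics k, l and m; "version" belongs to k and m.
topic_words = {
    "k": {"application", "version"},
    "l": {"application"},
    "m": {"application", "version"},
}
# In-topic occurrence probabilities of "version" taken from the example in the text.
probability = {"k": {"version": 0.025}, "m": {"version": 0.350}}
# Feature degrees below are made-up values for illustration only.
feature = {
    "k": {"application": 0.5, "version": 0.1},
    "l": {"application": 0.4},
    "m": {"application": 0.2, "version": 0.3},
}

print(score_method4("application", "version", topic_words, probability))  # 0.35
```

Returning 0.0 when no affiliation topic exists is a choice made for this sketch; by construction every candidate word belongs to at least one extracted topic, so that branch should not normally be reached.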
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention reliably presents a word related to an input word with high accuracy. The present invention discloses suggestion generation, wherein topic classification is performed on a morpheme analysis-completed text, and a topic word, which belongs to each topic, is extracted. A feature degree or the like of each topic word is calculated. A belonging topic word, which belongs to each topic, is specified. A topic to be extracted is extracted so that an input word belongs to each topic to be extracted. A score of each candidate word is calculated which indicates the intensity of relevance between the input word and each candidate word of a plurality of candidate words, which belong to the topic to be extracted. The belonging topic is specified so that each candidate word belongs to each belonging topic. A score of each candidate word is calculated from the feature degree or the like of each candidate word which has been calculated for the belonging topic. The plurality of candidate words are presented in order of intensity of relevance that is represented by the score of each candidate word.
Description
The present invention relates to a suggestion generating device, a suggestion generating program, and a suggestion generating method for presenting words related to an input word.
When text is created or a search is performed on text, a suggestion that presents words related to an input word is generated.
In some cases a suggestion is generated by extracting words from the user's search history and displaying the extracted words. In other cases it is generated by extracting text that contains the input word from the text to be searched, further extracting words from the extracted text, and displaying the extracted words. The techniques described in Patent Documents 1 and 2 are examples of the former, and the technique described in Patent Document 3 is an example of the latter.
In the technique described in Patent Document 1, the history of search queries is stored as search query candidates, and among the stored search query candidates, those matching the user attributes are presented (paragraphs 0031 and 0032).
In the technique described in Patent Document 2, combinations of a search query and a re-search query are extracted from a search log database, a score indicating the degree of association between the search query and the re-search query is calculated for each extracted combination, and a predetermined number of re-search queries are extracted as suggestion queries, in descending order of score, from the re-search queries corresponding to the received search query (paragraphs 0026, 0030 and 0034). In addition, the co-occurrence rate of the search query and the re-search query is calculated, and a combination is excluded when its co-occurrence rate is equal to or greater than a predetermined value (paragraphs 0027 and 0029).
In the technique described in Patent Document 3, document data files containing a designated keyword are retrieved from a group of document data files to be searched, document units containing the designated keyword are taken out of the retrieved document data files, words are extracted, word relation data in which the extracted words are arranged in time order is created, and the word lists of the created word relation data are combined and displayed in order of document creation time (paragraph 0040).
However, conventional suggestion generation has the problem that words related to the input word sometimes cannot be presented.
For example, in the technique described in Patent Document 1, search query candidates are generated from the search query history. Therefore, if the user does not know a search query related to a given search query and has not used that related query in past searches, search query candidates related to the given search query cannot be presented.
Similarly, in the technique described in Patent Document 2, suggestion queries are generated from the search log database. Therefore, if the user does not know a search query related to a given search query and has not used that related query in past searches, suggestion queries related to the given search query cannot be presented.
Further, in the technique described in Patent Document 3, the displayed word list is generated from the group of document data files to be searched, but a word list generated in this way does not necessarily contain words related to the keyword.
The present invention has been made to solve the above problem. The problem to be solved by the present invention is to provide a suggestion generating device, a suggestion generating method, and a suggestion generating program that present words related to an input word with high accuracy.
In generating a suggestion, morphological analysis is performed on text, the text is divided into a plurality of words, and morphologically analyzed text is obtained.
Topic classification is performed on the morphologically analyzed text, and at least one topic word belonging to each of a plurality of topics is extracted from the plurality of words.
For the topic to which each of the at least one topic word belongs, a score factor of each topic word is calculated. The score factor of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic occurrence probability of the topic word in the topic to which it belongs.
At least one affiliated topic word belonging to each topic is identified. The at least one affiliated topic word includes at least part of the extracted at least one topic word.
At least one extracted topic is extracted from the plurality of topics. The extraction is performed such that the input word belongs to each of the at least one extracted topic.
A score of each candidate word is calculated that indicates the strength of association between the input word and each of a plurality of candidate words belonging to the at least one extracted topic.
In calculating the score of each candidate word, at least one affiliation topic is identified among the at least one extracted topic. The identification is performed such that the candidate word belongs to each of the at least one affiliation topic.
The score of each candidate word is calculated from the at least one score factor of the candidate word calculated for the at least one affiliation topic.
The plurality of candidate words are presented in order of the strength of association indicated by the score of each candidate word.
According to the present invention, the presented words are extracted from text through topic classification, so a suggestion generating device, a suggestion generating method, and a suggestion generating program that present words related to an input word with high accuracy are provided.
The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
1 Hardware Configuration
FIG. 1 is a block diagram illustrating the hardware configuration of the suggestion generating device of the first embodiment.
The suggestion generating device 1000 illustrated in FIG. 1 is a personal computer (PC) on which a suggestion generating program 1020 is installed, and includes a central processing unit (CPU) 1040, a memory 1041, a hard disk drive 1042, and a display 1043. The suggestion generating device 1000 may include components other than these.
In the suggestion generating device 1000, the suggestion generating program 1020 is installed on the hard disk drive 1042. The installation may be performed by writing, to the hard disk drive 1042, data read from an external storage medium 1060 such as a compact disc (CD), a digital versatile disc (DVD), or a universal serial bus (USB) memory, or by writing, to the hard disk drive 1042, data received via a network 1080. The hard disk drive 1042 may be replaced with another type of auxiliary storage device, for example a solid state drive or a random access memory (RAM) disk. The hard disk drive 1042, the external storage medium 1060, a solid state drive, a RAM disk, and the like are computer-readable recording media on which the suggestion generating program 1020 is recorded.
In the suggestion generating device 1000, the suggestion generating program 1020 installed on the hard disk drive 1042 is loaded into the memory 1041 and executed by the CPU 1040, whereby the PC executes the suggestion generating program 1020 and functions as the suggestion generating device 1000.
2 Functional Configuration
FIG. 2 is a block diagram illustrating the functional configuration of the suggestion generating device of the first embodiment. FIG. 3 is a diagram for explaining the processing performed on a plurality of topics in the suggestion generating device of the first embodiment.
As illustrated in FIG. 2, the suggestion generating device 1000 includes a removal unit 1100, a morphological analysis unit 1101, a topic classification unit 1102, a score factor calculation unit 1103, an identifying unit 1104, a score calculation unit 1105, a presentation unit 1106, and a storage unit 1107, and generates a suggestion 1208 from text 1200 to be searched or analyzed and an input word 1201. The storage unit 1107 stores a forced extraction word dictionary 1300, an exclusion word dictionary 1301, a search log 1302, and a user management table 1303. The suggestion generating device 1000 may include components other than these. The input word 1201 may be a search term used in a search or a word entered in order to create new text. The suggestion 1208 is a presentation of words related to the input word 1201.
The removal unit 1100, the morphological analysis unit 1101, the topic classification unit 1102, the score factor calculation unit 1103, the identifying unit 1104, the score calculation unit 1105, and the presentation unit 1106 are implemented by causing the PC to execute the suggestion generating program 1020. The storage unit 1107 is implemented by at least one of the memory 1041 and the hard disk drive 1042.
All or part of the processing performed by the CPU 1040 may be performed by a processing device other than the CPU 1040, for example a graphics processing unit (GPU). All or part of the processing performed by the CPU 1040 may also be performed by hardware that does not execute a program.
The removal unit 1100 removes stop words from the pre-removal text 1200, from which stop words have not yet been removed, and obtains post-removal text 1202, from which stop words have been removed. When removal of stop words is unnecessary, for example when the text 1200 to be searched or analyzed contains no stop words, the removal unit 1100 may be omitted.
The morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202, divides the post-removal text 1202 into a plurality of words, and obtains morphologically analyzed text 1203 that contains the plurality of words obtained by the division. The morphological analysis unit 1101 uses the forced extraction word dictionary 1300 in the morphological analysis of the post-removal text 1202. The use of the forced extraction word dictionary 1300 may be omitted.
The topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts, from the plurality of words contained in the morphologically analyzed text 1203, at least one topic word 1204 belonging to each of a plurality of topics.
The score factor calculation unit 1103 calculates a score factor 1205 of each topic word for the topic to which each of the at least one topic word 1204 extracted by the topic classification unit 1102 belongs. The score factor 1205 of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic occurrence probability of the topic word in the topic to which it belongs. The score factor 1205 of each topic word can be a factor included in the suggestion score of a candidate word, described later.
As illustrated in FIG. 3, the identifying unit 1104 identifies at least one affiliated topic word 1206 belonging to each of the plurality of topics 1250. The at least one affiliated topic word 1206 belonging to each topic includes at least part of the at least one topic word 1204 that belongs to the topic and was extracted by the topic classification unit 1102. As illustrated in FIG. 2, the identifying unit 1104 uses the search log 1302 and the exclusion word dictionary 1301 in identifying the at least one affiliated topic word 1206 belonging to each topic. As a result, the at least one affiliated topic word 1206 belonging to each topic includes at least part of the at least one topic word 1204 belonging to the topic and also includes unextracted words that are not included in the at least one topic word 1204 belonging to the topic. The use of at least one of the search log 1302 and the exclusion word dictionary 1301 may be omitted. When the use of the search log 1302 is omitted, the at least one affiliated topic word 1206 belonging to each topic does not include unextracted words that are not included in the at least one topic word 1204 belonging to the topic. When the use of the exclusion word dictionary 1301 is omitted, the at least one affiliated topic word 1206 belonging to each topic includes all of the at least one topic word 1204 belonging to the topic.
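The relationship between topic words, the search log, and the exclusion word dictionary can be sketched with set operations. This is only an illustration under assumptions: the text does not say how words from the search log are assigned to a topic or whether the exclusion word dictionary also applies to them, so the function below takes the per-topic log words as a ready-made input and applies the exclusion to both sources. The function name and all sample words are hypothetical.

```python
def affiliated_topic_words(topic_words, log_words=frozenset(), excluded=frozenset()):
    """Return the affiliated topic words of one topic.

    topic_words: words extracted for the topic by topic classification.
    log_words:   unextracted words taken from the search log for this topic
                 (how they are mapped to the topic is assumed, not specified).
    excluded:    words listed in the exclusion word dictionary.
    """
    return (set(topic_words) | set(log_words)) - set(excluded)


# Hypothetical sample data for a single topic.
topic_words = {"application", "version", "the"}
log_words = {"installer"}
excluded = {"the"}

print(sorted(affiliated_topic_words(topic_words, log_words, excluded)))
```

With both optional arguments left at their defaults the function returns all topic words unchanged, which matches the two omission cases described in the paragraph above.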
As illustrated in FIG. 3, the score calculation unit 1105 extracts, from the plurality of topics 1250, at least one extracted topic 1251 to which the input word 1201 belongs. The extraction is performed such that the input word 1201 belongs to each of the at least one extracted topic 1251. The plurality of words belonging to the at least one extracted topic 1251 become a plurality of candidate words 1260 that may be presented when the suggestion 1208 is generated.
The score calculation unit 1105 calculates a suggestion score of each candidate word 1261 that indicates the strength of association between the input word 1201 and each candidate word 1261 of the plurality of candidate words 1260. In calculating the suggestion score of each candidate word 1261, the score calculation unit 1105 identifies, among the at least one extracted topic 1251, at least one affiliation topic 1252 to which the candidate word 1261 belongs. The identification is performed such that the candidate word 1261 belongs to each of the at least one affiliation topic 1252.
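The selection of extracted topics, candidate words, and affiliation topics described above can be sketched as follows. The dictionary layout, the function names, topic "n", and the words "install" and "printer" are assumptions for illustration; the memberships of "application" (topics k, l and m) and "version" (topics k and m) follow the example used elsewhere in this document. Leaving the input word itself out of the candidates is also an assumption.

```python
def extract_topics(input_word, topics):
    """Keep only the topics to which the input word belongs."""
    return {t: words for t, words in topics.items() if input_word in words}

def candidate_words(input_word, extracted):
    """All words of the extracted topics, excluding the input word (assumption)."""
    words = set().union(*extracted.values()) if extracted else set()
    return words - {input_word}

def affiliation_topics(candidate, extracted):
    """Extracted topics to which the candidate word belongs."""
    return [t for t, words in extracted.items() if candidate in words]


topics = {
    "k": {"application", "version"},
    "l": {"application", "install"},
    "m": {"application", "version"},
    "n": {"printer"},  # hypothetical topic that does not contain the input word
}

extracted = extract_topics("application", topics)
print(sorted(extracted))                         # ['k', 'l', 'm']
print(sorted(candidate_words("application", extracted)))
print(affiliation_topics("version", extracted))  # ['k', 'm']
```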
The score calculation unit 1105 calculates the suggestion score of each candidate word 1261 from the at least one score factor of the candidate word 1261 calculated for the at least one affiliation topic 1252.
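The Definitions excerpts of this document refer to Equations (3) to (6) for the first to fourth calculation methods, but the equations themselves are not reproduced in this text. Based only on the verbal descriptions (a maximum of products, a product of products, a maximum of feature degree times in-topic occurrence probability, and a maximum of in-topic occurrence probabilities), they plausibly take the following forms. This is a reconstruction under that assumption, not the published notation; any constant multiplier permitted by the text is omitted.

```latex
\begin{align*}
\text{(3)}\quad \mathrm{Score}(word) &= \max_{t \in T(keyword,\,word)} \; feature_{keyword,t} \cdot feature_{word,t} \\
\text{(4)}\quad \mathrm{Score}(word) &= \prod_{t \in T(keyword,\,word)} \; feature_{keyword,t} \cdot feature_{word,t} \\
\text{(5)}\quad \mathrm{Score}(word) &= \max_{t \in T(keyword,\,word)} \; feature_{keyword,t} \cdot probability_{word,t} \\
\text{(6)}\quad \mathrm{Score}(word) &= \max_{t \in T(keyword,\,word)} \; probability_{word,t}
\end{align*}
```

Here $T(keyword, word)$ denotes the set of affiliation topics, $feature_{w,t}$ the feature degree of word $w$ for topic $t$, and $probability_{w,t}$ the in-topic occurrence probability of word $w$ in topic $t$.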
As illustrated in FIG. 2, the score calculation unit 1105 sorts the plurality of candidate words 1260 in order of the strength of association indicated by the suggestion score of each candidate word 1261 and creates a suggestion word list 1207. In creating the suggestion word list 1207, the score calculation unit 1105 uses the search log 1302 and the user management table 1303 and creates, for each user group, a suggestion word list 1207 specific to that user group.
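A per-user-group suggestion word list could be built as in the sketch below. The text states that the search log and the user management table are used, and the Definitions excerpts mention an addition score added to a pre-addition suggestion score, but the size of the addition is not given. The fixed `bonus` value, the function name, and the sample words and scores are therefore assumptions for illustration.

```python
def suggestion_word_list(pre_scores, used_words, bonus=0.1):
    """Sort candidate words by suggestion score, strongest association first.

    pre_scores: pre-addition suggestion score of each candidate word.
    used_words: words that users of one user group used in past searches.
    bonus:      hypothetical addition score given to each used word.
    """
    scores = {
        word: score + (bonus if word in used_words else 0.0)
        for word, score in pre_scores.items()
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))


# Hypothetical pre-addition scores and per-group search history.
pre_scores = {"version": 0.35, "install": 0.30, "update": 0.10}
used_by_group = {"developers": {"install"}, "sales": set()}

for group, used in used_by_group.items():
    print(group, suggestion_word_list(pre_scores, used))
```

In this example the group that searched for "install" in the past sees it ranked ahead of "version", while the other group sees the ranking given by the pre-addition scores alone.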
The presentation unit 1106 generates the suggestion 1208 in accordance with the suggestion word list 1207. In the suggestion 1208, the plurality of candidate words 1260 contained in the suggestion word list 1207 are presented in order of the strength of association indicated by the suggestion score of each candidate word 1261.
According to the suggestion generating device 1000, the suggestion 1208 is generated from the text 1200 to be searched or analyzed and the input word 1201. Therefore, as long as the text 1200 exists, the suggestion 1208 is generated automatically and words related to the input word 1201 are presented automatically, even when no search history such as the search log 1302 exists or when such search history is insufficient. Furthermore, because the presented words are not words simply extracted from the text 1200 but words extracted from the text 1200 through topic classification, a highly accurate suggestion 1208 is generated.
3 Example of Transition of Processing and Data
FIG. 4 is a flowchart illustrating the flow of processing performed by the suggestion generation device of the first embodiment. FIGS. 5, 6, and 7 are diagrams illustrating examples of the transition of data in the suggestion generation device of the first embodiment.
In step S101 illustrated in FIG. 4, the removal unit 1100 removes stop words from the text 1200 to be searched or analyzed and obtains the post-removal text 1202. The text 1200 to be searched or analyzed is, for example, text created in the past. The stop words to be removed are words that would act as noise in the subsequent analysis. Words removed as stop words include identification codes and the like that do not express the specific content of the text 1200. Character strings shared by many URLs, such as "http://", are also removed as stop words. In the example illustrated in FIG. 5, the text 1200 contains a text element 1400 "R000003", a text element 1401 "development process customization", a text element 1402 "master data (user, project, product, ...", a text element 1403 "R000002", a text element 1404 "process ratio at the time of prediction formula registration ...", and a text element 1405 "the process ratio can be entered to the second decimal place ...", and the text elements 1400 and 1403 are removed as stop words.
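Step S101 can be sketched as follows. This is only a minimal illustration, not the implementation of the removal unit 1100: the rule that an identification code consists of "R" followed by digits is inferred from the FIG. 5 example, and the function name `remove_stop_words` and the last sample element (a made-up sentence containing a URL) are assumptions introduced for the sketch.

```python
import re

# Assumed pattern for identification codes such as "R000003" (hypothetical rule).
ID_CODE = re.compile(r"^R\d+$")
# Character string shared by many URLs; stripped wherever it occurs.
URL_PREFIX = "http://"


def remove_stop_words(text_elements):
    """Drop elements that are ID codes and strip the URL prefix from the rest."""
    cleaned = []
    for element in text_elements:
        if ID_CODE.match(element):
            continue  # the whole element is a stop word
        cleaned.append(element.replace(URL_PREFIX, ""))
    return cleaned


# The first three elements follow FIG. 5; the last one is made up.
text_1200 = [
    "R000003",
    "development process customization",
    "R000002",
    "see http://example.com for the process ratio",
]
text_1202 = remove_stop_words(text_1200)
```

In this sketch an element that is entirely an identification code is dropped, while a shared URL string is removed from inside an element that is otherwise kept.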
In step S102, which follows step S101 in FIG. 4, the morphological analysis unit 1101 performs morphological analysis on the post-removal text 1202, divides the post-removal text 1202 into a plurality of words, and obtains a morphologically analyzed text 1203 containing the words obtained by the division. In the example illustrated in FIG. 5, the text element 1401 is divided into a plurality of words 1411, namely "development process" and "customization"; the text element 1402 is divided into a plurality of words 1412 such as "master data", "user", "project", and "product"; the text element 1404 is divided into a plurality of words 1414 such as "prediction formula", "registration", "time", the particle "no", "process", "ratio", and the particle "no"; and the text element 1405 is divided into a plurality of words 1415 such as "process", "ratio", the particle "no", "input", the particle "wa", "decimal point", "second place", "up to", "input", "possible", and the particle "ni".
The morphological analysis unit 1101 uses a forced-extraction word dictionary 1300, in which technical terms that are compound words consisting of two or more morphemes are registered, to forcibly extract the registered technical terms from the post-removal text 1202, and divides the post-removal text 1202 into a plurality of words such that the words contained in the morphologically analyzed text 1203 include the extracted technical terms. As a result, technical terms that are compound words are extracted correctly without being split. In the example shown in FIG. 5, the technical term 1416 "master data" and the technical term 1417 "prediction formula" are forcibly extracted.
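The effect of the forced-extraction word dictionary 1300 can be sketched as follows. This is an illustration under simplifying assumptions: whitespace splitting stands in for a real morphological analyzer, and the function name `tokenize_with_forced_terms` is hypothetical.

```python
def tokenize_with_forced_terms(text, forced_terms):
    """Split text on whitespace, but keep registered compound terms whole."""
    tokens = text.split()
    # Try longer registered terms first so that the longest match wins.
    term_tokens = sorted(
        (t.split() for t in forced_terms if t.split()),
        key=len,
        reverse=True,
    )
    words, i = [], 0
    while i < len(tokens):
        for term in term_tokens:
            if tokens[i:i + len(term)] == term:
                words.append(" ".join(term))  # compound term kept as one word
                i += len(term)
                break
        else:
            words.append(tokens[i])  # ordinary word
            i += 1
    return words


dictionary_1300 = {"master data", "prediction formula"}
words_1412 = tokenize_with_forced_terms("master data user project product",
                                        dictionary_1300)
```

Without the dictionary, "master data" would be split into two words in this sketch; with it, the compound term survives as a single word.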
In step S103, which follows step S102 in FIG. 4, the topic classification unit 1102 performs topic classification on the morphologically analyzed text 1203 and extracts, from the plurality of words, at least one topic word 1204 belonging to each topic of a plurality of topics 1250. Topic classification means estimating the topics handled in the input text and classifying the sentences constituting the input text into a plurality of topics. A topic indicates a rough meaning such as a subject or a field. In the example illustrated in FIG. 6, a plurality of topic words 1420 "app", "version", "development", and "specification" belonging to the topic with topic No. "0" are extracted; a plurality of topic words 1421 "test", "debug", "unit", and "management" belonging to the topic with topic No. "1" are extracted; a plurality of topic words 1422 "software", "response", "due date", and "confirmation" belonging to the topic with topic No. "2" are extracted; a plurality of topic words 1423 "design", "use case", "button", and "layout" belonging to the topic with topic No. "3" are extracted; a plurality of topic words 1424 "release", "response", "note", and "preparation" belonging to the topic with topic No. "4" are extracted; a plurality of topic words 1425 "inquiry", "receive", "answer", and "description" belonging to the topic with topic No. "5" are extracted; and a plurality of topic words 1426 "customer", "hearing", "main request", and "sub request" belonging to the topic with topic No. "6" are extracted.
In step S104, which follows step S103 in FIG. 4, the score factor calculation unit 1103 calculates, for each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102, the score factor of that topic word for the topic to which it belongs. The score factor of each topic word indicates at least one of a feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and an in-topic appearance probability of the topic word in the topic to which it belongs. In the example illustrated in FIG. 6, for the topic with topic ID "corpus1_0_0", a feature degree 1440 of "4.675" and an in-topic appearance probability 1450 of "11.21%" are calculated for the topic word 1430 "app"; a feature degree 1441 of "4.435" and an in-topic appearance probability 1451 of "5.00%" are calculated for the topic word 1431 "debug"; a feature degree 1442 of "3.599" and an in-topic appearance probability 1452 of "4.30%" are calculated for the topic word 1432 "unit"; a feature degree 1443 of "3.199" and an in-topic appearance probability 1453 of "3.40%" are calculated for the topic word 1433 "language"; and a feature degree 1444 of "2.620" and an in-topic appearance probability 1454 of "3.35%" are calculated for the topic word 1434 "version".
The feature degree of each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 is an index indicating how readily the topic word appears in the topic to which it belongs. It is determined so as to increase as the in-topic appearance probability of the topic word obtained in the topic classification increases, and so as to decrease as the appearance frequency of the topic word in the text 1200 to be searched or analyzed increases. Desirably, as shown in equation (1), the feature degree of each topic word is obtained by dividing the in-topic appearance probability of the topic word by the appearance frequency of the topic word in the text. Dividing by the appearance frequency in the text keeps words that belong to many topics, and therefore only weakly characterize any one of them, from being presented too readily.
As shown in equation (2), the appearance frequency of each topic word in the text is obtained by dividing the number of appearances of the topic word in the text by the number of words in the entire text.
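Equations (1) and (2) themselves are not reproduced in this text. Based on the two preceding paragraphs, they can presumably be written as follows. The symbols are assumptions introduced here: $p(w \mid t)$ is the in-topic appearance probability of topic word $w$ in topic $t$, $\mathrm{freq}(w)$ is the appearance frequency of $w$ in the text, $n(w)$ is the number of appearances of $w$ in the text, and $N$ is the number of words in the entire text.

```latex
\begin{align}
\mathrm{feature}_{w}^{t} &= \frac{p(w \mid t)}{\mathrm{freq}(w)} \tag{1} \\
\mathrm{freq}(w) &= \frac{n(w)}{N} \tag{2}
\end{align}
```

Under this reading, a word that is frequent throughout the text needs a correspondingly high in-topic appearance probability to reach a high feature degree.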
In step S105, which follows step S104 in FIG. 4, it is determined whether a search log 1302 recording the words used in past searches exists. If the search log 1302 is determined to exist, unextracted words are added in step S106 illustrated in FIG. 4, the addition score factor is calculated in step S107 illustrated in FIG. 4, and exclusion words are deleted in step S108 illustrated in FIG. 4. If the search log 1302 is determined not to exist, exclusion words are deleted in step S108 illustrated in FIG. 4 without steps S106 and S107 being performed.
In step S106, as illustrated in FIG. 7, the identification unit 1104 identifies from the search log 1302 unextracted words, that is, words that were used in past searches more than a set number of times but are not included in the at least one topic word 1204 extracted by the topic classification unit 1102. The identification unit 1104 adds the identified unextracted words to the at least one topic word 1204 extracted by the topic classification unit 1102 and obtains the updated at least one topic word 1209. As a result, the at least one belonging topic word 1206 identified by the identification unit 1104 comes to include the unextracted words.
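Step S106 can be sketched as follows. This is a minimal illustration, not the implementation of the identification unit 1104: the function name `add_unextracted_words`, the flat list of logged search words, and the sample data are assumptions, and the sketch does not model how an added word is assigned to a topic, since the text above does not describe that.

```python
from collections import Counter


def add_unextracted_words(topic_words, search_log_words, set_count):
    """Return topic_words plus words searched more than set_count times."""
    usage = Counter(search_log_words)
    unextracted = {
        word
        for word, count in usage.items()
        if count > set_count and word not in topic_words
    }
    return set(topic_words) | unextracted


topic_words_1204 = {"app", "version"}
# Made-up search log contents for illustration.
logged_words = ["debug", "debug", "debug", "app", "release"]
topic_words_1209 = add_unextracted_words(topic_words_1204, logged_words,
                                         set_count=2)
```

Here "debug" is added because it was searched three times, which exceeds the set count of two, while "release" was searched only once and is left out.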
FIG. 8 is a diagram explaining the algorithm for calculating the suggestion score of each candidate word for each user group in the suggestion generation device of the first embodiment. FIG. 9 is a diagram illustrating an example of the search log stored in the suggestion generation device of the first embodiment. FIG. 10 is a diagram illustrating an example of the user management table stored in the suggestion generation device of the first embodiment. FIG. 11 is a diagram illustrating an example of the addition score factor table calculated in the suggestion generation device of the first embodiment.
In the search log 1302, information identifying the user who performed each search and the word used in each search are recorded in association with each other. In the example illustrated in FIG. 9, for example, a user identifier (ID) 1500 of "001", a search word 1501 of "app", and a search time 1502 of "2016-12-26 16:55:22.916" are recorded in association with one another. The user ID 1500 is information identifying the user who performed each search. The search word 1501 is the word used in each search.
The user management table 1303 stores information identifying a user and information identifying the user group to which the user belongs in association with each other. In the example illustrated in FIG. 10, for example, a user ID 1510 of "001", a name 1511 of "XXXX", and a group (department) ID 1512 of "G001" are stored in association with one another, and a group (department) ID 1520 of "G001" and a name 1521 of "user contact desk" are stored in association with each other. The user ID 1510 and the name 1511 are information identifying a user. The group (department) ID 1520 and the name 1521 are information identifying the user group to which the user belongs.
By referring to the search log 1302 and the user management table 1303, the used words, that is, the words used in past searches by the users belonging to each user group, can be identified.
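The lookup described above amounts to joining the two tables on the user ID. The following is a minimal sketch under assumed data shapes: the search log is taken to be a list of (user ID, search word) pairs and the user management table a mapping from user ID to group ID. The function name `used_words_by_group` and the second user and group ("002", "G002") are hypothetical.

```python
from collections import defaultdict


def used_words_by_group(search_log, user_groups):
    """Map each group ID to the set of words its members searched for."""
    words = defaultdict(set)
    for user_id, search_word in search_log:
        group_id = user_groups.get(user_id)
        if group_id is not None:  # skip users missing from the table
            words[group_id].add(search_word)
    return dict(words)


# ("001", "app") and "G001" follow FIGS. 9 and 10; the rest is made up.
search_log_1302 = [("001", "app"), ("002", "test"), ("001", "version")]
user_table_1303 = {"001": "G001", "002": "G002"}
used_words = used_words_by_group(search_log_1302, user_table_1303)
```

The result groups the logged search words by the department of the user who entered them, which is the input that step S107 needs.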
In step S107 illustrated in FIG. 4, as illustrated in FIG. 8, the score factor calculation unit 1103 identifies, for each user group, the used words that were used in past searches by the users belonging to that user group from the search log 1302 and the user management table 1303, and calculates an addition score factor 1530 for the topic to which each identified used word belongs. In the example illustrated in FIG. 11, for example, for the user group with group ID 1540 "G001", an addition score factor 1542 of "10" is calculated for the topic with topic ID 1541 "corpus1_0_0".
In addition, as illustrated in FIG. 8, the score factor calculation unit 1103 calculates, for each user group, the score factor 1205 of each topic word by adding the addition score factor 1530 of the topic to which each topic word of the at least one topic word 1204 extracted by the topic classification unit 1102 belongs to the pre-addition score factor 1531 of that topic word calculated in step S104. The score factor 1205 of each topic word also indicates at least one of the feature degree, which indicates the degree to which the topic word characterizes the topic to which it belongs, and the in-topic appearance probability of the topic word in the topic to which it belongs, but it is a score factor specific to each user group. The score factor 1205 specific to each user group makes it possible to generate a suggestion 1208 suited to each user group. The score factor 1205 of each topic word calculated in step S107 is used to calculate the suggestion score 1532 of each candidate word 1261. Step S107 may be omitted, in which case the score factor of each topic word calculated in step S104 is used to calculate the suggestion score 1532 of each candidate word 1261.
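The addition performed in step S107 can be sketched as follows. The sketch takes the addition score factor of each topic as given, because the text above does not state how its value (for example "10" in FIG. 11) is derived. The function name `group_score_factors` and the dictionary layout are assumptions, and combining the FIG. 6 feature degrees with the FIG. 11 addition score factor is done here only for illustration.

```python
def group_score_factors(pre_addition, topic_bonus):
    """Add each topic's addition score factor to its topic words' factors.

    pre_addition: {(topic_id, word): score factor from step S104}
    topic_bonus:  {topic_id: addition score factor for one user group}
    """
    return {
        (topic_id, word): factor + topic_bonus.get(topic_id, 0.0)
        for (topic_id, word), factor in pre_addition.items()
    }


# Feature degrees from FIG. 6 used as pre-addition score factors 1531.
pre_addition_1531 = {
    ("corpus1_0_0", "app"): 4.675,
    ("corpus1_0_0", "debug"): 4.435,
}
# Addition score factor 1542 of group "G001" from FIG. 11.
bonus_g001 = {"corpus1_0_0": 10}
factors_1205 = group_score_factors(pre_addition_1531, bonus_g001)
```

Topics for which a group has no addition score factor keep their step S104 values in this sketch, which matches the case where step S107 is omitted.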
In step S108 illustrated in FIG. 4, as illustrated in FIG. 7, the identification unit 1104 uses an exclusion word dictionary 1301, in which exclusion words that are unnecessary for search or analysis are registered, to delete the registered exclusion words from the at least one topic word 1209 and obtains the at least one belonging topic word 1206. As a result, the at least one belonging topic word 1206 identified by the identification unit 1104 no longer includes exclusion words.
In step S109, which follows step S108 in FIG. 4, as illustrated in FIG. 3, the score calculation unit 1105 extracts, from the plurality of topics 1250, at least one extracted topic 1251 to which the input word 1201 belongs. The at least one extracted topic 1251 is extracted such that the input word 1201 belongs to each extracted topic of the at least one extracted topic 1251.
In addition, as illustrated in FIG. 7, the score calculation unit 1105 creates a suggestion candidate list 1210 containing a plurality of candidate words 1260 belonging to the at least one extracted topic 1251.
In step S110, which follows step S109 in FIG. 4, the score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261, which indicates the strength of relevance between the input word 1201 and each candidate word 1261 of the plurality of candidate words 1260 included in the suggestion candidate list 1210. In calculating the suggestion score 1532 of each candidate word 1261, the score calculation unit 1105 identifies, among the at least one extracted topic 1251, at least one belonging topic 1252 to which the candidate word 1261 belongs. The at least one belonging topic 1252 is identified such that the candidate word 1261 belongs to each belonging topic of the at least one belonging topic 1252.
In addition, the score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261 from the at least one score factor 1205 of the candidate word 1261 calculated for each of the at least one belonging topic 1252.
In addition, as illustrated in FIG. 7, the score calculation unit 1105 creates the suggestion word list 1207 by sorting the plurality of candidate words 1260 included in the suggestion candidate list 1210 in order of the strength of relevance indicated by the suggestion score 1532 of each candidate word 1261.
In addition, the score calculation unit 1105 calculates the suggestion score 1532 of each candidate word 1261 from the at least one score factor 1205 of the candidate word 1261 calculated for the user group to which the user who entered the input word 1201 belongs, and creates a suggestion word list 1207 specific to that user group.
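The sorting at the end of step S110 can be sketched as follows, assuming that "order of the strength of relevance" means that the candidate word with the largest suggestion score comes first. The function name `build_suggestion_word_list` is hypothetical, and the scores for "development" and "specification" are made-up values; only the score of "version" is taken from the FIG. 13 example.

```python
def build_suggestion_word_list(suggestion_scores):
    """Return (candidate word, score) pairs with the highest score first."""
    return sorted(suggestion_scores.items(),
                  key=lambda item: item[1],
                  reverse=True)


candidate_scores = {
    "specification": 35.5,   # made-up value
    "version": 480.48,       # value from the FIG. 13 example
    "development": 120.0,    # made-up value
}
word_list_1207 = build_suggestion_word_list(candidate_scores)
```

Producing one such list per user group only requires calling the function with the scores calculated from that group's score factors.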
FIG. 12 is a diagram illustrating an example of the suggestion word list created in the suggestion generation device of the first embodiment.
The suggestion word list 1207 stores information identifying a topic, a candidate word, and a suggestion score in association with one another. In the example illustrated in FIG. 12, for example, a topic ID 1550 of "corpus0_1_1", a topic word 1551 of "app", and a suggestion score 1552 of "4.675" are stored in association with one another. The topic ID 1550 is information identifying a topic. The topic word 1551 is a candidate word.
In step S111, which follows step S110 in FIG. 4, as illustrated in FIG. 7, the presentation unit 1106 generates the suggestion 1208 according to the suggestion word list 1207. In the suggestion 1208, the plurality of candidate words 1260 included in the suggestion word list 1207 are presented in order of the strength of relevance indicated by the suggestion score 1532 of each candidate word 1261.
4 First Calculation Method of Suggestion Score
FIG. 13 is a diagram explaining an example of calculating the suggestion score of a candidate word by the first calculation method in the suggestion generation device of the first embodiment.
In the first calculation method, as illustrated in FIG. 3, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250 such that the input word 1201 belongs to each extracted topic. In the calculation example shown in FIG. 13, topics k, l, and m are extracted as the at least one extracted topic 1610 such that the input word 1600 "app" belongs to each extracted topic.
In addition, as illustrated in FIG. 3, the score calculation unit 1105 identifies at least one belonging topic 1252 among the at least one extracted topic 1251 such that the candidate word 1261 belongs to each belonging topic. In the calculation example shown in FIG. 13, topics k and m are identified as the at least one belonging topic 1611 such that the candidate word 1601 "version" belongs to each belonging topic.
In addition, for each belonging topic of the at least one belonging topic 1252, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 calculated for that belonging topic and the score factor 1205 of the candidate word 1261 calculated for that belonging topic. In the calculation example illustrated in FIG. 13, for topic k, the product 1622 "31.2 × 15.4 = 480.48" of the feature degree 1620 "31.2" of the input word 1600 "app" calculated for topic k and the feature degree 1621 "15.4" of the candidate word 1601 "version" calculated for topic k is calculated; for topic m, the product 1625 "0.3 × 87.0 = 26.1" of the feature degree 1623 "0.3" of the input word 1600 "app" calculated for topic m and the feature degree 1624 "87.0" of the candidate word 1601 "version" calculated for topic m is calculated.
In addition, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of relevance between the input word 1201 and the candidate word 1261, from the maximum of the at least one product calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 13, the maximum value 1626 "480.48" of the product 1622 "31.2 × 15.4 = 480.48" calculated for topic k and the product 1625 "0.3 × 87.0 = 26.1" calculated for topic m is used as the suggestion score 1627 of the candidate word 1601. Instead of a suggestion score 1627 equal to the maximum value 1626, a suggestion score 1627 of the candidate word 1601 that contains the maximum value 1626 as a factor may be calculated. For example, a suggestion score 1627 of the candidate word 1601 equal to a constant multiple of the maximum value 1626 may be calculated.
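The FIG. 13 calculation can be reproduced with a short sketch. The function name `suggestion_score_max`, the dictionary layout, and the choice to return 0.0 when the two words share no topic are assumptions. Topic l is left out of the input word's dictionary because the candidate word does not belong to it and its feature degree is not given in the example.

```python
def suggestion_score_max(keyword_features, word_features):
    """First calculation method: largest per-topic product of feature degrees."""
    # Belonging topics: topics to which both the input word and the
    # candidate word belong.
    shared_topics = keyword_features.keys() & word_features.keys()
    if not shared_topics:
        return 0.0  # assumption: no shared topic means no relevance
    return max(keyword_features[t] * word_features[t] for t in shared_topics)


# Feature degrees from FIG. 13.
app_features = {"k": 31.2, "m": 0.3}        # input word "app"
version_features = {"k": 15.4, "m": 87.0}   # candidate word "version"
score_1627 = suggestion_score_max(app_features, version_features)
```

The product for topic k (about 480.48) exceeds the product for topic m (about 26.1), so the score is determined by topic k alone.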
In the first calculation method, in general, the suggestion score $\mathrm{Score}(\mathrm{word})$ of a candidate word $\mathrm{word}$ is calculated by equation (3) using the at least one belonging topic $T(\mathrm{keyword}, \mathrm{word})$, the feature degree $\mathrm{feature}_{\mathrm{keyword}}^{t}$ of the input word $\mathrm{keyword}$ calculated for topic $t$, and the feature degree $\mathrm{feature}_{\mathrm{word}}^{t}$ of the candidate word $\mathrm{word}$ calculated for topic $t$.
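Equation (3) itself is not reproduced in this text. Based on the description of the first calculation method, it can presumably be written as follows; the exact typography, including the placement of $t$ as a superscript, is an assumption.

```latex
\begin{equation}
\mathrm{Score}(\mathrm{word}) =
  \max_{t \in T(\mathrm{keyword},\, \mathrm{word})}
  \left( \mathrm{feature}_{\mathrm{keyword}}^{t}
         \times \mathrm{feature}_{\mathrm{word}}^{t} \right)
\tag{3}
\end{equation}
```

For the FIG. 13 example this gives $\max(31.2 \times 15.4,\ 0.3 \times 87.0) = 480.48$.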
According to the first calculation method, a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, is hardly reflected in the suggestion score 1532 of the candidate word 1261.
5 Second Calculation Method of Suggestion Score
FIG. 14 is a diagram explaining an example of calculating the suggestion score of a candidate word by the second calculation method in the suggestion generation device of the first embodiment.
In the second calculation method, as in the first calculation method and as illustrated in FIG. 3, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250, identifies at least one belonging topic 1252 among the at least one extracted topic 1251, and calculates, for each belonging topic, the product of the score factor 1205 of the input word 1201 calculated for that belonging topic and the score factor 1205 of the candidate word 1261 calculated for that belonging topic.
In the second calculation method, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of relevance between the input word 1201 and the candidate word 1261, from the product of the at least one product calculated for each of the at least one belonging topic 1252. In the calculation example shown in FIG. 14, the product 1628 "480.48 × 26.1 = 12540.528" of the product 1622 "31.2 × 15.4 = 480.48" calculated for topic k and the product 1625 "0.3 × 87.0 = 26.1" calculated for topic m is used as the suggestion score 1629 of the candidate word 1601. Instead of a suggestion score 1629 equal to the product 1628, a suggestion score 1629 of the candidate word 1601 that contains the product 1628 as a factor may be calculated. For example, a suggestion score 1629 of the candidate word 1601 equal to a constant multiple of the product 1628 may be calculated.
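The FIG. 14 calculation can be reproduced in the same way. The function name `suggestion_score_product`, the dictionary layout, and the choice to return 0.0 when the two words share no topic are assumptions; `math.prod` requires Python 3.8 or later.

```python
import math


def suggestion_score_product(keyword_features, word_features):
    """Second calculation method: product of all per-topic products."""
    # Belonging topics: topics to which both the input word and the
    # candidate word belong.
    shared_topics = keyword_features.keys() & word_features.keys()
    if not shared_topics:
        return 0.0  # assumption: no shared topic means no relevance
    return math.prod(keyword_features[t] * word_features[t]
                     for t in shared_topics)


# Feature degrees from FIG. 14 (the same values as in FIG. 13).
app_features = {"k": 31.2, "m": 0.3}        # input word "app"
version_features = {"k": 15.4, "m": 87.0}   # candidate word "version"
score_1629 = suggestion_score_product(app_features, version_features)
```

Unlike the maximum used in the first calculation method, every belonging topic contributes to the result here, so the weak association through topic m also changes the score.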
In the second calculation method, in general, the suggestion score $\mathrm{Score}(\mathrm{word})$ of a candidate word $\mathrm{word}$ is calculated by equation (4) using the at least one belonging topic $T(\mathrm{keyword}, \mathrm{word})$, the feature degree $\mathrm{feature}_{\mathrm{keyword}}^{t}$ of the input word $\mathrm{keyword}$ calculated for topic $t$, and the feature degree $\mathrm{feature}_{\mathrm{word}}^{t}$ of the candidate word $\mathrm{word}$ calculated for topic $t$.
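Equation (4) itself is not reproduced in this text. Based on the description of the second calculation method, it can presumably be written as follows; the exact typography, including the placement of $t$ as a superscript, is an assumption.

```latex
\begin{equation}
\mathrm{Score}(\mathrm{word}) =
  \prod_{t \in T(\mathrm{keyword},\, \mathrm{word})}
  \left( \mathrm{feature}_{\mathrm{keyword}}^{t}
         \times \mathrm{feature}_{\mathrm{word}}^{t} \right)
\tag{4}
\end{equation}
```

For the FIG. 14 example this gives $(31.2 \times 15.4) \times (0.3 \times 87.0) = 12540.528$.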
According to the second calculation method, both a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, are reflected in the suggestion score 1532 of the candidate word 1261.
6 Third Calculation Method of Suggestion Score
FIG. 15 is a diagram explaining a calculation example, using the third calculation method, of the suggestion score of a candidate word in the suggestion generation device of the first embodiment.
In the third calculation method, as in the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250, as illustrated in FIG. 3, and specifies at least one belonging topic 1252 among the at least one extracted topic 1251.
In the third calculation method, for each belonging topic, the score calculation unit 1105 calculates the product of the score factor 1205 of the input word 1201 calculated for that belonging topic and the score factor 1205 of the candidate word 1261 calculated for that belonging topic. In the calculation example shown in FIG. 15, for topic k, the product 1631, "31.2 × 0.025 = 0.78", is calculated from the feature degree 1620 of "31.2" calculated for topic k for the input word 1600 "app" and the in-topic appearance probability 1630 of "0.025" calculated for topic k for the candidate word 1601 "version". For topic m, the product 1633, "0.3 × 0.350 = 0.105", is calculated from the feature degree 1623 of "0.3" calculated for topic m for the input word 1600 "app" and the in-topic appearance probability 1632 of "0.350" calculated for topic m for the candidate word 1601 "version".
The score calculation unit 1105 also calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one product calculated for the at least one belonging topic 1252. In the calculation example shown in FIG. 15, the maximum value 1634, "31.2 × 0.025 = 0.78", of the product 1631 calculated for topic k, "31.2 × 0.025 = 0.78", and the product 1633 calculated for topic m, "0.3 × 0.350 = 0.105", is used as the suggestion score 1635 of the candidate word 1601. Instead of a suggestion score 1635 equal to the maximum value 1634, a suggestion score 1635 that includes the maximum value 1634 as a factor may be calculated. For example, the suggestion score 1635 may equal a constant multiple of the maximum value 1634.
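The FIG. 15 example can be sketched as follows. The function name `suggest_score_max_product`, the dictionary layout, and the default of 0.0 for words with no shared topic are assumptions introduced for illustration and do not come from the specification.

```python
def suggest_score_max_product(input_feature, candidate_probability):
    """Third calculation method (sketch): for each shared topic, multiply the input
    word's feature degree by the candidate's in-topic appearance probability,
    then keep the largest product."""
    shared_topics = input_feature.keys() & candidate_probability.keys()
    products = (input_feature[t] * candidate_probability[t] for t in shared_topics)
    # assumption: a candidate that shares no topic with the input word scores 0.0
    return max(products, default=0.0)


# Values from the FIG. 15 example (input word "app", candidate "version").
app_feature = {"k": 31.2, "m": 0.3}
version_probability = {"k": 0.025, "m": 0.350}
score = suggest_score_max_product(app_feature, version_probability)
print(round(score, 3))
```

Taking the maximum rather than the product means that only the strongest shared topic determines the score, so weak topics do not dilute it.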
Generally speaking, in the third calculation method, the suggestion score Score(word) of a candidate word word is calculated by equation (5) using the at least one belonging topic T(keyword, word), the feature degree feature_{keyword,t} of the input word keyword calculated for topic t, and the in-topic appearance probability probability_{word,t} of the candidate word word calculated for topic t.
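Equation (5) is likewise missing from this text. Based on the description and the FIG. 15 example, it presumably reads as follows; this is a reconstruction, not the published equation.

```latex
\mathrm{Score}(word) \;=\; \max_{t \,\in\, T(keyword,\,word)}
  \left( \mathrm{feature}_{keyword,\,t} \times \mathrm{probability}_{word,\,t} \right)
\tag{5}
```

With the FIG. 15 values this gives max(31.2 × 0.025, 0.3 × 0.350) = max(0.78, 0.105) = 0.78.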
According to the third calculation method, a large feature degree, which indicates that a word strongly characterizes the topic to which it belongs, and a large in-topic appearance probability, which indicates that a word appears with high probability in the topic to which it belongs, are readily reflected in the suggestion score 1532 of the candidate word 1261. Conversely, a small feature degree, which indicates that a word only weakly characterizes the topic to which it belongs, and a small in-topic appearance probability, which indicates that a word appears with low probability in the topic to which it belongs, are unlikely to be reflected in the suggestion score 1532 of the candidate word 1261.
7 Fourth Calculation Method of Suggestion Score
FIG. 16 is a diagram explaining a calculation example, using the fourth calculation method, of the suggestion score of a candidate word in the suggestion generation device of the first embodiment.
In the fourth calculation method, as in the first calculation method, the score calculation unit 1105 extracts at least one extracted topic 1251 from the plurality of topics 1250, as illustrated in FIG. 3, and specifies at least one belonging topic 1252 among the at least one extracted topic 1251.
In the fourth calculation method, the score calculation unit 1105 calculates the suggestion score 1532 of the candidate word 1261, which indicates the strength of the association between the input word 1201 and the candidate word 1261, from the maximum value of the at least one score factor 1205 of the candidate word 1261 calculated for the at least one belonging topic 1252. In the calculation example shown in FIG. 16, the maximum value 1638, "0.350", of the in-topic appearance probability 1636 of "0.025" calculated for topic k for the candidate word 1601 "version" and the in-topic appearance probability 1637 of "0.350" calculated for topic m for the candidate word 1601 "version" is used as the suggestion score 1639 of the candidate word 1601. Instead of a suggestion score 1639 equal to the maximum value 1638, a suggestion score 1639 that includes the maximum value 1638 as a factor may be calculated. For example, the suggestion score 1639 may equal a constant multiple of the maximum value 1638.
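A minimal sketch of the FIG. 16 example follows. The function name `suggest_score_max_probability`, the representation of the input word's topics as a set, and the 0.0 default are assumptions made for illustration only.

```python
def suggest_score_max_probability(input_topics, candidate_probability):
    """Fourth calculation method (sketch): among the topics shared with the input
    word, take the candidate's largest in-topic appearance probability."""
    shared_topics = set(input_topics).intersection(candidate_probability)
    # assumption: a candidate that shares no topic with the input word scores 0.0
    return max((candidate_probability[t] for t in shared_topics), default=0.0)


# Values from the FIG. 16 example: "app" belongs to topics k and m.
score = suggest_score_max_probability(
    {"k", "m"},
    {"k": 0.025, "m": 0.350},  # in-topic appearance probabilities of "version"
)
print(score)
```

Here the input word only selects which topics are considered; its own score factor does not enter the result.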
Generally speaking, in the fourth calculation method, the suggestion score Score(word) of a candidate word word is calculated by equation (6) using the at least one belonging topic T(keyword, word) and the in-topic appearance probability probability_{word,t} of the candidate word word calculated for topic t.
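Equation (6) is not reproduced in this text either. Judging from the description and the FIG. 16 example, it presumably has the following form; this is a reconstruction, not the published equation.

```latex
\mathrm{Score}(word) \;=\; \max_{t \,\in\, T(keyword,\,word)}
  \mathrm{probability}_{word,\,t}
\tag{6}
```

With the FIG. 16 values this gives max(0.025, 0.350) = 0.350.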
According to the fourth calculation method, a large in-topic appearance probability, which indicates that a word appears with high probability in the topic to which it belongs, is readily reflected in the suggestion score 1532 of the candidate word 1261, whereas a small in-topic appearance probability, which indicates that a word appears with low probability in the topic to which it belongs, is unlikely to be reflected in the suggestion score 1532 of the candidate word 1261.
8 Another Example of Calculation of Suggestion Score for Each User Group
FIG. 17 is a diagram explaining another example of the algorithm for calculating the suggestion score of each candidate word for each user group in the suggestion generation device of the first embodiment.
In this alternative example, the score calculation unit 1105 calculates, from the score factor 1205 of each topic word, a pre-addition suggestion score 1700 indicating the strength of the association between the input word 1201 and each candidate word 1261.
In addition, for each user group, the score calculation unit 1105 identifies, from the search log 1302 and the user management table 1303, used words that were used in past searches by users belonging to that user group, calculates an addition score for the used words, and calculates the suggestion score 1532 of each candidate word 1261 by adding the addition score 1701 of each candidate word 1261 to the pre-addition suggestion score 1700 of each candidate word 1261.
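The per-group adjustment can be illustrated with the sketch below. This passage does not state how the addition score is derived from the search log, so the rule used here (a hypothetical `weight` multiplied by the number of times the group used the word) is purely an assumption, as are the function names, the log format of (user, word) pairs, and the sample data.

```python
from collections import Counter


def group_usage_counts(search_log, user_groups):
    """Count, per user group, how often each word was used in past searches."""
    counts = {}
    for user, word in search_log:
        group = user_groups.get(user)
        if group is None:
            continue  # user not registered in the user management table
        counts.setdefault(group, Counter())[word] += 1
    return counts


def add_group_bonus(pre_scores, usage, weight=0.1):
    """Add an addition score to each pre-addition suggestion score.
    Assumption: addition score = weight * usage count (not from the source)."""
    return {word: s + weight * usage.get(word, 0) for word, s in pre_scores.items()}


# Hypothetical data standing in for the search log and user management table.
search_log = [("alice", "version"), ("bob", "version"), ("carol", "install")]
user_groups = {"alice": "dev", "bob": "dev", "carol": "sales"}
pre_scores = {"version": 0.78, "install": 0.5}

usage = group_usage_counts(search_log, user_groups)
dev_scores = add_group_bonus(pre_scores, usage.get("dev", Counter()))
print(dev_scores)
```

Keeping the pre-addition score separate from the group-specific bonus means the topic-based scores only need to be computed once and can be reused for every user group.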
9 Example of Screen
FIG. 18 is a schematic diagram illustrating an example of a screen displayed by the suggestion generation device of the first embodiment.
The screen 1800 illustrated in FIG. 18 is displayed on the display 1043.
The screen 1800 includes a text box 1820 that accepts the input word 1201 used for a search, a button 1821 that accepts an instruction to start the search, and an area 1822 that displays the suggestion 1208. Each of the text box 1820 and the button 1821 may be replaced with another type of graphical user interface (GUI) component.
In the example shown in FIG. 18, a plurality of candidate words 1830 are displayed simultaneously in the area 1822 and are arranged in an order that matches the order of the strength of association indicated by the suggestion score of each candidate word 1831. Alternatively, only one candidate word may be displayed at a time, and the displayed candidate word may be switched in a time order that matches the order of the strength of association indicated by the suggestion score of each candidate word 1831.
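The ordering described above amounts to sorting the candidates by descending suggestion score. The sketch below assumes a plain dictionary of scores; the function name `rank_candidates`, the optional `limit` parameter, and the sample words are hypothetical.

```python
def rank_candidates(scores, limit=None):
    """Return candidate words ordered from strongest to weakest association."""
    ranked = sorted(scores, key=lambda word: scores[word], reverse=True)
    return ranked if limit is None else ranked[:limit]


# Hypothetical suggestion scores for three candidate words.
scores = {"version": 0.78, "install": 0.5, "update": 0.9}
ranked = rank_candidates(scores)
print(ranked)
```

The same ranked list could drive either layout: shown all at once in the area 1822, or stepped through one word at a time.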
Although the present invention has been described in detail, the above description is illustrative in all aspects, and the present invention is not limited to it. It is understood that countless variations not illustrated can be envisaged without departing from the scope of the present invention.
1000 Suggestion generation device
1020 Suggestion generation program
1100 Removal unit
1101 Morphological analysis unit
1102 Topic classification unit
1103 Score factor calculation unit
1104 Identification unit
1105 Score calculation unit
1106 Presentation unit
1107 Storage unit
1200 Text to be searched or analyzed (pre-removal text)
1201 Input word
1202 Post-removal text
1203 Morphologically analyzed text
1204 At least one topic word
1205 Score factor of each topic word
1206 At least one belonging topic word
1207 Suggest word list
1208 Suggestion
Claims (16)
1. A suggestion generation device comprising:
a morphological analysis unit that performs morphological analysis on a text to divide the text into a plurality of words and obtain a morphologically analyzed text;
a topic classification unit that performs topic classification on the morphologically analyzed text to extract, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
a score factor calculation unit that calculates, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and an in-topic appearance probability of each topic word in the topic to which it belongs;
an identification unit that identifies at least one belonging topic word that belongs to each topic and includes at least a part of the at least one topic word;
a score calculation unit that extracts at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, and calculates a score of each candidate word of a plurality of candidate words belonging to the at least one extracted topic, the score indicating the strength of the association between the input word and each candidate word, wherein, in the calculation, the score calculation unit specifies at least one belonging topic among the at least one extracted topic such that each candidate word belongs to each belonging topic, and calculates the score of each candidate word from at least one score factor of each candidate word calculated for the at least one belonging topic; and
a presentation unit that presents the plurality of candidate words in order of the strength of association indicated by the score of each candidate word.
2. The suggestion generation device according to claim 1, further comprising a removal unit that removes stop words from a pre-removal text to obtain the text.
3. The suggestion generation device according to claim 1 or 2, further comprising a storage unit that stores a forced extraction word dictionary in which compound words are registered,
wherein the morphological analysis unit divides the text such that the plurality of words include the compound words.
4. The suggestion generation device according to any one of claims 1 to 3, further comprising a storage unit that stores a search log in which words used in past searches are recorded, wherein
the score factor calculation unit
calculates a pre-addition score factor of each topic word indicating at least one of the feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and the in-topic appearance probability of each topic word in the topic to which it belongs, and,
for each user group, identifies from the search log used words that were used in the past searches by users belonging to that user group, calculates an addition score factor for the topic to which the used words belong, and calculates the score factor of each topic word by adding the addition score factor of the topic to which each topic word belongs to the pre-addition score factor of each topic word, and
the score calculation unit calculates the score of each candidate word from at least one score factor of each candidate word calculated for the user group to which the user who entered the input word belongs.
5. The suggestion generation device according to any one of claims 1 to 4, wherein
the score factor of each topic word indicates a feature degree indicating the degree to which each topic word characterizes the topic to which it belongs, and
the feature degree is obtained by dividing the in-topic appearance probability of each topic word in the topic to which it belongs by the appearance frequency of each topic word in the text.
6. The suggestion generation device according to any one of claims 1 to 5, further comprising a storage unit that stores a search log in which words used in past searches are recorded,
wherein the identification unit identifies, from the search log, an unextracted word that was used in the past searches more than a set number of times but is not included in the at least one topic word, and identifies the at least one belonging topic word such that the at least one belonging topic word includes the unextracted word.
7. The suggestion generation device according to any one of claims 1 to 6, further comprising a storage unit that stores an exclusion word dictionary in which exclusion words are registered,
wherein the identification unit identifies the at least one belonging topic word such that the at least one belonging topic word does not include the exclusion words.
8. The suggestion generation device according to any one of claims 1 to 7, wherein the score calculation unit
calculates, for each belonging topic, the product of the score factor of the input word calculated for that belonging topic and the score factor of each candidate word calculated for that belonging topic, and
calculates the score of each candidate word from the maximum value of the at least one product calculated for the at least one belonging topic.
9. The suggestion generation device according to any one of claims 1 to 7, wherein the score calculation unit
calculates, for each belonging topic, the product of the score factor of the input word calculated for that belonging topic and the score factor of each candidate word calculated for that belonging topic, and
calculates the score of each candidate word from the product of the at least one product calculated for the at least one belonging topic.
10. The suggestion generation device according to claim 8 or 9, wherein
the score factor of the input word calculated for each belonging topic indicates a feature degree indicating the degree to which the input word characterizes that belonging topic, and
the score factor of each candidate word calculated for each belonging topic indicates a feature degree indicating the degree to which each candidate word characterizes that belonging topic.
11. The suggestion generation device according to claim 8 or 9, wherein
the score factor of the input word calculated for each belonging topic indicates a feature degree indicating the degree to which the input word characterizes that belonging topic, and
the score factor of each candidate word calculated for each belonging topic indicates the in-topic appearance probability of each candidate word in that belonging topic.
12. The suggestion generation device according to any one of claims 1 to 7, wherein the score calculation unit calculates the score of each candidate word from the maximum value of the at least one score factor of each candidate word calculated for the at least one belonging topic.
13. The suggestion generation device according to claim 12, wherein the score factor of each candidate word calculated for each belonging topic is the in-topic appearance probability of each candidate word in that belonging topic.
14. The suggestion generation device according to any one of claims 1 to 13, further comprising a storage unit that stores a search log in which words used in past searches are recorded, wherein the score calculation unit
calculates a pre-addition score of each candidate word indicating the strength of the association between the input word and each candidate word, and,
for each user group, identifies from the search log used words that were used in the past searches by users belonging to that user group, calculates an addition score for the used words, and calculates the score of each candidate word by adding the addition score of each candidate word to the pre-addition score of each candidate word.
15. A suggestion generation program that causes a computer to execute:
a) a step of performing morphological analysis on a text to divide the text into a plurality of words and obtain a morphologically analyzed text;
b) a step of performing topic classification on the morphologically analyzed text to extract, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
c) a step of calculating, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and an in-topic appearance probability of each topic word in the topic to which it belongs;
d) a step of identifying at least one belonging topic word that belongs to each topic and includes at least a part of the at least one topic word;
e) a step of extracting at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, and calculating a score of each candidate word of a plurality of candidate words belonging to the at least one extracted topic, the score indicating the strength of the association between the input word and each candidate word, wherein, in the calculation, at least one belonging topic is specified among the at least one extracted topic such that each candidate word belongs to each belonging topic, and the score of each candidate word is calculated from at least one score factor of each candidate word calculated for the at least one belonging topic; and
f) a step of presenting the plurality of candidate words in order of the strength of association indicated by the score of each candidate word.
16. A suggestion generation method comprising:
a) a step of performing morphological analysis on a text to divide the text into a plurality of words and obtain a morphologically analyzed text;
b) a step of performing topic classification on the morphologically analyzed text to extract, from the plurality of words, at least one topic word belonging to each topic of a plurality of topics;
c) a step of calculating, for the topic to which each topic word of the at least one topic word belongs, a score factor of each topic word indicating at least one of a feature degree, which indicates the degree to which each topic word characterizes the topic to which it belongs, and an in-topic appearance probability of each topic word in the topic to which it belongs;
d) a step of identifying at least one belonging topic word that belongs to each topic and includes at least a part of the at least one topic word;
e) a step of extracting at least one extracted topic from the plurality of topics such that an input word belongs to each extracted topic, and calculating a score of each candidate word of a plurality of candidate words belonging to the at least one extracted topic, the score indicating the strength of the association between the input word and each candidate word, wherein, in the calculation, at least one belonging topic is specified among the at least one extracted topic such that each candidate word belongs to each belonging topic, and the score of each candidate word is calculated from at least one score factor of each candidate word calculated for the at least one belonging topic; and
f) a step of presenting the plurality of candidate words in order of the strength of association indicated by the score of each candidate word.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017-180015 | 2017-09-20 | ||
JP2017180015A JP6967412B2 (en) | 2017-09-20 | 2017-09-20 | Suggestion generator, suggestion generator and suggestion generator |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019058698A1 (en) | 2019-03-28 |
Family
ID=65811318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2018/024841 WO2019058698A1 (en) | 2017-09-20 | 2018-06-29 | Suggestion generation device, suggestion generation program and suggestion generation method |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP6967412B2 (en) |
TW (1) | TWI703453B (en) |
WO (1) | WO2019058698A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06202685A (en) * | 1992-12-28 | 1994-07-22 | Ricoh Co Ltd | Speech synthesizing device |
JP2010003134A (en) * | 2008-06-20 | 2010-01-07 | Yahoo Japan Corp | Server, method, and program for recommending retrieval keyword |
JP2010009307A (en) * | 2008-06-26 | 2010-01-14 | Kyoto Univ | Feature word automatic learning system, content linkage type advertisement distribution computer system, retrieval linkage type advertisement distribution computer system and text classification computer system, and computer program and method for them |
JP2014067095A (en) * | 2012-09-24 | 2014-04-17 | Yahoo Japan Corp | Search system, search method and program |
JP2017005305A (en) * | 2015-06-04 | 2017-01-05 | キヤノン株式会社 | Information processing unit, control method of the same, and program |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060248078A1 (en) * | 2005-04-15 | 2006-11-02 | William Gross | Search engine with suggestion tool and method of using same |
US20070192318A1 (en) * | 2005-09-14 | 2007-08-16 | Jorey Ramer | Creation of a mobile search suggestion dictionary |
JP5338835B2 (en) * | 2011-03-24 | 2013-11-13 | カシオ計算機株式会社 | Synonym list generation method and generation apparatus, search method and search apparatus using the synonym list, and computer program |
CN105095204B (en) * | 2014-04-17 | 2018-12-14 | 阿里巴巴集团控股有限公司 | The acquisition methods and device of synonym |
2017
- 2017-09-20: JP, JP2017180015A (JP6967412B2), status: Active
2018
- 2018-06-29: WO, PCT/JP2018/024841 (WO2019058698A1), status: Application Filing
- 2018-07-27: TW, TW107126176A (TWI703453B), status: active
Also Published As
Publication number | Publication date |
---|---|
JP6967412B2 (en) | 2021-11-17 |
TW201915785A (en) | 2019-04-16 |
TWI703453B (en) | 2020-09-01 |
JP2019057017A (en) | 2019-04-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8346795B2 (en) | System and method for guiding entity-based searching | |
JP4857333B2 (en) | How to determine context summary information across documents | |
JP6461980B2 (en) | Coherent question answers in search results | |
US10552467B2 (en) | System and method for language sensitive contextual searching | |
AU2015203818B2 (en) | Providing contextual information associated with a source document using information from external reference documents | |
WO2014208213A1 (en) | Non-factoid question-and-answer system and method | |
RU2547213C2 (en) | Assigning actionable attributes to data describing personal identity | |
US20200285808A1 (en) | Synonym dictionary creation apparatus, non-transitory computer-readable recording medium storing synonym dictionary creation program, and synonym dictionary creation method | |
JP2007073054A (en) | Parallel translation phrase presentation program, parallel translation phrase presentation method and parallel translation phrase presentation device | |
JP2001075966A (en) | Data analysis system | |
JPWO2012096388A1 (en) | Unexpectedness determination system, unexpectedness determination method, and program | |
JP2014106665A (en) | Document retrieval device and document retrieval method | |
JP5836893B2 (en) | File management apparatus, file management method, and program | |
JP2005301856A (en) | Method and program for document retrieval, and document retrieving device executing the same | |
JP5345987B2 (en) | Document search apparatus, document search method, and document search program | |
JP2020060811A (en) | Information processing apparatus, information processing method, and program | |
Ullah et al. | Pattern and semantic analysis to improve unsupervised techniques for opinion target identification | |
JP4525433B2 (en) | Document aggregation device and program | |
US10572592B2 (en) | Method, device, and computer program for providing a definition or a translation of a word belonging to a sentence as a function of neighbouring words and of databases | |
WO2019058698A1 (en) | Suggestion generation device, suggestion generation program and suggestion generation method | |
JP2007172179A (en) | Opinion extraction device, opinion extraction method and opinion extraction program | |
CN113919352A (en) | Database sensitive data identification method and device | |
JP2006139484A (en) | Information retrieval method, system therefor and computer program | |
JP2009217406A (en) | Document retrieval device, method, and program | |
JP7488207B2 (en) | Future event estimation system and future event estimation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18859030; Country of ref document: EP; Kind code of ref document: A1 |