WO2021044519A1 - Information processing device, program, and information processing method - Google Patents
Information processing device, program, and information processing method
- Publication number
- WO2021044519A1 PCT/JP2019/034632
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- search
- similarity
- search target
- tokens
- token
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the present invention relates to an information processing device, a program, and an information processing method.
- in the mainstream approach, one vector is assigned to one token, but such a method cannot resolve the ambiguity of tokens whose meaning changes depending on the context. Methods have therefore been proposed for acquiring token vectors that take the context into account.
- for example, Non-Patent Document 1 describes a sentence-similarity calculation method in which, for each token x_i included in the search query x, the token with the highest degree of similarity among the tokens Y_jk included in the search target sentence Y_j is selected, the inter-token similarities σ(x_i, Y_jk) are calculated for those i combinations, and their average value is used as the similarity between the sentences.
- one or more aspects of the present invention aim to reduce the calculation load of similarity in document retrieval.
- the information processing apparatus includes: a search target storage unit that stores a plurality of search target sentences each including a plurality of search target tokens, each token being a minimum meaningful unit; a similarity determination information storage unit that stores similarity determination information indicating, for the combination of each of the plurality of search target tokens with each of a plurality of search tokens (the minimum meaningful units contained in a search sentence), whether the similarity is high or low; and an inter-sentence similarity calculation unit that calculates the inter-token similarity for combinations indicated in the similarity determination information as having high similarity, sets the inter-token similarity to a predetermined value for combinations indicated as having low similarity, and thereby calculates the inter-sentence similarity between the search sentence and each of the plurality of search target sentences.
- the program causes a computer to function as: a search target storage unit that stores a plurality of search target sentences each including a plurality of search target tokens, each token being a minimum meaningful unit; a similarity determination information storage unit that stores similarity determination information indicating, for the combination of each of the plurality of search target tokens with each of a plurality of search tokens (the minimum meaningful units contained in a search sentence), whether the similarity is high or low; and an inter-sentence similarity calculation unit that calculates the inter-token similarity for combinations indicated as having high similarity, sets the inter-token similarity to a predetermined value for combinations indicated as having low similarity, and thereby calculates the inter-sentence similarity between the search sentence and each of the plurality of search target sentences.
- the information processing method is a method for calculating inter-sentence similarities between a plurality of search target sentences, each including a plurality of search target tokens that are minimum meaningful units, and a search sentence including a plurality of search tokens that are minimum meaningful units. Upon accepting input of the search sentence, the method calculates the inter-token similarity for combinations indicated as having high similarity in similarity determination information, which indicates for the combination of each search target token with each search token whether the similarity is high or low, sets the inter-token similarity to a predetermined value for combinations indicated as having low similarity, and thereby calculates the inter-sentence similarity between the search sentence and each of the plurality of search target sentences.
- FIG. 1 is a block diagram schematically showing the configuration of a document retrieval device, which is an information processing device according to the first embodiment. Schematic diagrams show examples of the search target token array, the search target context-dependent expression array, the search query token array, the search query context-dependent expression array, and the similar token table. A block diagram schematically shows the hardware configuration for realizing the document retrieval device. Flowcharts show the processing in the search target context-dependent expression generation unit in Embodiment 1 and the processing in the data structure conversion unit.
- a block diagram schematically shows the configuration of a document retrieval device, which is an information processing device according to the second embodiment, and a flowchart shows the processing in the search target context-dependent expression generation unit in Embodiment 2.
- a block diagram schematically shows the configuration of a document retrieval device, which is an information processing device according to the third embodiment, and flowcharts show the processing in the search target dimension reduction unit and the processing in the search query dimension reduction unit.
- FIG. 1 is a block diagram schematically showing a configuration of a document retrieval device 100, which is an information processing device according to the first embodiment.
- the document search device 100 includes a search target database (hereinafter referred to as a search target DB) 101, a search target context-dependent expression generation unit 102, an information generation unit 103, a search query input unit 106, a tokenizer 107, a search query context-dependent expression generation unit 108, a similar token table storage unit 110, an inter-sentence similarity calculation unit 111, and a search result output unit 112.
- the information generation unit 103 includes a data structure conversion unit 104, a search database (hereinafter referred to as a search DB) 105, and a similar token table generation unit 109.
- the search target DB 101 is a search target storage unit that stores a search target sentence and a search target token array corresponding to the search target sentence.
- the search target token array is an array of a plurality of tokens, and one search target token array constitutes one sentence.
- the token is the smallest unit having a meaning, and is a character or a character string. Further, the token included in the search target token array is also referred to as a search target token. Further, it is assumed that the search target DB 101 stores a plurality of search target sentences and a plurality of search target token arrays corresponding to the plurality of search target sentences.
- the search target token array may be in a two-dimensional array format as shown in FIG. 2: the tokens of the p-th search target sentence are stored in the p-th row, and the q-th token from the beginning of the p-th search target sentence is stored at row p, column q.
- in FIG. 2, each search target token is a character or a character string enclosed in “”.
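A minimal sketch of this layout (the sentences are illustrative, and 0-indexing is used here instead of the 1-indexed p and q of the description):

```python
# Each row holds the tokens of one search target sentence: the q-th token
# of the p-th search target sentence is stored at row p, column q.
search_target_token_array = [
    ["summer", "vacation", "is", "in", "august"],    # sentence 0
    ["the", "holiday", "schedule", "is", "posted"],  # sentence 1
]

# Row p, column q addressing:
p, q = 0, 1
assert search_target_token_array[p][q] == "vacation"
```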
- the search target context-dependent expression generation unit 102 acquires the search target token array from the search target DB 101. Then, the search target context-dependent expression generation unit 102 generates a search target context-dependent expression array in which the search target context-dependent expressions, which are the context-dependent expressions of all the search target tokens included in the acquired search target token array, are arranged.
- the generated search target context-dependent expression array is provided to the data structure conversion unit 104 and the sentence-to-sentence similarity calculation unit 111.
- the context-dependent expression is a vector, and the search target context-dependent expression is a search target vector.
- the search target context-dependent expression generation unit 102 is a search target vector generation unit that generates a search target vector that is a vector corresponding to the meaning of the search target token included in the search target token array.
- in other words, the search target context-dependent expression generation unit 102 specifies the meaning of a search target token according to the context of the search target sentence corresponding to the search target token array including that token, and generates the search target vector so as to indicate the specified meaning.
- the search target context-dependent expression generation unit 102 specifies the meaning of each of the plurality of search target tokens included in the search target token array according to the context. Then, the search target context-dependent expression generation unit 102 can generate the search target context-dependent expression array by arranging multidimensional vectors indicating the specified meanings according to the arrangement of the plurality of search target tokens.
- the search target context-dependent expression array may be, for example, in a two-dimensional array format as shown in FIG. 3: the vectors for the p-th search target sentence are stored in the p-th row, and the vector corresponding to the q-th token from the beginning of the p-th search target sentence is stored at row p, column q.
- a known method may be used as a method for specifying the context-sensitive expression corresponding to the search target token.
- a method for acquiring a vector representation of a token that takes the appearance context into account is described in, for example, the following document: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
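A hedged sketch of the idea of context-dependent token vectors (a hash of a small context window stands in for a real contextual encoder such as BERT; the sentences, function, and dimensionality are illustrative assumptions):

```python
import hashlib

def context_vector(tokens, i, dim=8):
    # Toy stand-in for a contextual encoder: the vector for tokens[i]
    # depends on the token and its neighbours, so the same token receives
    # different vectors in different contexts.
    window = " ".join(tokens[max(0, i - 1): i + 2])  # token plus neighbours
    digest = hashlib.sha256(window.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

sent_a = ["bank", "of", "the", "river"]
sent_b = ["bank", "account", "balance"]
# "bank" receives context-dependent (different) vectors in the two sentences.
assert context_vector(sent_a, 0) != context_vector(sent_b, 0)
```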
- the data structure conversion unit 104 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 102. Then, the data structure conversion unit 104 converts the acquired search target context-dependent expression array into the search data structure. The generated search data structure is stored in the search DB 105.
- the search data structure may be selected from any known data structure according to the k-nearest neighbor search algorithm to be used. For example, when ANN (Approximate Nearest Neighbor search) is used as the algorithm for the k-nearest neighbor search, the data structure of the k-d tree may be selected. Further, when LSH (Locality Sensitive Hashing) is used as the algorithm for the k-nearest neighbor search, the mapping result by the hash function may be selected as the data structure.
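As one hedged illustration of such a search data structure, a minimal random-hyperplane LSH index might look as follows (the vectors, seed, and bucket layout are assumptions for illustration, not the patent's implementation):

```python
import random

def lsh_signature(vec, planes):
    # Random-hyperplane LSH: each hyperplane contributes one sign bit, so
    # vectors that are close in angle tend to receive the same signature.
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

random.seed(0)
dim, n_planes = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

# Index the search target vectors into hash buckets; the bucket map plays
# the role of the "search data structure" stored in the search DB.
target_vectors = {0: [1.0, 0.9, 0.0, 0.1], 1: [-1.0, 0.2, 0.8, -0.5]}
buckets = {}
for idx, v in target_vectors.items():
    buckets.setdefault(lsh_signature(v, planes), []).append(idx)

# At query time, only the query's own bucket needs to be inspected; a
# nearby vector may share that bucket with high probability.
query = [0.95, 1.0, 0.05, 0.0]
candidates = buckets.get(lsh_signature(query, planes), [])
```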
- the search DB 105 stores the search data structure converted by the data structure conversion unit 104.
- the search query input unit 106 is a search input unit that accepts input of a search query that is a search sentence.
- the search query contains multiple tokens.
- the token included in the search query is also called a search token.
- the search query input unit 106 accepts, as a search query, the input of a question sentence such as "When is the summer vacation?".
- the tokenizer 107 acquires the search query from the search query input unit 106. The tokenizer 107 is a token identification unit that identifies search query tokens from the acquired search query and generates a search query token array in which the search query tokens are arranged. The generated search query token array is provided to the search query context-dependent expression generation unit 108. A token included in the search query token array is also referred to as a search query token.
- the tokenizer 107 identifies tokens, which are the smallest meaningful units, from the search query by using an arbitrary known technique such as morphological analysis, and arranges the identified tokens to obtain the search query token array.
- FIG. 4 is a schematic diagram showing an example of a search query token array.
- the r-th token of the search query is stored at the r-th position of the search query token array.
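A minimal sketch of this step (a regex tokenizer stands in for morphological analysis, which a real system would use for Japanese; the example query is taken from the description above):

```python
import re

def tokenize(query):
    # Toy tokenizer: split the lowercased query on word characters.
    return re.findall(r"\w+", query.lower())

# The r-th token of the search query is stored at index r.
search_query_token_array = tokenize("When is the summer vacation?")
assert search_query_token_array == ["when", "is", "the", "summer", "vacation"]
```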
- the search query context-dependent expression generation unit 108 acquires the search query token array from the tokenizer 107. Then, the search query context-dependent expression generation unit 108 generates a search query context-dependent expression array in which the search query context-dependent expressions, which are the context-dependent expressions of all the tokens included in the acquired search query token array, are arranged.
- the generated search query context-dependent expression array is provided to the similar token table generation unit 109 and the inter-sentence similarity calculation unit 111.
- the search query context-sensitive expression is a search vector.
- the search query context-sensitive expression generation unit 108 is a search vector generation unit that generates a search vector that is a vector corresponding to the meaning of the search token.
- the search query context-sensitive expression generation unit 108 specifies the meaning of the search token according to the context of the search sentence, and generates a search vector so as to indicate the specified meaning.
- the search query context-dependent expression generation unit 108 specifies the meaning of each of the plurality of search query tokens included in the search query token array according to the context. Then, the search query context-dependent expression generation unit 108 can generate the search query context-dependent expression array by arranging multidimensional vectors indicating the specified meanings according to the arrangement of the plurality of search query tokens.
- a known method may be used in the same manner as the above-mentioned search target context-sensitive expression.
- FIG. 5 is a schematic diagram showing an example of a search query context-sensitive expression array.
- the vector which is the context-dependent expression corresponding to the r-th token of the search query is stored at the r-th position of the search query context-dependent expression array.
- the similar token table generation unit 109 acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108, and acquires the search data structure from the search DB 105. Then, from the acquired search query context-dependent expression array and search data structure, the similar token table generation unit 109 generates a similar token table as similarity determination information indicating, for each combination of a search target token and a search query token, whether the similarity is relatively high or low. The generated similar token table is stored in the similar token table storage unit 110.
- the similar token table generation unit 109 does not need to calculate the similarity for all combinations of the search target tokens and the search query tokens and then judge from the calculated values whether each similarity is relatively high. It is sufficient to determine whether the similarity is relatively high or low for all combinations of the search target tokens and the search query tokens by a known search method that is more efficient than a brute-force search.
- for example, the similar token table generation unit 109 may use a k-nearest neighbor search, which searches for k (k is an integer of 1 or more) neighbor points, to find the k search target tokens having a relatively high degree of similarity to a given search query token.
- the similar token table generation unit 109 may then treat the k found search target tokens as tokens having a relatively high degree of similarity, and the remaining search target tokens as tokens having a relatively low degree of similarity.
- a known technique such as ANN or LSH may be used.
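For clarity, a minimal sketch of how such a similar token table could be built (the vectors and k are illustrative; the document's point is that an approximate method such as ANN or LSH would replace the exact k-NN used here):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def similar_token_table(query_vecs, target_vecs, k):
    # For each search query token, mark its k most similar search target
    # tokens True ("high") and all the others False ("low").
    table = []
    for qv in query_vecs:
        sims = [(cosine(qv, tv), j) for j, tv in enumerate(target_vecs)]
        top = {j for _, j in sorted(sims, reverse=True)[:k]}
        table.append([j in top for j in range(len(target_vecs))])
    return table

query_vecs = [[1.0, 0.0], [0.0, 1.0]]
target_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
table = similar_token_table(query_vecs, target_vecs, k=1)
```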
- FIG. 6 is a schematic diagram showing an example of a similar token table.
- the similar token table is a look-up table indicating, for each token included in the search query, whether its similarity to each token included in all the search target sentences is relatively high or relatively low.
- the row represents the search query token and the column represents the search target token.
- “ ⁇ ” indicates that the degree of similarity is relatively high
- “x” indicates that the degree of similarity is relatively low.
- for example, for the search query token "summer", the similarity to the search target tokens "holiday" and "summer" is relatively high among the tokens included in all the search target sentences.
- since the k-nearest neighbor search algorithm can be applied to the generation of the similar token table, there is an advantage that the amount of calculation can be reduced.
- in FIG. 6, the search query tokens are assigned to the rows and the search target tokens are assigned to the columns. Alternatively, the search query context-dependent expression (that is, the search vector) corresponding to each search query token may be stored in the row, and the search target context-dependent expression (that is, the search target vector) corresponding to each search target token may be stored in the column.
- the data structure conversion unit 104, the search DB 105, and the similar token table generation unit 109 constitute the information generation unit 103, which generates the similar token table serving as the similarity determination information.
- the information generation unit 103 searches, from the plurality of points indicated by the plurality of search target vectors, for one or more neighborhood points located in the vicinity of the point indicated by one search vector among the plurality of search vectors. The combination of the one search token corresponding to that search vector with each of the one or more search target tokens corresponding to the one or more neighborhood points is determined to have high similarity, and the combination of the one search token with each of the search target tokens corresponding to points other than the one or more neighborhood points is determined to have low similarity.
- here, the information generation unit 103 searches for the one or more neighborhood points by a search method that is more efficient than a brute-force search, which would calculate all the distances between the point corresponding to the one search vector and the plurality of points corresponding to the plurality of search target vectors.
- the similar token table storage unit 110 is a similarity determination information storage unit that stores the similar token table as the similarity determination information.
- the similar token table indicates whether the combination of each of the plurality of search target tokens with each of the plurality of search tokens has a high or low degree of similarity.
- the inter-sentence similarity calculation unit 111 acquires the similar token table from the similar token table storage unit 110, acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 102, and acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108. Then, from the acquired similar token table, search target context-dependent expression array, and search query context-dependent expression array, the inter-sentence similarity calculation unit 111 calculates the inter-sentence similarity, which is the similarity between the search query and each search target sentence. The calculated inter-sentence similarities are provided to the search result output unit 112.
- the inter-sentence similarity calculation unit 111 calculates the inter-token similarity for the combinations shown to have high similarity in the similar token table, and sets the inter-token similarity to a predetermined value for the combinations shown to have low similarity, thereby reducing the calculation load when calculating the inter-sentence similarity.
- here, the inter-sentence similarity calculation unit 111 makes the inter-token similarity of the combination of one search target vector and one search vector higher as the distance between the point indicated by the one search target vector among the plurality of search target vectors and the point indicated by the one search vector among the plurality of search vectors becomes shorter.
- specifically, for each of the plurality of search tokens, the inter-sentence similarity calculation unit 111 specifies the maximum value of the inter-token similarity over the combinations with each of the plurality of search target tokens included in one search target sentence among the plurality of search target sentences, and calculates the inter-sentence similarity between the search sentence and that search target sentence from the average value of the specified maximum values.
- the inter-sentence similarity may be calculated using an arbitrary inter-token similarity.
- the inter-sentence similarity may be calculated using the Maximum Alignment method described in Non-Patent Document 1 described above.
- first, the calculation of the inter-sentence similarity by the general Maximum Alignment method will be described, and then the sped-up inter-sentence similarity calculation in the first embodiment will be described.
- here, x_i is the i-th search query token of the search query x, and Y_jk is the k-th search target token of the search target sentence Y_j.
- σ(x_i, Y_jk) denotes the inter-token similarity between the search query token x_i and the search target token Y_jk.
- as the inter-token similarity, the distance between the vector of the search query token and the vector of the search target token (for example, the cosine similarity of the context-dependent expressions) or the like is used.
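Putting the definitions above together (maximum over the target tokens of a sentence, averaged over the query tokens), the Maximum Alignment inter-sentence similarity can be sketched as follows; since equations (1) to (9) are not reproduced in this text, this is a reconstruction consistent with the surrounding description rather than a verbatim copy:

```latex
S(x, Y_j) \;=\; \frac{1}{n} \sum_{i=1}^{n} \max_{k}\, \sigma(x_i, Y_{jk})
```

where n is the number of search query tokens in the search query x.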
- the inter-sentence similarity between the search query and each search target sentence is calculated based on the above concept. As shown in the following equation (2), this corresponds to obtaining the inter-sentence similarity s between the search query and every search target sentence, and generating the inter-sentence similarity S(x, Y) between the search query and each search target sentence.
- the j-th element of S(x, Y) is the inter-sentence similarity between the search query x and the search target sentence Y_j.
- the similarity matrix A(i) between the search query token x_i and all the search target tokens is defined by the following equation (3).
- the similarity matrix A(i) is a matrix of the form represented by the following equation (4).
- the inter-sentence similarity S(x, Y) between the search query and each search target sentence can be transformed as shown in the following equation (7).
- the inter-sentence similarity calculation unit 111 in the first embodiment speeds up the calculation of the inter-sentence similarity.
- in the Maximum Alignment method, the values of the inter-token similarity between a search query token and all the search target tokens are compared for each search target sentence, and the maximum value is obtained. In this way, the maximum value of the inter-token similarity between the search query token x_i and the search target sentence Y_j can be obtained.
- for combinations shown to have low similarity in the similar token table, the inter-sentence similarity calculation unit 111 omits the calculation of the inter-token similarity (for example, approximates it as 0). This speeds up the calculation of the inter-document similarity.
- the inter-sentence similarity calculation unit 111 approximates the similarity matrix A (i) as shown in the following equation (8).
- γ(x_i, Y_jk) is specified by the following equation (9).
- simset(x_i) is a function that returns the set of search target tokens Y_jk whose column entries are "○" in the row of the search query token x_i in the similar token table. For example, in the example shown in FIG. 6, for the row of the search query token "summer", the search target tokens "holiday" and "summer" are returned by simset(x_i).
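A hedged sketch of the sped-up calculation (the vectors, the simset function, and the dot-product similarity are illustrative stand-ins for the quantities defined above, not the patent's implementation):

```python
def fast_sentence_similarity(query_vecs, target_sent_vecs, simset, sigma, default=0.0):
    # Maximum Alignment with the table-based speed-up: sigma is evaluated
    # only for target tokens in simset(i); all other combinations are
    # approximated by a predetermined value (default, 0 here).
    total = 0.0
    for i, qv in enumerate(query_vecs):
        best = default
        for k, tv in enumerate(target_sent_vecs):
            if k in simset(i):  # combination marked "high" in the table
                best = max(best, sigma(qv, tv))
        total += best
    return total / len(query_vecs)

sigma = lambda u, v: sum(a * b for a, b in zip(u, v))  # toy inter-token similarity
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
target_sent = [[1.0, 0.0], [0.0, 0.5]]
simset = lambda i: {0, 1}  # here the table marks every combination "high"
score = fast_sentence_similarity(query_vecs, target_sent, simset, sigma)
```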
- the search result output unit 112 acquires the inter-sentence similarities from the inter-sentence similarity calculation unit 111 and acquires the search target sentences from the search target DB 101. Then, the search result output unit 112 sorts the search target sentences according to the inter-sentence similarity and outputs the sorted search target sentences as the search result.
- for the sorting, any sorting method, such as ascending or descending order of the inter-sentence similarity, may be selected.
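As a minimal illustration of this step (the sentences and similarity values are hypothetical):

```python
# Sort search target sentences by inter-sentence similarity, descending,
# so the most similar sentence is output first.
sentences = ["A", "B", "C"]
similarities = [0.2, 0.9, 0.5]
ranked = [s for _, s in sorted(zip(similarities, sentences), reverse=True)]
assert ranked == ["B", "C", "A"]
```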
- FIG. 7 is a block diagram schematically showing a hardware configuration for realizing the document retrieval device 100.
- the document retrieval device 100 can be realized by a computer 190 including a memory 191, a processor 192, an auxiliary storage device 193, a mouse 194, a keyboard 195, and a display device 196.
- a part or all of the above-described search target context-dependent expression generation unit 102, data structure conversion unit 104, tokenizer 107, search query context-dependent expression generation unit 108, similar token table generation unit 109, inter-sentence similarity calculation unit 111, and search result output unit 112 can be configured by the memory 191 and a processor 192, such as a CPU (Central Processing Unit), that executes a program stored in the memory 191.
- a program may be provided through a network, or may be recorded and provided on a recording medium. That is, such a program may be provided as, for example, a program product.
- the search target DB 101, the search DB 105, and the similar token table storage unit 110 can be realized by the processor 192 using the auxiliary storage device 193.
- the auxiliary storage device 193 does not necessarily have to exist in the document retrieval device 100, and an auxiliary storage device existing on the cloud may be used via a communication interface (not shown).
- the similar token table storage unit 110 may be realized by the memory 191.
- the search query input unit 106 can be realized by the processor 192 using a mouse 194 and a keyboard 195 as input devices, and a display device 196.
- the mouse 194 and the keyboard 195 function as an input unit
- the display device 196 functions as a display unit.
- FIG. 8 is a flowchart showing the processing in the search target context-dependent expression generation unit 102.
- the search target context-dependent expression generation unit 102 acquires the search target token array from the search target DB 101 (S10).
- the search target context-dependent expression generation unit 102 specifies the meaning of each of the search target tokens included in the acquired search target token array according to the context, and generates the search target context-dependent expression array by arranging the search target context-dependent expressions (that is, the search target vectors) indicating the specified meanings according to the acquired search target token array (S11).
- the search target context-dependent expression generation unit 102 provides the generated search target context-dependent expression array to the data structure conversion unit 104 and the inter-sentence similarity calculation unit 111 (S12).
- FIG. 9 is a flowchart showing processing by the data structure conversion unit 104.
- the data structure conversion unit 104 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 102 (S20).
- the data structure conversion unit 104 converts the acquired search target context-dependent expression array into the search data structure used to search, by a search method that is more efficient than a brute-force search, for the search target tokens having a relatively high similarity to a search query token (S21).
- the data structure conversion unit 104 provides the converted search data structure to the search DB 105 (S22).
- the search DB 105 stores the provided search data structure.
- FIG. 10 is a flowchart showing the processing in the tokenizer 107.
- the tokenizer 107 acquires a search query from the search query input unit 106 (S30).
- the tokenizer 107 identifies the search query tokens, which are the smallest meaningful units, from the acquired search query, and generates the search query token array by arranging the identified search query tokens according to the search query (S31).
- the tokenizer 107 provides the generated search query token array to the search query context-dependent expression generation unit 108 (S32).
- FIG. 11 is a flowchart showing the processing in the search query context-dependent expression generation unit 108.
- the search query context-dependent expression generation unit 108 acquires the search query token array from the tokenizer 107 (S40).
- the search query context-dependent expression generation unit 108 specifies the meaning of each of the search query tokens included in the acquired search query token array according to the context, and generates a search query context-dependent expression array by arranging the vectors indicating the specified meanings (hereinafter also referred to as search query vectors), which are context-dependent expressions (hereinafter also referred to as search query context-dependent expressions), according to the acquired search query token array (S41).
- the search query context-dependent expression generation unit 108 provides the generated search query context-dependent expression array to the similar token table generation unit 109 and the inter-sentence similarity calculation unit 111 (S42).
- FIG. 12 is a flowchart showing processing in the similar token table generation unit 109.
- the similar token table generation unit 109 acquires a search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S50). Further, the similar token table generation unit 109 acquires the search data structure from the search DB 105 (S51).
- the similar token table generation unit 109 searches the search data structure, using a search method that is more efficient than a brute-force search, for the search target context-dependent expressions having relatively high similarity to each of the search query context-dependent expressions included in the search query context-dependent expression array, and thereby generates a similar token table indicating whether the similarity between each of the search query context-dependent expressions and each of the search target context-dependent expressions is high or low (S52).
- the similar token table generation unit 109 provides the generated similar token table to the similar token table storage unit 110 and stores it (S53).
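Step S52 can be sketched in miniature as follows. For clarity this sketch scans all target vectors exactly; in practice the approximate search structure described above would replace the scan. The function names and the choice of cosine similarity are assumptions of this sketch, not requirements of the embodiment.

```python
import math

def top_k_similar(query_vec, target_vecs, k=2):
    """Return the ids of the k target vectors most cosine-similar to query_vec."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    scored = sorted(target_vecs.items(),
                    key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [tid for tid, _ in scored[:k]]

def build_similar_token_table(query_vecs, target_vecs, k=2):
    """Map each query token id to the set of target token ids judged to have
    high similarity; every pair absent from the set is treated as low."""
    return {qid: set(top_k_similar(qvec, target_vecs, k))
            for qid, qvec in query_vecs.items()}
```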
- FIG. 13 is a flowchart showing processing in the inter-sentence similarity calculation unit 111.
- the inter-sentence similarity calculation unit 111 acquires a similar token table from the similar token table storage unit 110 (S60). Further, the inter-sentence similarity calculation unit 111 acquires a search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S61). Further, the inter-sentence similarity calculation unit 111 acquires a search target context-dependent expression array from the search target context-dependent expression generation unit 102 (S62).
- the inter-sentence similarity calculation unit 111 refers to the similar token table and calculates the inter-token similarity for each combination of a search query token and a search target token determined to have high similarity; for each combination determined to have low similarity, it sets the inter-token similarity to a predetermined value (for example, 0). It thereby calculates the inter-sentence similarity between each search target sentence and the search query (S63).
- the inter-sentence similarity calculation unit 111 provides the calculated inter-sentence similarity to the search result output unit 112 (S64).
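A minimal sketch of step S63 in the Maximum Alignment style described later in this document (the function names and the dict-of-sets table representation are assumptions of this sketch):

```python
def inter_sentence_similarity(query_tokens, target_tokens, token_sim, table,
                              default=0.0):
    """For each query token, take the maximum inter-token similarity over the
    target sentence, computing it only for pairs that the similar token table
    marks as high-similarity; low-similarity pairs get `default` (e.g. 0).
    The per-query-token maxima are then averaged."""
    maxima = []
    for q in query_tokens:
        best = default
        for t in target_tokens:
            if t in table.get(q, ()):   # skip pairs marked low similarity
                best = max(best, token_sim(q, t))
        maxima.append(best)
    return sum(maxima) / len(maxima)
```

Because `token_sim` is invoked only for table-approved pairs, the expensive vector computation is skipped for every pair the table marks as low, which is exactly the source of the speedup claimed above.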
- FIG. 14 is a flowchart showing processing in the search result output unit 112.
- the search result output unit 112 acquires the inter-sentence similarity from the inter-sentence similarity calculation unit 111 (S70).
- the search result output unit 112 generates a search result from which at least the search target sentence having the highest inter-sentence similarity can be identified, by rearranging the search target sentences according to the acquired inter-sentence similarities (S71).
- the search result output unit 112 may acquire the search target sentence from the search target DB 101.
- the search result output unit 112 outputs the search result by displaying the generated search result on, for example, the display device 196 shown in FIG. 7 (S72).
- as described above, the inter-token similarity between tokens that are not determined to have high similarity can be set to a predetermined value, so the calculation load of the inter-sentence similarity can be reduced.
- FIG. 15 is a block diagram schematically showing the configuration of the document retrieval device 200, which is the information processing device according to the second embodiment.
- the document search device 200 includes a search target DB 101, a search target context-dependent expression generation unit 202, an information generation unit 103, a search query input unit 106, a tokenizer 107, a search query context-dependent expression generation unit 108, a similar token table storage unit 110, an inter-sentence similarity calculation unit 111, a search result output unit 112, and an ontology DB 213.
- the search target DB 101, the information generation unit 103, the search query input unit 106, the tokenizer 107, the search query context-dependent expression generation unit 108, the similar token table generation unit 109, the similar token table storage unit 110, the inter-sentence similarity calculation unit 111, and the search result output unit 112 are the same as those in the first embodiment.
- the ontology DB 213 is a semantic relationship information storage unit that stores an ontology that is semantic relationship information indicating the semantic relationship of tokens.
- the ontology shall indicate at least one of the synonymous relationship and the inclusion relationship of tokens as a semantic relationship.
- the ontology DB 213 can be realized, for example, by using the auxiliary storage device 193 shown in FIG. 7.
- the search target context-dependent expression generation unit 202 acquires the search target token array from the search target DB 101. Then, by referring to the ontology stored in the ontology DB 213, the search target context-dependent expression generation unit 202 divides the search target tokens included in the acquired search target token array into groups of tokens that can be treated as having the same meaning. For example, the search target context-dependent expression generation unit 202 puts search target tokens that the ontology shows to have a synonymous relationship or an inclusion relationship into one group. Specifically, since "vacation" and "holiday" both mean a day off, in other words, they have a synonymous relationship, the search target context-dependent expression generation unit 202 puts them into one group.
- the search target context-dependent expression generation unit 202 assigns one search target context-dependent expression to one group and generates a search target context-dependent expression array.
- the search target context-dependent expression generation unit 202 generates a search target vector that is the same search target context-dependent expression from a plurality of search target tokens whose specified meanings have a synonymous relationship or an inclusion relationship.
- the search target context-dependent expression generation unit 202 may use the search target context-dependent expression of any one of the search target tokens included in a group as the search target context-dependent expression of that group, or may use a representative value (for example, the average) of the search target context-dependent expressions of the search target tokens included in the group as the search target context-dependent expression of the group.
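The grouping step with the averaged representative value can be sketched as follows (the function names and the set-of-synonym-sets ontology representation are assumptions of this sketch; the averaging is one of the representative values the text allows):

```python
def group_by_ontology(token_vecs, synonym_sets):
    """Collapse tokens that an ontology marks as synonymous into one group
    and assign each group the element-wise mean of its members' vectors;
    ungrouped tokens keep their own vector."""
    grouped = {}
    seen = set()
    for syn in synonym_sets:
        members = [t for t in syn if t in token_vecs]
        if not members:
            continue
        dim = len(token_vecs[members[0]])
        mean = tuple(sum(token_vecs[m][d] for m in members) / len(members)
                     for d in range(dim))
        for m in members:
            grouped[m] = mean
            seen.add(m)
    for t, v in token_vecs.items():
        if t not in seen:
            grouped[t] = tuple(v)
    return grouped
```

Since every member of a group now shares one vector, the downstream similar-token search has fewer distinct expressions to compare, which is the load reduction described below.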
- FIG. 16 is a flowchart showing the processing in the search target context-dependent expression generation unit 202 in the second embodiment.
- the search target context-dependent expression generation unit 202 acquires the search target token array from the search target DB 101 (S80). Further, the search target context-dependent expression generation unit 202 acquires an ontology from the ontology DB 213 (S81).
- the search target context-dependent expression generation unit 202 specifies the meaning of each of the search target tokens included in the acquired search target token array according to the context, forms groups using the specified meanings by referring to the acquired ontology, assigns one search target context-dependent expression to the search target tokens belonging to each group, and assigns a search target context-dependent expression for the specified meaning to each search target token not belonging to any group, thereby generating the search target context-dependent expression array (S82).
- the search target context-dependent expression generation unit 202 provides the generated search target context-dependent expression array to the data structure conversion unit 104 and the inter-sentence similarity calculation unit 111 (S83).
- as described above, grouping reduces the number of combinations for which the similar token table generation unit 109 must determine whether a search query token and a search target token have high similarity, so the processing load on the similar token table generation unit 109 can be reduced.
- FIG. 17 is a block diagram schematically showing the configuration of the document retrieval device 300, which is the information processing device according to the third embodiment.
- the document search device 300 includes a search target DB 101, a search target context-dependent expression generation unit 202, an information generation unit 103, a search query input unit 106, a tokenizer 107, a search query context-dependent expression generation unit 108, a similar token table storage unit 110, an inter-sentence similarity calculation unit 111, a search result output unit 112, an ontology DB 213, a search target dimension reduction unit 314, and a search query dimension reduction unit 315.
- the search target DB 101, the information generation unit 103, the search query input unit 106, the tokenizer 107, the search query context-dependent expression generation unit 108, the similar token table generation unit 109, the similar token table storage unit 110, the inter-sentence similarity calculation unit 111, and the search result output unit 112 are the same as those in the first embodiment.
- however, the search query context-dependent expression generation unit 108 in the third embodiment provides the search query context-dependent expression array to the search query dimension reduction unit 315 and the inter-sentence similarity calculation unit 111.
- search target context-dependent expression generation unit 202 and the ontology DB 213 in the third embodiment are the same as the search target context-dependent expression generation unit 202 and the ontology DB 213 in the second embodiment.
- however, the search target context-dependent expression generation unit 202 in the third embodiment provides the search target context-dependent expression array to the search target dimension reduction unit 314 and the inter-sentence similarity calculation unit 111.
- the search target dimension reduction unit 314 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 202. Then, the search target dimension reduction unit 314 compresses the dimensions of all the search target context-dependent expressions included in the acquired array to generate low-dimensional search target context-dependent expressions (that is, low-dimensional search target vectors), and arranges them to generate a dimension-reduced low-dimensional search target context-dependent expression array.
- the search target dimension reduction unit 314 provides the data structure conversion unit 104 with the generated low-dimensional search target context-dependent expression array. Any known technique such as principal component analysis may be used for dimension compression.
- the data structure conversion unit 104 in the third embodiment converts the low-dimensional search target context-dependent expression array into the search data structure.
- the conversion method is the same as in the first embodiment.
- the search query dimension reduction unit 315 acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108. Then, the search query dimension reduction unit 315 is a search dimension reduction unit that compresses the dimensions of all the search query context-dependent expressions included in the acquired array to generate low-dimensional search query context-dependent expressions (that is, low-dimensional search vectors), and arranges them to generate a dimension-reduced low-dimensional search query context-dependent expression array. The search query dimension reduction unit 315 provides the generated low-dimensional search query context-dependent expression array to the similar token table generation unit 109. Any known technique, such as principal component analysis, may be used for the dimension compression.
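Since the text permits "any known technique" for dimension compression, the step can be sketched with random projection, a lighter stand-in for principal component analysis (the function name and the choice of random projection are assumptions of this sketch):

```python
import random

def reduce_dimensions(vectors, out_dim, seed=0):
    """Project each input vector onto `out_dim` random Gaussian directions,
    producing a low-dimensional array with one entry per input vector."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]
    return [tuple(sum(p * v for p, v in zip(row, vec)) for row in proj)
            for vec in vectors]
```

The same seed must be used for the search target array and the search query array so that both are projected into the same low-dimensional space before neighbor search.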
- the similar token table generation unit 109 uses the low-dimensional search query context-dependent expression array acquired from the search query dimension reduction unit 315 and the search data structure acquired from the search DB 105 to generate a similar token table. Generate.
- the generation method is the same as that of the first embodiment.
- the information generation unit 103 generates the similar token table using the low-dimensional search target context-dependent expression array generated by the search target dimension reduction unit 314 and the low-dimensional search query context-dependent expression array generated by the search query dimension reduction unit 315. Specifically, the information generation unit 103 searches, from the plurality of points indicated by the plurality of low-dimensional search target vectors, for one or more neighboring points located in the vicinity of the point indicated by one low-dimensional search vector among the plurality of low-dimensional search vectors; it determines that the one or more combinations of the one search token corresponding to the point indicated by that low-dimensional search vector and the one or more search target tokens corresponding to the one or more neighboring points have high similarity, and determines that the one or more combinations of that search token and the one or more search target tokens corresponding to the one or more points other than those neighboring points have low similarity, thereby generating the similar token table.
- the information generation unit 103 searches for the one or more neighboring points using a search method that is more efficient than a brute-force search that calculates all the distances between the point corresponding to the one low-dimensional search vector and the plurality of points corresponding to the plurality of low-dimensional search target vectors.
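The high/low split that follows a neighbor search can be expressed directly (the function name is an assumption of this sketch; note that no distance is ever computed for the low set, which is where the savings come from):

```python
def classify_pairs(query_token, neighbor_tokens, all_target_tokens):
    """Split the (query, target) pairs into the 'high' and 'low' similarity
    sets described in the text: targets found as neighbors of the query's
    point are high, every other target is low."""
    neighbors = set(neighbor_tokens)
    high = {(query_token, t) for t in all_target_tokens if t in neighbors}
    low = {(query_token, t) for t in all_target_tokens if t not in neighbors}
    return high, low
```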
- a part or all of the search target dimension reduction unit 314 and the search query dimension reduction unit 315 described above can be configured by the memory 191 shown in FIG. 7 and the processor 192 that executes a program stored in the memory 191.
- FIG. 18 is a flowchart showing the processing in the search target dimension reduction unit 314.
- the search target dimension reduction unit 314 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 202 (S90).
- the search target dimension reduction unit 314 generates a low-dimensional search target context-dependent expression array by reducing the dimensions of all the search target context-dependent expressions included in the acquired search target context-dependent expression array (S91).
- the search target dimension reduction unit 314 provides the data structure conversion unit 104 with a low-dimensional search target context-dependent expression array (S92).
- FIG. 19 is a flowchart showing the processing in the search query dimension reduction unit 315.
- the search query dimensionality reduction unit 315 acquires a search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S100).
- the search query dimension reduction unit 315 generates a low-dimensional search query context-dependent expression array by reducing the dimensions of all the search query context-dependent expressions included in the acquired search query context-dependent expression array (S101).
- the search query dimension reduction unit 315 provides the low-dimensional search query context-dependent expression array to the similar token table generation unit 109 (S102).
- as described above, reducing the dimensions can mitigate the processing load on the similar token table generation unit 109.
- in the above description, the search target DB 101 stores a plurality of search target sentences and a plurality of search target token arrays corresponding to them, but Embodiments 1 to 3 are not limited to such an example.
- for example, the search target DB 101 may store only the plurality of search target sentences, and the search target context-dependent expression generation unit 102 may generate the corresponding search target token arrays using a known technique.
- in the above description, the tokenizer 107 generates the search query token array, but Embodiments 1 to 3 are not limited to such an example.
- for example, the search query context-dependent expression generation unit 108 may generate the search query token array from the search query using a known technique.
- in the above description, the search target context-dependent expression generation units 102 and 202 and the search query context-dependent expression generation unit 108 generate context-dependent vectors from the tokens.
- however, Embodiments 1 to 3 are not limited to such an example.
- context-independent vectors having a one-to-one correspondence with the tokens may be generated instead.
- even in that case, the calculation load of the inter-sentence similarity can be reduced without preparing in advance a lookup table that stores the inter-token similarity, which is the similarity between tokens.
- the search target dimension reduction unit 314 and the search query dimension reduction unit 315 are added to the second embodiment, but these may be added to the first embodiment.
- 100, 200, 300 document search device, 101 search target DB, 102, 202 search target context-dependent expression generation unit, 103, 303 information generation unit, 104 data structure conversion unit, 105 search DB, 106 search query input unit, 107 tokenizer, 108 search query context-dependent expression generation unit, 109 similar token table generation unit, 111 inter-sentence similarity calculation unit, 112 search result output unit, 213 ontology DB, 314 search target dimension reduction unit, 315 search query dimension reduction unit.
Abstract
Description
FIG. 1 is a block diagram schematically showing the configuration of the document search device 100, which is the information processing apparatus according to the first embodiment.
The document search device 100 includes a search target database (hereinafter, search target DB) 101, a search target context-dependent expression generation unit 102, an information generation unit 103, a search query input unit 106, a tokenizer 107, a search query context-dependent expression generation unit 108, a similar token table storage unit 110, an inter-sentence similarity calculation unit 111, and a search result output unit 112.
The information generation unit 103 includes a data structure conversion unit 104, a search database (hereinafter, search DB) 105, and a similar token table generation unit 109.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, CoRR, abs/1810.04805, May 24, 2018
These algorithms are described in the following reference.
Toshikazu Wada, "Theory and Algorithms of Nearest Neighbor Search", IPSJ SIG Technical Report on Computer Vision and Image Media, no. 13, 2009
For example, the search query input unit 106 accepts as a search query a question sentence such as "When does the summer vacation start and end?"
FIG. 4 is a schematic diagram showing an example of the search query token array.
In the example shown in FIG. 4, the r-th token of the search query is stored at the r-th position of the search query token array.
In the example shown in FIG. 5, a vector that is the context-dependent expression corresponding to the r-th token of the search query is stored at the r-th position of the search query context-dependent expression array.
The example shown in FIG. 6 is a lookup table that indicates, for each token included in the above search query "When does the summer vacation..." when that query is input, whether the similarity of each token included in all the search target sentences is relatively high or low among all the search target sentences.
Here, a k-approximate nearest neighbor search algorithm can be applied to generate the similar token table, which has the advantage of reducing the amount of computation.
The information generation unit 103 generates the similar token table by searching, from the plurality of points indicated by the plurality of search target vectors, for one or more neighboring points located in the vicinity of the point indicated by one search vector among the plurality of search vectors; it determines that one or more combinations of the one search token corresponding to the point indicated by that search vector and the one or more search target tokens corresponding to the one or more neighboring points have high similarity, and determines that one or more combinations of that search token and the one or more search target tokens corresponding to points other than those neighboring points have low similarity. Here, the information generation unit 103 searches for the one or more neighboring points using a search method that is more efficient than a brute-force search that calculates all the distances between the point corresponding to the one search vector and the plurality of points corresponding to the plurality of search target vectors.
The similar token table indicates whether a combination of each of the plurality of search target tokens and each of the plurality of search tokens has a high similarity or a low similarity.
Any inter-token similarity may be used to calculate the inter-sentence similarity. For example, the inter-sentence similarity may be calculated using the Maximum Alignment method described in Non-Patent Document 1 above.
Here, the calculation of the inter-sentence similarity by the general Maximum Alignment method is described first, followed by the accelerated inter-sentence similarity calculation in the first embodiment.
As shown in equation (2) below, this corresponds to obtaining the inter-sentence similarities s between the search query and all the search target sentences, and generating the inter-sentence similarity S(x, Y) between the search query and each search target sentence.
However, the amount of computation required to obtain the similarity matrix A(i) is O(|x|Σj|Yj|). Therefore, when the set of search target sentences is large, the computation of Σj|Yj| is enormous, which poses the problem of an impractical amount of computation.
In the Maximum Alignment method before acceleration, for each search target sentence, the inter-token similarity values between a search query token and all of that sentence's search target tokens are compared relatively and the maximum value is taken, whereby the maximum inter-token similarity max between the search query token xi and the search target sentence Yj is obtained, as shown in equation (6) above.
Therefore, when an inter-token similarity is relatively low among all the search target sentences, the inter-sentence similarity calculation unit 111 omits the calculation of that inter-token similarity (for example, approximates it as 0), thereby speeding up the inter-document similarity calculation.
For example, in the example shown in FIG. 6, in the row of the search query token "summer", the search target tokens "holiday" and "summer season" are returned by Simset(xi).
Here, any sorting method, such as ascending or descending order of the inter-sentence similarity, may be selected for the rearrangement.
As shown in FIG. 7, the document search device 100 can be realized by a computer 190 including a memory 191, a processor 192, an auxiliary storage device 193, a mouse 194, a keyboard 195, and a display device 196.
The search query input unit 106 can be realized by the processor 192 using the mouse 194 and the keyboard 195, which serve as input devices, and the display device 196. The mouse 194 and the keyboard 195 function as an input unit, and the display device 196 functions as a display unit.
First, the search target context-dependent expression generation unit 102 acquires the search target token array from the search target DB 101 (S10).
First, the data structure conversion unit 104 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 102 (S20).
The tokenizer 107 acquires the search query from the search query input unit 106 (S30).
First, the search query context-dependent expression generation unit 108 acquires the search query token array from the tokenizer 107 (S40).
First, the similar token table generation unit 109 acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S50).
The similar token table generation unit 109 also acquires the search data structure from the search DB 105 (S51).
First, the inter-sentence similarity calculation unit 111 acquires the similar token table from the similar token table storage unit 110 (S60).
The inter-sentence similarity calculation unit 111 also acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S61).
Further, the inter-sentence similarity calculation unit 111 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 102 (S62).
First, the search result output unit 112 acquires the inter-sentence similarities from the inter-sentence similarity calculation unit 111 (S70).
FIG. 15 is a block diagram schematically showing the configuration of the document search device 200, which is the information processing apparatus according to the second embodiment.
The document search device 200 includes the search target DB 101, a search target context-dependent expression generation unit 202, the information generation unit 103, the search query input unit 106, the tokenizer 107, the search query context-dependent expression generation unit 108, the similar token table storage unit 110, the inter-sentence similarity calculation unit 111, the search result output unit 112, and an ontology DB 213.
First, the search target context-dependent expression generation unit 202 acquires the search target token array from the search target DB 101 (S80).
The search target context-dependent expression generation unit 202 also acquires the ontology from the ontology DB 213 (S81).
FIG. 17 is a block diagram schematically showing the configuration of the document search device 300, which is the information processing apparatus according to the third embodiment.
The document search device 300 includes the search target DB 101, the search target context-dependent expression generation unit 202, the information generation unit 103, the search query input unit 106, the tokenizer 107, the search query context-dependent expression generation unit 108, the similar token table storage unit 110, the inter-sentence similarity calculation unit 111, the search result output unit 112, the ontology DB 213, a search target dimension reduction unit 314, and a search query dimension reduction unit 315.
However, the search query context-dependent expression generation unit 108 in the third embodiment provides the search query context-dependent expression array to the search query dimension reduction unit 315 and the inter-sentence similarity calculation unit 111.
However, the search target context-dependent expression generation unit 202 in the third embodiment provides the search target context-dependent expression array to the search target dimension reduction unit 314 and the inter-sentence similarity calculation unit 111.
Specifically, the information generation unit 103 generates the similar token table by searching, from the plurality of points indicated by the plurality of low-dimensional search target vectors, for one or more neighboring points located in the vicinity of the point indicated by one low-dimensional search vector among the plurality of low-dimensional search vectors; it determines that one or more combinations of the one search token corresponding to the point indicated by that low-dimensional search vector and the one or more search target tokens corresponding to the one or more neighboring points have high similarity, and determines that one or more combinations of that search token and the one or more search target tokens corresponding to the one or more points other than those neighboring points have low similarity. Here, the information generation unit 103 searches for the one or more neighboring points using a search method that is more efficient than a brute-force search that calculates all the distances between the point corresponding to the one low-dimensional search vector and the plurality of points corresponding to the plurality of low-dimensional search target vectors.
First, the search target dimension reduction unit 314 acquires the search target context-dependent expression array from the search target context-dependent expression generation unit 202 (S90).
First, the search query dimension reduction unit 315 acquires the search query context-dependent expression array from the search query context-dependent expression generation unit 108 (S100).
Claims (10)
- An information processing apparatus comprising:
a search target storage unit that stores a plurality of search target sentences including a plurality of search target tokens, each of which is a smallest unit having a meaning;
a similarity determination information storage unit that stores similarity determination information indicating whether a combination of each of the plurality of search target tokens and each of a plurality of search tokens, which are smallest units having a meaning and are included in a search sentence, has a high similarity or a low similarity; and
an inter-sentence similarity calculation unit that calculates an inter-sentence similarity between the search sentence and each of the plurality of search target sentences by calculating an inter-token similarity for each combination indicated as having the high similarity in the similarity determination information, and by setting the inter-token similarity to a predetermined value for each combination indicated as having the low similarity in the similarity determination information.
- The information processing apparatus according to claim 1, further comprising:
a search target vector generation unit that generates a plurality of search target vectors, each of which is a vector corresponding to the meaning of each of the plurality of search target tokens;
a search vector generation unit that generates a plurality of search vectors, each of which is a vector corresponding to the meaning of each of the plurality of search tokens; and
an information generation unit that generates the similarity determination information by searching, from a plurality of points indicated by the plurality of search target vectors, for one or more neighboring points located in the vicinity of a point indicated by one search vector among the plurality of search vectors, determining that one or more combinations of one search token corresponding to the point indicated by the one search vector and one or more search target tokens corresponding to the one or more neighboring points have the high similarity, and determining that one or more combinations of the one search token and one or more search target tokens corresponding to one or more points other than the one or more neighboring points have the low similarity,
wherein the information generation unit searches for the one or more neighboring points using a search method that is more efficient than a brute-force search that calculates all distances between the point corresponding to the one search vector and the plurality of points corresponding to the plurality of search target vectors.
- The information processing apparatus according to claim 1, further comprising:
a search target vector generation unit that generates a plurality of search target vectors, each of which is a vector corresponding to the meaning of each of the plurality of search target tokens;
a search target dimension reduction unit that generates a plurality of low-dimensional search target vectors by reducing the dimension of each of the plurality of search target vectors;
a search vector generation unit that generates a plurality of search vectors, each of which is a vector corresponding to the meaning of each of the plurality of search tokens;
a search dimension reduction unit that generates a plurality of low-dimensional search vectors by reducing the dimension of each of the plurality of search vectors; and
an information generation unit that generates the similarity determination information by searching, from a plurality of points indicated by the plurality of low-dimensional search target vectors, for one or more neighboring points located in the vicinity of a point indicated by one low-dimensional search vector among the plurality of low-dimensional search vectors, determining that one or more combinations of one search token corresponding to the point indicated by the one low-dimensional search vector and one or more search target tokens corresponding to the one or more neighboring points have the high similarity, and determining that one or more combinations of the one search token and one or more search target tokens corresponding to one or more points other than the one or more neighboring points have the low similarity,
wherein the information generation unit searches for the one or more neighboring points using a search method that is more efficient than a brute-force search that calculates all distances between the point corresponding to the one low-dimensional search vector and the plurality of points corresponding to the plurality of low-dimensional search target vectors.
- The information processing apparatus according to claim 2 or 3, wherein the information generation unit searches for the one or more neighboring points by a k-approximate nearest neighbor search that searches for k neighboring points (k being an integer of 1 or more).
- The information processing apparatus according to any one of claims 2 to 4, wherein the search target vector generation unit specifies the meaning of each of the plurality of search target tokens according to the context of each of the plurality of search target sentences to generate the plurality of search target vectors, and the search vector generation unit specifies the meaning of each of the plurality of search tokens according to the context of the search sentence to generate the plurality of search vectors.
- The information processing apparatus according to claim 5, wherein the search target vector generation unit generates the same search target vector from a plurality of search target tokens whose specified meanings have a synonymous relationship or an inclusion relationship.
- The information processing apparatus according to claim 1, further comprising:
a search target vector generation unit that generates a plurality of search target vectors, each of which is a vector corresponding to the meaning of each of the plurality of search target tokens; and
a search vector generation unit that generates a plurality of search vectors, each of which is a vector corresponding to the meaning of each of the plurality of search tokens,
wherein, when calculating the inter-token similarity, the inter-sentence similarity calculation unit makes the inter-token similarity of the combination of one search target vector among the plurality of search target vectors and one search vector among the plurality of search vectors higher as the distance between the point indicated by the one search target vector and the point indicated by the one search vector is shorter.
- The information processing apparatus according to any one of claims 1 to 7, wherein the inter-sentence similarity calculation unit calculates the inter-sentence similarity between the search sentence and one search target sentence among the plurality of search target sentences by specifying, for each of the plurality of search tokens, the maximum value of the inter-token similarity over its combinations with the plurality of search target tokens included in the one search target sentence, and averaging the specified maximum values.
- A program that causes a computer to function as:
a search target storage unit that stores a plurality of search target sentences including a plurality of search target tokens, each of which is a smallest unit having a meaning;
a similarity determination information storage unit that stores similarity determination information indicating whether a combination of each of the plurality of search target tokens and each of a plurality of search tokens, which are smallest units having a meaning and are included in a search sentence, has a high similarity or a low similarity; and
an inter-sentence similarity calculation unit that calculates an inter-sentence similarity between the search sentence and each of the plurality of search target sentences by calculating an inter-token similarity for each combination indicated as having the high similarity in the similarity determination information, and by setting the inter-token similarity to a predetermined value for each combination indicated as having the low similarity in the similarity determination information.
- An information processing method for calculating a plurality of inter-sentence similarities between a plurality of search target sentences, each including a plurality of search target tokens that are smallest units having a meaning, and a search sentence including a plurality of search tokens that are smallest units having a meaning, the method comprising:
accepting input of the search sentence; and
calculating the inter-sentence similarity between the search sentence and each of the plurality of search target sentences by calculating an inter-token similarity for each combination indicated as having a high similarity in similarity determination information, which indicates whether a combination of each of the plurality of search target tokens and each of the plurality of search tokens has the high similarity or a low similarity, and by setting the inter-token similarity to a predetermined value for each combination indicated as having the low similarity in the similarity determination information.
Priority Applications (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021541602A JP7058807B2 (ja) | 2019-09-03 | 2019-09-03 | 情報処理装置、プログラム及び情報処理方法 |
PCT/JP2019/034632 WO2021044519A1 (ja) | 2019-09-03 | 2019-09-03 | 情報処理装置、プログラム及び情報処理方法 |
DE112019007599.3T DE112019007599T5 (de) | 2019-09-03 | 2019-09-03 | Informationsverarbeitungsvorrichtung, programm und informationsverarbeitungsverfahren |
KR1020227005501A KR102473788B1 (ko) | 2019-09-03 | 2019-09-03 | 정보 처리 장치, 컴퓨터 판독 가능한 기록 매체 및 정보 처리 방법 |
CN201980099685.0A CN114341837A (zh) | 2019-09-03 | 2019-09-03 | 信息处理装置、程序以及信息处理方法 |
TW109108561A TWI770477B (zh) | 2019-09-03 | 2020-03-16 | 資訊處理裝置、儲存媒體、程式產品及資訊處理方法 |
US17/676,963 US20220179890A1 (en) | 2019-09-03 | 2022-02-22 | Information processing apparatus, non-transitory computer-readable storage medium, and information processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2019/034632 WO2021044519A1 (ja) | 2019-09-03 | 2019-09-03 | 情報処理装置、プログラム及び情報処理方法 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/676,963 Continuation US20220179890A1 (en) | 2019-09-03 | 2022-02-22 | Information processing apparatus, non-transitory computer-readable storage medium, and information processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021044519A1 true WO2021044519A1 (ja) | 2021-03-11 |
Family
ID=74852567
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2019/034632 WO2021044519A1 (ja) | 2019-09-03 | 2019-09-03 | 情報処理装置、プログラム及び情報処理方法 |
Country Status (7)
Country | Link |
---|---|
US (1) | US20220179890A1 (ja) |
JP (1) | JP7058807B2 (ja) |
KR (1) | KR102473788B1 (ja) |
CN (1) | CN114341837A (ja) |
DE (1) | DE112019007599T5 (ja) |
TW (1) | TWI770477B (ja) |
WO (1) | WO2021044519A1 (ja) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220374345A1 (en) * | 2021-05-24 | 2022-11-24 | Infor (Us), Llc | Techniques for similarity determination across software testing configuration data entities |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259627A (ja) * | 1999-03-08 | 2000-09-22 | Ai Soft Kk | 自然言語文関係判定装置、自然言語文関係判定方法およびこれを用いた検索装置、検索方法ならびに記録媒体 |
JP2009217689A (ja) * | 2008-03-12 | 2009-09-24 | National Institute Of Information & Communication Technology | 情報処理装置、情報処理方法、及びプログラム |
JP2019082931A (ja) * | 2017-10-31 | 2019-05-30 | 三菱重工業株式会社 | 検索装置、類似度算出方法、およびプログラム |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101662450B1 (ko) | 2015-05-29 | 2016-10-05 | 포항공과대학교 산학협력단 | 다중 소스 하이브리드 질의응답 방법 및 시스템 |
KR20170018620A (ko) * | 2015-08-10 | 2017-02-20 | 삼성전자주식회사 | 유사 문장 식별 방법 및 이를 적용한 식별 장치 |
KR101841615B1 (ko) * | 2016-02-05 | 2018-03-26 | 한국과학기술원 | 의미 기반 명사 유사도 계산 장치 및 방법 |
TW201820172A (zh) * | 2016-11-24 | 2018-06-01 | 財團法人資訊工業策進會 | 對話模式分析系統、方法及非暫態電腦可讀取記錄媒體 |
CN108959551B (zh) * | 2018-06-29 | 2021-07-13 | 北京百度网讯科技有限公司 | 近邻语义的挖掘方法、装置、存储介质和终端设备 |
-
2019
- 2019-09-03 KR KR1020227005501A patent/KR102473788B1/ko active IP Right Grant
- 2019-09-03 DE DE112019007599.3T patent/DE112019007599T5/de active Pending
- 2019-09-03 CN CN201980099685.0A patent/CN114341837A/zh active Pending
- 2019-09-03 JP JP2021541602A patent/JP7058807B2/ja active Active
- 2019-09-03 WO PCT/JP2019/034632 patent/WO2021044519A1/ja active Application Filing
-
2020
- 2020-03-16 TW TW109108561A patent/TWI770477B/zh active
-
2022
- 2022-02-22 US US17/676,963 patent/US20220179890A1/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259627A (ja) * | 1999-03-08 | 2000-09-22 | Ai Soft Kk | 自然言語文関係判定装置、自然言語文関係判定方法およびこれを用いた検索装置、検索方法ならびに記録媒体 |
JP2009217689A (ja) * | 2008-03-12 | 2009-09-24 | National Institute Of Information & Communication Technology | 情報処理装置、情報処理方法、及びプログラム |
JP2019082931A (ja) * | 2017-10-31 | 2019-05-30 | 三菱重工業株式会社 | 検索装置、類似度算出方法、およびプログラム |
Also Published As
Publication number | Publication date |
---|---|
KR102473788B1 (ko) | 2022-12-02 |
KR20220027273A (ko) | 2022-03-07 |
TW202111571A (zh) | 2021-03-16 |
JPWO2021044519A1 (ja) | 2021-03-11 |
TWI770477B (zh) | 2022-07-11 |
US20220179890A1 (en) | 2022-06-09 |
CN114341837A (zh) | 2022-04-12 |
JP7058807B2 (ja) | 2022-04-22 |
DE112019007599T5 (de) | 2022-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9087111B2 (en) | Personalized tag ranking | |
US10984344B2 (en) | Document classifying device | |
JP5346279B2 (ja) | Annotation by search | |
TW201805839A (zh) | Data processing method, device, and system | |
CN108399213B (zh) | Clustering method and system for user personal files | |
Wei et al. | Scalable heterogeneous translated hashing | |
CN113806487B (zh) | Neural-network-based semantic search method, apparatus, device, and storage medium | |
CN104484392A (zh) | Database query statement generation method and device | |
US7716144B2 (en) | Consistent weighted sampling of multisets and distributions | |
CN104615723B (zh) | Method and device for determining weight values of query terms | |
JP7058807B2 (ja) | Information processing device, program, and information processing method | |
JPWO2015145981A1 (ja) | Multilingual document similarity learning device, multilingual document similarity determination device, multilingual document similarity learning method, multilingual document similarity determination method, and multilingual document similarity learning program | |
Wei et al. | Heterogeneous translated hashing: A scalable solution towards multi-modal similarity search | |
JP5224537B2 (ja) | Locality-sensitive hash construction device, similar-neighbor search processing device, and program | |
CN112256730A (zh) | Information retrieval method and device, electronic apparatus, and readable storage medium | |
Afreen et al. | Document clustering using different unsupervised learning approaches: A survey | |
JP7297855B2 (ja) | Keyword extraction device, keyword extraction method, and program | |
Li et al. | Parallel image search application based on online hashing hierarchical ranking | |
JP2011248827A (ja) | Cross-lingual information retrieval method, cross-lingual information retrieval system, and cross-lingual information retrieval program | |
KR100952077B1 (ko) | Device and method for selecting headwords using keywords | |
JP6403850B1 (ja) | Information processing device, information processing method, and program | |
JP2023013868A (ja) | Information processing device, information processing method, and information processing program | |
JP2023013869A (ja) | Information processing device, information processing method, and information processing program | |
JP2023013864A (ja) | Information processing device, information processing method, and information processing program | |
JP2023013863A (ja) | Information processing device, information processing method, and information processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19944453; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2021541602; Country of ref document: JP; Kind code of ref document: A |
 | ENP | Entry into the national phase | Ref document number: 20227005501; Country of ref document: KR; Kind code of ref document: A |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 19944453; Country of ref document: EP; Kind code of ref document: A1 |