WO2021038887A1 - Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device - Google Patents

Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device

Info

Publication number
WO2021038887A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
word
search target
hash function
words
Prior art date
Application number
PCT/JP2019/034306
Other languages
French (fr)
Japanese (ja)
Inventor
謙介 馬場
智哉 野呂
茂紀 福田
清司 大倉
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to PCT/JP2019/034306 priority Critical patent/WO2021038887A1/en
Priority to JP2021541969A priority patent/JP7193000B2/en
Publication of WO2021038887A1 publication Critical patent/WO2021038887A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • An embodiment of the present invention relates to a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device.
  • a search process for searching a document similar to an input document from the documents stored in a database. For example, a sample inquiry document and a response document corresponding to the sample are registered in the database. Then, it is conceivable to construct a dialogue interface such as a chatbot that searches for a sample similar to the input inquiry document and outputs a response document corresponding to the similar sample.
  • a plurality of hash functions (sometimes called Min-hash functions) that calculate one hash value from a word set included in a certain document are defined.
  • Each hash function has a correspondence relationship in which different values are associated with different words, and outputs the smallest value among the values corresponding to the words included in a certain word set as a hash value.
  • a vector enumerating a plurality of hash values calculated by using the plurality of hash functions is generated in advance. Then, a vector is similarly calculated from the word set included in the input document and the above-mentioned plurality of hash functions, and a vector similar to the sample vector registered in the database is searched for.
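As a rough illustration of this conventional scheme, the following Python sketch builds Min-hash functions as random permutations of the vocabulary and summarizes each document's word set as a vector of minimum values. The vocabulary, documents, and helper names are hypothetical and are not taken from this publication.

```python
import random

# Hypothetical vocabulary extracted from the registered sample documents.
vocabulary = ["cat", "nyanko", "dog", "rice", "food", "water", "flower"]

def make_min_hash_functions(vocab, k, seed=0):
    """Each hash function maps every word to a distinct integer (a random
    permutation of 0..|W|-1); different functions use different permutations."""
    rng = random.Random(seed)
    functions = []
    for _ in range(k):
        values = list(range(len(vocab)))
        rng.shuffle(values)
        functions.append(dict(zip(vocab, values)))
    return functions

def min_hash_vector(word_set, functions):
    """For each hash function, output the smallest value over the words in the set."""
    return [min(h[w] for w in word_set) for h in functions]

hash_functions = make_min_hash_functions(vocabulary, k=4)
sample_vector = min_hash_vector({"cat", "rice"}, hash_functions)   # registered sample
query_vector = min_hash_vector({"cat", "food"}, hash_functions)    # input document
print(sample_vector, query_vector)
```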
  • a computer executes a process of generating, a process of calculating, and a process of searching.
  • The generating process generates, based on the set of words included in the search target documents and inter-word information indicating the closeness of the meanings of words, a hash function that assigns, to each word included in the set of words, a value that is closer the closer the word's meaning is to a predetermined reference word.
  • the summary information of each of the plurality of search target documents is calculated based on the generated hash function, and the summary information of the input document is calculated based on the generated hash function.
  • the search process searches for a document similar to the input document from a plurality of search target documents based on the comparison between the calculated summary information of the search target document and the summary information of the input document.
  • FIG. 1 is a block diagram showing a functional configuration example of the information processing apparatus according to the embodiment.
  • FIG. 2 is an explanatory diagram showing an example of a processing flow of the information processing apparatus according to the embodiment.
  • FIG. 3 is a flowchart showing an example of the hash function generation process.
  • FIG. 4 is an explanatory diagram illustrating a hash function.
  • FIG. 5 is an explanatory diagram illustrating the degree of similarity between words.
  • FIG. 6 is an explanatory diagram illustrating a hash value by Min hash.
  • FIG. 7 is an explanatory diagram illustrating an outline of the operation of the information processing apparatus according to the embodiment.
  • FIG. 8 is an explanatory diagram for explaining the narrowing down of the search target documents.
  • FIG. 9 is an explanatory diagram showing a display example of the operation screen.
  • FIG. 10 is a diagram showing an example of a computer that executes a program.
  • the similar document search method, the similar document search program, the similar document search device, the index information creation method, the index information creation program, and the index information creation device according to the embodiment will be described with reference to the drawings. Configurations having the same function in the embodiment are designated by the same reference numerals, and duplicate description will be omitted.
  • The similar document search method, the similar document search program, the similar document search device, the index information creation method, the index information creation program, and the index information creation device described in the following embodiments are merely examples and do not limit the embodiments.
  • the following embodiments may be appropriately combined within a consistent range.
  • an information processing device that searches for a document similar to an input document (hereinafter, also referred to as an inquiry document) from a sample (hereinafter, also referred to as a search target document) registered in a database is illustrated.
  • first pre-processing is performed, and then search processing is performed.
  • a hash value is calculated for each search target document using a hash function, and an index structure (for example, a search tree) such as a tree structure for searching the search target document is created from the calculated hash value.
  • the hash value of the inquiry document is calculated using the hash function.
  • the information processing apparatus compares the hash value of the inquiry document with the hash value of each search target document, and searches for the closest hash value of each search target document indicated by the index structure.
  • the information processing apparatus sets a search target document having a hash value close to the hash value of the inquiry document as a search result of a similar document.
  • a plurality of hash functions for calculating the hash value are defined based on the set of words obtained by extracting the words included in the search target document registered in the database.
  • Specifically, the information processing device generates a plurality of hash functions in advance, where W is the set of words and h is the set of all injective functions from W to {0, 1, ..., |W|-1}.
  • the information processing device performs the following processing on the search target document and the inquiry document to obtain a vector with a plurality of hash values and obtain summary information summarizing the search target document and the inquiry document by a hash function.
  • - Randomly select a hash function from h.
  • - Extract the words included in the search target document and obtain a set of words.
  • - Use the smallest integer obtained via the selected h from the words in the set as the hash value.
  • - Obtain a plurality of hash values by repeating the random selection of a function multiple times.
  • Documents that are similar to each other have a high appearance ratio of common words (Jaccard coefficient). For a hash function selected from h, the probability that the hash values match equals the Jaccard coefficient. Therefore, when comparing the hash values of documents to find similar ones, the Jaccard coefficient can be estimated probabilistically from the ratio of matching vector elements, and the Hamming distance between the hash-value vectors (for example, the number of mismatches) reflects the closeness (similarity) between documents.
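Under standard Min-hash theory, the matching ratio of the vector elements estimates the Jaccard coefficient; the following short helpers (a sketch, with hypothetical function names) make that comparison explicit.

```python
def estimated_jaccard(vec_a, vec_b):
    """The fraction of positions where two Min-hash vectors agree is a
    probabilistic estimate of the Jaccard coefficient of the two word sets."""
    matches = sum(1 for a, b in zip(vec_a, vec_b) if a == b)
    return matches / len(vec_a)

def hamming_distance(vec_a, vec_b):
    """Number of mismatching positions; smaller means more similar documents."""
    return sum(1 for a, b in zip(vec_a, vec_b) if a != b)
```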
  • Let n be the number of search target documents, m the average number of words per search target document, and k the number of hash functions. Then the amount of calculation for computing the vectors of the search target documents is O(kmn).
  • The amount of calculation for computing the vector of one inquiry document is O(km), and the calculation cost of the neighborhood search using this vector is O(log(n)).
  • The k hash functions are randomly generated in advance. When k is O(log(n)), the collision probability that the same hash value is calculated from different search target documents can be made sufficiently small.
  • Therefore, in the present embodiment, the information processing device generates hash functions that assign closer values to words whose meanings are closer to a predetermined reference word, based on inter-word distance information indicating the closeness of word meanings. Specifically, the information processing device extracts the words included in the search target documents to form a word set and randomly selects a reference word. Next, based on the inter-word distance information, it sorts the words in the word set in order of how close their meanings are to the reference word and assigns a unique value in the sort order (for example, an integer value that increases in the sort order). The information processing device generates a plurality of hash functions by repeating this processing.
  • For example, when the word set is {cat, rice, ..., nyanko, food} and the reference word is {cat}, the information processing device sorts the words in order of closeness of meaning to {cat}, which yields {cat, nyanko, ..., rice, food}.
  • For the word set sorted in this way, integer values are assigned in the sort order, such as {cat = 0, nyanko = 1, ..., rice = 5, food = 6}; a minimal sketch of this procedure follows.
  • The hash value generated by such a hash function is a value corresponding to the semantic distance (also referred to as the inter-word distance) to the predetermined reference word. This makes it possible to verify similarity not only by the Hamming distance between hash values but also by the Euclidean distance. Therefore, even when there is no common word and the hash values do not match (when the elements differ), the similarity can still be verified by the Euclidean distance, and the accuracy of the similarity search can be improved.
  • FIG. 1 is a block diagram showing a functional configuration example of the information processing apparatus according to the embodiment.
  • The information processing device 100 is a device that, for an input inquiry document 101, obtains a similar document 102 from the search target documents stored in the search target document database 121.
  • As the information processing device 100, for example, a personal computer or the like can be used.
  • The information processing device 100 includes an index generation module 111 that performs the pre-processing of creating the index structure, a search module 112 that performs the search processing, and storage units that store various data related to the processing (a search target document database 121, an inter-word distance information storage unit 122, a hash function storage unit 123, and an index storage unit 124). That is, the information processing device 100 is an example of a similar document retrieval device and an index information creation device.
  • the search target document database 121 is a database in which search target documents to be searched are registered for the inquiry document 101.
  • The search target documents in the search target document database 121 may be registered in advance, may be added through dialogue with a user of the information processing device 100 in a dialogue interface such as a chatbot, or may be automatically collected from a network.
  • The inter-word distance information storage unit 122 stores inter-word distance information indicating, for each word, the closeness of its meaning to other words (the inter-word distance).
  • Specifically, the inter-word distance information includes, for example, a function d(v, w) that represents the semantic distance (inter-word distance) between each word (v) and another word (w). That is, the inter-word distance information is an example of inter-word information indicating the closeness of word meanings.
  • the hash function storage unit 123 stores a plurality of different hash functions generated by the index generation module 111.
  • Each hash function has a correspondence that associates a unique integer with each word that can appear in the search target documents (in order of closeness of meaning to a predetermined reference word), accepts a word set, and outputs one hash value. Different hash functions have different correspondences.
  • the index storage unit 124 stores an index structure for searching a search target document similar to the inquiry document 101.
  • The index structure is generated by the index generation unit 132 based on the vectors (summary information) calculated from the search target documents using the plurality of hash functions, and is an example of index information for searching the summary information of each search target document.
  • As the index structure, for example, a search tree can be used.
  • The search tree contains a plurality of nodes (leaf nodes and the nodes leading to them) connected in a tree structure.
  • A leaf node of the search tree points to a search target document.
  • the leaf node contains a vector of the search target document and identification information (for example, a document ID) that identifies the search target document.
  • the leaf node does not have to contain the vector.
  • Two child nodes are connected to each node other than the leaf node.
  • Each node other than the leaf nodes has a threshold for a particular dimension of the vector. If the hash value of that dimension in the input vector is greater than or equal to the threshold, the process proceeds to the right child node; if it is less than the threshold, the process proceeds to the left child node. In this way, by tracing the search tree from the root node to a leaf node, a search target document close to a given vector can be found efficiently, as sketched below.
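The traversal described above can be sketched as follows; the Leaf and Node classes are hypothetical stand-ins for the index structure held in the index storage unit, not the publication's actual data layout.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    doc_id: str      # identification information of the search target document
    vector: list     # summary vector (hash values) of that document

@dataclass
class Node:
    dimension: int                 # which element of the vector to test
    threshold: float               # threshold for that element's hash value
    left: Union["Node", Leaf]      # followed when the value is below the threshold
    right: Union["Node", Leaf]     # followed when the value is at or above it

def descend(root, query_vector):
    """Trace the tree from the root to a leaf by comparing the hash value of
    each node's dimension against the node's threshold."""
    node = root
    while isinstance(node, Node):
        if query_vector[node.dimension] >= node.threshold:
            node = node.right
        else:
            node = node.left
    return node
```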
  • the index generation module 111 has a hash function generation unit 131 and an index generation unit 132.
  • The hash function generation unit 131 generates a plurality of hash functions based on the search target documents stored in the search target document database 121 and the inter-word distance information stored in the inter-word distance information storage unit 122.
  • The generated hash functions are stored in the hash function storage unit 123.
  • Specifically, the hash function generation unit 131 extracts the words included in the search target documents to form a word set and randomly selects a reference word. Next, the hash function generation unit 131 refers to the inter-word distance information in the inter-word distance information storage unit 122 and sorts the words in the word set in order of how close their meanings are to the reference word. Next, the hash function generation unit 131 generates a hash function by assigning a unique value (for example, an integer value that increases in the sort order) to the words in the word set in the sort order. The hash function generation unit 131 generates a plurality of hash functions by repeating this process.
  • For example, a function d(v, w) representing the semantic distance (inter-word distance) between each word (v) and another word (w) is given as the inter-word distance information.
  • This function d(v, w) can be created in advance by referring to the similarity between words and the vector representations of words.
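One conceivable way to prepare such a function d(v, w) in advance is the cosine distance between word vectors, as in the following sketch; the embeddings shown are hypothetical placeholders for a pre-trained model, not data from this publication.

```python
import math

# Hypothetical pre-trained word vectors; in practice these could come from
# an embedding model trained in advance on a large corpus.
embeddings = {
    "cat":    [0.90, 0.10, 0.00],
    "nyanko": [0.88, 0.12, 0.02],
    "rice":   [0.10, 0.80, 0.30],
    "food":   [0.15, 0.75, 0.35],
}

def d(v, w):
    """Cosine distance between the word vectors of v and w: values near 0
    mean the meanings are close, larger values mean they are farther apart."""
    a, b = embeddings[v], embeddings[w]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(d("cat", "nyanko"), d("cat", "rice"))  # the former is much smaller
```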
  • The hash function generation unit 131 randomly selects w from W and sorts all v ∈ W in ascending order of d(v, w). Next, the hash function generation unit 131 assigns the integers 0, 1, 2, ... to the sorted words w, v1, v2, .... Note that the hash function generation unit 131 may use the value of d(v, w) as it is instead of assigning integers (assuming that there is no duplication).
  • The index generation unit 132 generates an index structure based on the search target documents stored in the search target document database 121 and the hash functions stored in the hash function storage unit 123, and stores the generated index structure in the index storage unit 124.
  • Specifically, the index generation unit 132 extracts a word set from each search target document, inputs the extracted word set to each of the plurality of hash functions, and calculates a vector of hash values, that is, the summary information of the search target document.
  • Next, the index generation unit 132 generates the index structure so that the plurality of vectors corresponding to the plurality of search target documents can be searched efficiently. For example, the index generation unit 132 focuses on one dimension of the vectors and repeatedly determines a threshold for the hash value of that dimension so that the set of vectors is divided in two, thereby generating a search tree. At this time, the index generation unit 132 generates intermediate nodes so that, as far as possible, a single vector is associated with each leaf node of the search tree; a simplified construction is sketched below.
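A simplified recursive construction along these lines, reusing the hypothetical Leaf and Node classes from the traversal sketch above, might look as follows; the median-based split is an assumed heuristic, not a detail taken from this publication.

```python
import statistics

def build_tree(entries, depth=0):
    """entries: list of (doc_id, vector) pairs.  Repeatedly split on one
    dimension's median so that, ideally, a single vector remains per leaf."""
    if len(entries) == 1:
        doc_id, vector = entries[0]
        return Leaf(doc_id, vector)
    dimension = depth % len(entries[0][1])          # cycle through dimensions
    threshold = statistics.median(v[dimension] for _, v in entries)
    left = [e for e in entries if e[1][dimension] < threshold]
    right = [e for e in entries if e[1][dimension] >= threshold]
    if not left or not right:                       # degenerate split: stop
        doc_id, vector = entries[0]
        return Leaf(doc_id, vector)
    return Node(dimension, threshold,
                build_tree(left, depth + 1), build_tree(right, depth + 1))
```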
  • the search module 112 includes an inquiry receiving unit 133, a hash value calculation unit 134, a search unit 135, and an output unit 136.
  • the inquiry receiving unit 133 receives the inquiry document 101.
  • the inquiry receiving unit 133 may receive the inquiry document 101 input as a character string from the user, or may convert the voice signal of the inquiry utterance uttered by the user into a character string.
  • the inquiry receiving unit 133 may receive a character string or an audio signal from another information processing device.
  • the hash value calculation unit 134 generates a vector corresponding to the inquiry document 101, that is, summary information of the inquiry document 101, based on a plurality of hash functions stored in the hash function storage unit 123. Specifically, the hash value calculation unit 134 extracts a word set from the inquiry document 101, inputs the extracted word set into each of the plurality of hash functions, and calculates a hash value vector.
  • the search unit 135 searches for the search target document most similar to the inquiry document 101 by neighborhood search based on the index structure stored in the index storage unit 124 and the vector of the inquiry document 101. Specifically, the search target document most similar to the inquiry document 101 has the largest number of dimensions in which the hash values match when the vectors are compared.
  • Specifically, the search unit 135 traces the search tree stored in the index storage unit 124 from the root node toward the leaf nodes while comparing the hash value of each node's dimension in the vector of the inquiry document 101 with the node's threshold, and reaches a specific leaf node. The search unit 135 then selects the search target document corresponding to the reached leaf node.
  • Next, the search unit 135 obtains the Euclidean distance by comparing the hash values with each other, and takes the document with the smallest Euclidean distance as the most similar search target document.
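When several candidate documents are obtained, this comparison could be sketched as follows; the helper names are assumptions rather than the publication's exact procedure.

```python
import math

def euclidean_distance(vec_a, vec_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

def most_similar(query_vector, candidates):
    """candidates: list of (doc_id, vector).  Even when no element matches
    exactly, the Euclidean distance still reflects how close the
    reference-relative hash values are, so a ranking remains possible."""
    return min(candidates, key=lambda c: euclidean_distance(query_vector, c[1]))
```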
  • the output unit 136 outputs the searched search target document as a similar document 102.
  • the output unit 136 may display the character string of the similar document 102 on a display or the like, or may convert the similar document 102 into a voice signal and reproduce the voice by the speaker. Further, the output unit 136 may transmit a character string or an audio signal of the similar document 102 to another information processing device.
  • the output unit 136 may perform a process associated with the searched search target document in advance in the search target document database 121. Specifically, it is assumed that predetermined processes (for example, schedule registration and e-mail transmission) are registered in the search target document database 121 for each search target document. The output unit 136 can perform the process corresponding to the inquiry document 101 by reading the process associated with the searched search target document from the search target document database 121 and executing the process.
  • FIG. 2 is an explanatory diagram showing an example of a processing flow of the information processing apparatus according to the embodiment.
  • the information processing apparatus 100 performs a pre-processing (S1) for creating an index structure and a search process (S2) for searching and outputting a similar document 102 for the inquiry document 101.
  • S1 pre-processing
  • S2 search process
  • the pre-processing (S1) will be described.
  • the search target document is first read from the search target document database 121 and input to the hash function generation unit 131 (S11).
  • Next, the hash function generation unit 131 receives input of the inter-word distance information indicating the distances between words (S13), and generates a plurality of hash functions 123a based on the input search target documents and the inter-word distances (S12).
  • FIG. 3 is a flowchart showing an example of the hash function generation process.
  • The hash function generation unit 131 accepts input of the search target documents (S31), extracts the words (appearing words) included in the search target documents (S32), and forms a set of the words (word set) (S33).
  • Next, the hash function generation unit 131 generates a plurality of hash functions 123a by repeating the processes of S34 to S39 k times, where k is the number of hash functions to be generated.
  • Specifically, the hash function generation unit 131 randomly selects one word from the word set (S35), accepts input of the inter-word distance information indicating the distances between words (S36), and refers to the inter-word distance information for the distances between the selected word and the other words (S37).
  • Next, the hash function generation unit 131 sorts the words in the word set in order of proximity to the selected word (S38), and assigns to each sorted word, as its hash value, an integer value that increases in the order of arrangement (S39).
  • the hash function generation unit 131 outputs a plurality of hash functions 123a obtained by repeating the processes of S34 to S39 k times and stores them in the index generation unit 132 (S40).
  • FIG. 4 is an explanatory diagram illustrating the hash functions 123a. As shown in FIG. 4, each of h1, h2, ... in the hash functions 123a is one hash function.
  • h1 uses (cat) as a reference word, and an integer value corresponding to the inter-word distance with respect to (cat) is assigned to each word in the word set.
  • h2 uses (rice) as a reference word, and an integer value corresponding to the inter-word distance with respect to (rice) is assigned to each word in the word set.
  • h3 uses (nyanko) as a reference word, and an integer value corresponding to the inter-word distance with respect to (nyanko) is assigned to each word in the word set.
  • h4 uses (food) as a reference word, and an integer value corresponding to the inter-word distance with respect to (food) is assigned to each word in the word set.
  • h5 uses (flower) as a reference word, and an integer value corresponding to the inter-word distance with respect to (flower) is assigned to each word in the word set. Further, h6 uses (water) as a reference word, and an integer value corresponding to the inter-word distance with respect to (water) is assigned to each word in the word set.
  • For example, the word set of document A is {cat, rice}, and the word set of document B is {nyanko, food}; a worked illustration using these two documents follows.
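As a worked illustration of why documents A and B obtain close hash values even though they share no words, the following uses hypothetical integer assignments consistent with the orderings described for h1 (reference: cat) and h2 (reference: rice); the concrete numbers are assumptions, not values read from FIG. 4.

```python
# Hypothetical integer assignments: values grow with semantic distance
# to each reference word.
h1 = {"cat": 0, "nyanko": 1, "rice": 5, "food": 6}   # reference word: cat
h2 = {"rice": 0, "food": 1, "cat": 5, "nyanko": 6}   # reference word: rice

def min_hash(word_set, h):
    return min(h[w] for w in word_set)

doc_a = {"cat", "rice"}     # word set of document A
doc_b = {"nyanko", "food"}  # word set of document B

vec_a = [min_hash(doc_a, h) for h in (h1, h2)]   # -> [0, 0]
vec_b = [min_hash(doc_b, h) for h in (h1, h2)]   # -> [1, 1]

# No element matches (Hamming distance 2), yet the Euclidean distance is only
# about 1.41, so documents A and B are still recognized as close in meaning.
```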
  • Next, the index generation unit 132 generates an index structure 124a based on the input search target documents and the generated hash functions 123a (S14), and stores the generated index structure 124a in the index storage unit 124.
  • the inquiry document 101 received by the inquiry receiving unit 133 is input to the hash value calculating unit 134 (S21).
  • Next, the hash value calculation unit 134 generates a plurality of hash values for the input inquiry document 101 based on the plurality of hash functions stored in the hash function storage unit 123 (S22), and obtains a vector corresponding to the inquiry document 101.
  • Next, the search unit 135 collates the hash values of the vectors of the search target documents in the index structure stored in the index storage unit 124 with the hash values of the vector of the inquiry document 101 (S23), and searches for the search target document most similar to the inquiry document 101.
  • the output unit 136 outputs the similar document 102 searched by the search unit 135 (S24).
  • FIG. 5 is an explanatory diagram for explaining the degree of similarity between words.
  • FIG. 5 is a bird's-eye view of the arrangement of words W1 to W6 in a high-dimensional space showing the degree of similarity between words.
  • Each of the words W1 to W6 in FIG. 5 indicates a word contained in the document.
  • the words W1 to W3 are similar to words such as "cat” and form clusters shown by dotted lines.
  • The words W4 to W6 are similar to words such as "dog" and form a cluster separate from the words W1 to W3.
  • A projection based on the distance from the reference word is used.
  • FIG. 6 is an explanatory diagram for explaining the hash value by Min hash.
  • the word W1 is "cat”
  • the word W2 is "nyanko”
  • the word W3 is “cat”
  • the word W4 is "dog”
  • the word W5 is “mouse”
  • the word W6 is "dog”
  • the word W7 is "dog". Further, it is assumed that the reference point is the word W1.
  • In Min hash, the minimum value among the hash values of the words in the word set is taken as the hash value.
  • FIG. 7 is an explanatory diagram illustrating an outline of the operation of the information processing apparatus 100 according to the embodiment.
  • In the pre-processing, the information processing apparatus 100 generates, for the set of words included in each of the search target documents 121a of the search target document database 121, a plurality of hash functions that assign closer values to words whose meanings are closer to a predetermined reference word, based on the similarity between words given by the inter-word distance information.
  • Next, the information processing apparatus 100 converts each of the search target documents 121a by Min hash using the generated plurality of hash functions, generates an index structure for searching the calculated vectors, and stores it in the index storage unit 124.
  • In the search process, the information processing apparatus 100 converts the input inquiry document 101 by Min hash using the same plurality of hash functions generated in S1. Next, the information processing apparatus 100 searches for the search target document 121a most similar to the inquiry document 101 by comparing the hash-value vector calculated from the inquiry document 101 with the hash-value vectors stored in the index storage unit 124.
  • The output unit 136 may output only the one similar document 102 that is most similar among the searched search target documents, or may output a plurality of similar documents 102 whose similarity, obtained from the Hamming distance and the Euclidean distance, is equal to or higher than a predetermined threshold.
  • FIG. 8 is an explanatory diagram for explaining the narrowing down of the search target documents. Specifically, FIG. 8 is a graph showing an example of the relationship between the threshold value of similarity and the number of hits of the search target document 121a.
  • Graph 10 shows the relationship between the threshold value of the similarity between the inquiry document 101 and the search target document 121a and the number (hits) of the search target document 121a whose similarity is greater than the threshold value.
  • the similarity is the number of dimensions in which the hash values match between the vector of the inquiry utterance and the vector of the search target utterance, and the Euclidean distance between the hash values.
  • the similarity between the inquiry document 101 and each search target document 121a takes into account the Euclidean distance. Therefore, the relationship between the similarity threshold and the number of hits is continuous as shown in curve 13. As a result, the search target document 121a similar to the inquiry document 101 can be efficiently narrowed down.
  • FIG. 9 is an explanatory diagram showing a display example of the operation screen.
  • The operation screen 20 is a screen that is presented to the user of the information processing apparatus 100 in a dialogue interface such as a chatbot and accepts various operations from the user.
  • the display area 21 is an area for displaying a processing result or the like
  • the input area 22 is an area for inputting a document or the like.
  • the information processing apparatus 100 searches for a similar document 102 with respect to the inquiry document 101 input to the input area 22, and outputs an output according to the search result to the display area 21.
  • For example, the information processing apparatus 100 searches for a similar document 102 using the input document 21a "I want to adjust the schedule" as the inquiry document 101, and displays the search result 21b in the display area 21. Specifically, for the inquiry document 101 "I want to adjust the schedule", the schedule registration process corresponding to the "schedule adjustment" obtained as the similar document 102 from the search target documents 121a is executed.
  • In this way, even when the inquiry document 101 is mainly spoken language with short sentences and diverse expressions, as in a dialogue interface such as the chatbot in the example of FIG. 9, the information processing apparatus 100 can appropriately search for a similar document 102.
  • the information processing apparatus 100 includes a hash function generation unit 131, an index generation unit 132, a hash value calculation unit 134, and a search unit 135.
  • The hash function generation unit 131 generates, based on the set of words included in the search target documents 121a and the inter-word distance information indicating the closeness of the meanings of the words, a hash function that assigns, to each word included in the word set, a closer value the closer the word's meaning is to a predetermined reference word.
  • the index generation unit 132 calculates the summary information of each of the plurality of search target documents 121a based on the generated hash function.
  • the hash value calculation unit 134 calculates the summary information of the input document (inquiry document 101) based on the generated hash function.
  • the search unit 135 searches a plurality of search target documents 121a for documents similar to the input document based on the comparison between the calculated summary information of the search target document 121a and the summary information of the inquiry document 101.
  • In this way, the hash values included in the summary information of the search target documents 121a and the summary information of the inquiry document 101 calculated with the generated hash functions are values that depend on the semantic distance of each word to the predetermined reference word. Therefore, when comparing the summary information of a search target document 121a with the summary information of the inquiry document 101, the information processing apparatus 100 can verify similarity using not only the Hamming distance between the summary information but also, for example, the Euclidean distance, and the search accuracy for similar documents can be improved.
  • Each of the index generation unit 132 and the hash value calculation unit 134 calculates, as a hash value, the minimum value among the values assigned by a generated hash function to the words in the set of words included in the search target document 121a or the inquiry document 101. In this way, the information processing apparatus 100 can calculate the summary information of the search target document 121a or the inquiry document 101 with a Min-hash function.
  • The hash function generation unit 131 generates a plurality of hash functions by repeating the process of reselecting the reference word and generating a hash function. Each of the index generation unit 132 and the hash value calculation unit 134 then calculates, as the summary information, a vector containing the plurality of hash values calculated by the generated hash functions from the set of words included in the search target document 121a or the inquiry document 101.
  • As a result, in comparing the summary information of a search target document 121a with the summary information of the inquiry document 101, the information processing apparatus 100 can obtain, for example, the Hamming distance and the Euclidean distance by comparing the vectors enumerating the plurality of hash values, and can thereby verify the degree of similarity.
  • the index generation unit 132 of the information processing apparatus 100 generates index information in which the calculated summary information of each of the plurality of search target documents 121a is associated with the search target document 121a.
  • the search unit 135 compares the summary information of the inquiry document 101 with the summary information associated with the search target document 121a in the index information. By generating index information, the information processing apparatus 100 can quickly search for a document similar to the inquiry document 101 from among a plurality of search target documents 121a.
  • Each component of each illustrated device does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
  • In the above embodiment, the information processing device 100 having the index generation module 111 and the search module 112 is illustrated, but the index generation module 111 and the search module 112 may be provided in different information processing devices. That is, the pre-processing (S1) and the search processing (S2) may be performed by different information processing devices.
  • All or any part of the various processing functions performed by the information processing apparatus 100 may be executed on a CPU (or a microcomputer such as an MPU or an MCU (Micro Controller Unit)). Needless to say, all or any part of the various processing functions may also be executed on a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU) or on hardware using wired logic. Further, the various processing functions performed by the information processing apparatus 100 may be executed by a plurality of computers in cooperation through cloud computing.
  • FIG. 10 is a diagram showing an example of a computer that executes a program.
  • the computer 1 has a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input, a monitor 203, and a speaker 204. Further, the computer 1 has a medium reading device 205 for reading a program or the like from a storage medium, an interface device 206 for connecting to various devices, and a communication device 207 for communicating with an external device by wire or wirelessly. Further, the computer 1 has a RAM 208 for temporarily storing various information and a hard disk device 209. Further, each part (201 to 209) in the computer 1 is connected to the bus 210.
  • the hard disk device 209 stores a program 211 for executing various processes described in the above embodiment. Further, the hard disk device 209 stores various data 212 (for example, information of the inter-word distance information storage unit 122, the search target document database 121, the hash function storage unit 123, and the index storage unit 124) referred to by the program 211.
  • the input device 202 receives, for example, input of operation information from the operator of the computer 1.
  • the monitor 203 displays, for example, various screens operated by the operator. For example, a printing device or the like is connected to the interface device 206.
  • the communication device 207 is connected to a communication network such as a LAN (Local Area Network), and exchanges various information with an external device via the communication network.
  • The CPU 201 reads the program 211 stored in the hard disk device 209, loads it into the RAM 208, and executes it, thereby performing the various processes related to the hash function generation unit 131, the index generation unit 132, the inquiry receiving unit 133, the hash value calculation unit 134, the search unit 135, and the output unit 136.
  • the program 211 may not be stored in the hard disk device 209.
  • the computer 1 may read and execute the program 211 stored in the storage medium that can be read by the computer 1.
  • the storage medium that can be read by the computer 1 corresponds to, for example, a CD-ROM, a DVD disk, a portable recording medium such as a USB (Universal Serial Bus) memory, a semiconductor memory such as a flash memory, a hard disk drive, or the like.
  • the program 211 may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 1 may read the program from these and execute the program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In a similar document retrieval method according to an embodiment, a computer executes a generation process, a computation process, and a retrieval process. The generation process generates a hash function for allocating a value to each word included in a set of words in a retrievable document on the basis of the set of words and word interval information indicating word meaning closeness, said hash function allocating a closer value with reference to a specific word in the order of closeness to said word. The computation process computes summary information for each of a plurality of retrievable documents on the basis of the generated hash function and computes summary information for an inputted document on the basis of the generated hash function. The retrieval process retrieves a document similar to the inputted document among the plurality of retrievable documents on the basis of a comparison between the computed summary information for the retrievable documents and the computed summary information for the inputted document.

Description

Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device
 An embodiment of the present invention relates to a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device.
 Conventionally, as one form of natural language processing performed by a computer, there is a search process for searching, among the documents stored in a database, for a document similar to an input document. For example, a sample inquiry document and a response document corresponding to the sample are registered in the database. It is then conceivable to construct a dialogue interface such as a chatbot that searches for a sample similar to an input inquiry document and outputs the response document corresponding to the similar sample.
 As a method of searching for a sample similar to the input document among the samples registered in the database, there is a method of evaluating the commonality of the words appearing in two documents. For example, the following search method can be considered. A plurality of hash functions (sometimes called Min-hash functions), each of which calculates one hash value from the word set included in a document, are defined. Each hash function has a correspondence in which different values are associated with different words, and outputs, as the hash value, the smallest value among the values corresponding to the words included in a given word set. For each sample registered in the database, a vector enumerating the plurality of hash values calculated using these hash functions is generated in advance. Then, a vector is calculated in the same way from the word set included in the input document and the above hash functions, and a vector close to a sample vector registered in the database is searched for.
JP-A-2018-173909
 However, in the above-described prior art, when the input document is short, or when the expressions used in documents are so diverse that few of them match the samples, the commonality of words is evaluated as low and the search accuracy deteriorates. For example, the input document of a chatbot is mainly spoken language, and since each sentence is short and the expressions are diverse, the probability that a common word appears in both the input document and a sample is low overall. As a result, the accuracy of the sample search for the input document tends to be low.
 In one aspect, an object of the present invention is to provide a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device that improve the search accuracy for similar documents.
 In one proposal, in the similar document search method, a computer executes a generating process, a calculating process, and a searching process. The generating process generates, based on a set of words included in the search target documents and inter-word information indicating the closeness of the meanings of words, a hash function that assigns, to each word included in the set of words, a value that is closer the closer the word's meaning is to a predetermined reference word. The calculating process calculates the summary information of each of a plurality of search target documents based on the generated hash function, and calculates the summary information of an input document based on the generated hash function. The searching process searches for a document similar to the input document from among the plurality of search target documents based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.
 In one aspect, the search accuracy for similar documents is improved.
FIG. 1 is a block diagram showing a functional configuration example of the information processing apparatus according to the embodiment.
FIG. 2 is an explanatory diagram showing an example of a processing flow of the information processing apparatus according to the embodiment.
FIG. 3 is a flowchart showing an example of the hash function generation process.
FIG. 4 is an explanatory diagram illustrating a hash function.
FIG. 5 is an explanatory diagram illustrating the degree of similarity between words.
FIG. 6 is an explanatory diagram illustrating a hash value by Min hash.
FIG. 7 is an explanatory diagram illustrating an outline of the operation of the information processing apparatus according to the embodiment.
FIG. 8 is an explanatory diagram for explaining the narrowing down of the search target documents.
FIG. 9 is an explanatory diagram showing a display example of the operation screen.
FIG. 10 is a diagram showing an example of a computer that executes a program.
 Hereinafter, the similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device according to the embodiment will be described with reference to the drawings. Configurations having the same function in the embodiment are denoted by the same reference numerals, and duplicate description is omitted. The similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device described in the following embodiments are merely examples and do not limit the embodiments. In addition, the following embodiments may be combined as appropriate within a consistent range.
[Overview]
 In the present embodiment, an information processing device that searches, from among the samples (hereinafter also referred to as search target documents) registered in a database, for a document similar to an input document (hereinafter also referred to as an inquiry document) is illustrated.
 In this information processing device, pre-processing is performed first, and then search processing is performed. In the pre-processing, a hash value is calculated for each search target document using the hash functions, and an index structure (for example, a search tree) such as a tree structure for searching the search target documents is created from the calculated hash values.
 In the search process, the hash values of the inquiry document are calculated using the hash functions. Next, the information processing apparatus compares the hash values of the inquiry document with the hash values of each search target document, and searches among the hash values of the search target documents indicated by the index structure for the closest ones. The information processing apparatus then takes a search target document whose hash values are close to those of the inquiry document as the search result of a similar document.
 A plurality of hash functions for calculating the hash values are defined based on the set of words obtained by extracting the words included in the search target documents registered in the database. Specifically, the information processing device generates a plurality of hash functions in advance, where W is the set of words and h is the set of all injective functions from W to {0, 1, ..., |W|-1}.
 The information processing device performs the following processing on the search target documents and the inquiry document to obtain a vector of a plurality of hash values, that is, summary information summarizing the search target documents and the inquiry document with the hash functions.
- Randomly select a hash function from h.
- Extract the words included in the search target document and obtain a set of words.
- Use the smallest integer obtained via the selected h from the words in the set as the hash value.
- Obtain a plurality of hash values by repeating the random selection of a function multiple times.
 Documents that are similar to each other have a high appearance ratio of common words (Jaccard coefficient). For a hash function selected from h, the probability that the hash values match equals the Jaccard coefficient. Therefore, in comparing the hash values of documents when finding similar documents, the Jaccard coefficient can be estimated probabilistically from the ratio of matching vector elements, and the Hamming distance between the hash-value vectors (for example, the number of mismatches) reflects the closeness (similarity) between documents.
 Let n be the number of search target documents, m the average number of words per search target document, and k the number of hash functions. Then the amount of calculation for computing the vectors of the search target documents is O(kmn). The amount of calculation for computing the vector of one inquiry document is O(km), and the calculation cost of the neighborhood search using this vector is O(log(n)). The k hash functions are randomly generated in advance. When k is O(log(n)), the collision probability that the same hash value is calculated from different search target documents can be made sufficiently small.
 However, in similarity verification by the Hamming distance, the difference between the elements of the vectors has no meaning, and when elements do not match (when no common word appears), the similarity is 0. For this reason, when few common words appear, the accuracy of the similarity search tends to be low.
 Therefore, in the present embodiment, the information processing device generates hash functions that assign closer values to words whose meanings are closer to a predetermined reference word, based on inter-word distance information indicating the closeness of word meanings. Specifically, the information processing device extracts the words included in the search target documents to form a word set and randomly selects a reference word. Next, based on the inter-word distance information, it sorts the words in the word set in order of how close their meanings are to the reference word and assigns a unique value in the sort order (for example, an integer value that increases in the sort order). The information processing device generates a plurality of hash functions by repeating this processing.
 For example, when the word set is {cat, rice, ..., nyanko, food} and the reference word is {cat}, the information processing device sorts the words in order of closeness of meaning to {cat}. This yields {cat, nyanko, ..., rice, food}. For the word set sorted in this way, the information processing device assigns integer values in the sort order, such as {cat = 0, nyanko = 1, ..., rice = 5, food = 6}.
 The hash value generated by such a hash function is a value corresponding to the semantic distance (also referred to as the inter-word distance) to the predetermined reference word. Therefore, similarity can be verified not only by the Hamming distance between hash values but also by the Euclidean distance. As a result, even when there is no common word and the hash values do not match (when the elements differ), the similarity can still be verified by the Euclidean distance, and the accuracy of the similarity search can be improved.
[Configuration example]
FIG. 1 is a block diagram showing an example of the functional configuration of the information processing device according to the embodiment. As shown in FIG. 1, the information processing device 100 is a device that, for an input inquiry document 101, obtains a similar document 102 from among the search target documents stored in the search target document database 121. A personal computer, for example, can be used as the information processing device 100.
The information processing device 100 includes an index generation module 111 that performs the preprocessing for creating an index structure, a search module 112 that performs the search processing, and storage units that store various data related to these processes (a search target document database 121, an inter-word distance information storage unit 122, a hash function storage unit 123, and an index storage unit 124). That is, the information processing device 100 is an example of both a similar document retrieval device and an index information creation device.
The search target document database 121 is a database in which the documents to be searched for the inquiry document 101 are registered. The search target documents in the search target document database 121 may be registered in advance, may be added through dialogue in a dialogue interface such as a chatbot with a user of the information processing device 100, or may be collected automatically from a network.
The inter-word distance information storage unit 122 stores, for each word, inter-word distance information representing the closeness in meaning (the inter-word distance) to other words. Specifically, the inter-word distance information is, for example, a function d(v, w) representing the semantic distance (inter-word distance) between a word v and another word w. That is, the inter-word distance information is an example of inter-word information indicating the closeness of word meanings.
The hash function storage unit 123 stores a plurality of different hash functions generated by the index generation module 111. Each hash function has a correspondence that associates a unique integer with each word that can appear in the search target documents (in order of semantic closeness to a predetermined word), accepts a word set, and outputs a single hash value. Different hash functions have different correspondences.
The index storage unit 124 stores an index structure for retrieving search target documents similar to the inquiry document 101. The index structure is generated by the index generation unit 132 based on the vectors (summary information) calculated from the search target documents using the plurality of hash functions, and is an example of index information for searching the summary information of each search target document.
As the index structure, for example, a search tree can be used. The search tree contains a plurality of nodes connected in a tree structure (leaf nodes and the nodes leading to them). A leaf node of the search tree points to a search target document. For example, a leaf node contains the vector of a search target document and identification information (for example, a document ID) identifying that document. However, a leaf node need not contain the vector.
Two child nodes are connected to each node other than a leaf node. Each non-leaf node holds a threshold for a specific dimension of the vector. If the hash value of that dimension in the input vector is greater than or equal to the threshold, the traversal proceeds to the right child node; if it is less than the threshold, the traversal proceeds to the left child node. By tracing the search tree from the root node toward a leaf node in this way, search target documents close to a given vector can be retrieved efficiently.
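A minimal Python sketch of this traversal, assuming a simple dictionary-based node layout (a leaf holds a document ID and its vector; an internal node holds a split dimension, a threshold, and two children), is shown below. The layout and the example tree are assumptions for illustration, since the embodiment does not prescribe a concrete data structure.

```python
# Hypothetical node layout:
#   leaf node:     {"doc_id": ..., "vector": [...]}
#   internal node: {"dim": d, "threshold": t, "left": ..., "right": ...}

def descend(node, query_vector):
    """Trace the search tree from the root toward a leaf: at each internal
    node, go right if the hash value of the node's dimension is greater than
    or equal to the threshold, otherwise go left."""
    while "doc_id" not in node:
        if query_vector[node["dim"]] >= node["threshold"]:
            node = node["right"]
        else:
            node = node["left"]
    return node  # leaf pointing to a search target document

# Tiny example tree with two search target documents.
tree = {
    "dim": 0, "threshold": 2,
    "left": {"doc_id": "A", "vector": [0, 0, 1, 1, 3, 3]},
    "right": {"doc_id": "C", "vector": [4, 2, 4, 2, 0, 0]},
}
print(descend(tree, [1, 1, 0, 0, 2, 2])["doc_id"])  # dimension 0 is 1 < 2, so "A"
```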
The index generation module 111 includes a hash function generation unit 131 and an index generation unit 132.
The hash function generation unit 131 generates a plurality of hash functions based on the search target documents stored in the search target document database 121 and the inter-word distance information stored in the inter-word distance information storage unit 122, and stores the generated hash functions in the hash function storage unit 123.
Specifically, the hash function generation unit 131 extracts the words contained in the search target documents to obtain a set of words, and randomly selects a reference word. Next, referring to the inter-word distance information in the inter-word distance information storage unit 122, the hash function generation unit 131 sorts the words in the set in order of semantic closeness to the reference word. The hash function generation unit 131 then assigns a unique value to each word in the set in sort order (for example, integer values that increase in sort order), thereby generating a hash function. By repeating this hash function generation process, the hash function generation unit 131 generates a plurality of hash functions.
For example, assume that a function d(v, w) representing the semantic distance (inter-word distance) between a word v and another word w is given as the inter-word distance information. This function d(v, w) can be created in advance by referring to, for example, similarities between words or vector representations of words. The function d(v, w) satisfies the following:
• For any v, w ∈ W, 0 ≤ d(v, w).
• For any w ∈ W, d(w, w) = 0.
For the hash function, it is assumed that for any u, v ∈ W there exists w ∈ W such that h(u) > h(v) ⇔ d(u, w) > d(v, w).
The hash function generation unit 131 randomly selects w from W and sorts all v ∈ W in ascending order of d(v, w). Next, the hash function generation unit 131 assigns the integers 0, 1, 2, … to the sorted words w, v1, v2, …. Instead of assigning integers, the hash function generation unit 131 may use the values d(v, w) directly (assuming they contain no duplicates).
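The following Python sketch shows one way to realize this generation step, assuming the inter-word distance information is available as a function d(v, w): pick a reference word at random, sort the vocabulary by distance to it, and assign the ranks 0, 1, 2, … in that order. The distance table, the vocabulary, and all names are illustrative assumptions rather than part of the embodiment.

```python
import random

def generate_hash_function(vocabulary, d, rng):
    """Build one hash function as a mapping word -> integer rank: words are
    sorted by their distance d(v, w) to a randomly chosen reference word w,
    so semantically close words receive close values."""
    w = rng.choice(sorted(vocabulary))
    ordered = sorted(vocabulary, key=lambda v: d(v, w))
    return {v: rank for rank, v in enumerate(ordered)}

def generate_hash_functions(vocabulary, d, k, seed=0):
    """Generate k hash functions, each with an independently chosen reference word."""
    rng = random.Random(seed)
    return [generate_hash_function(vocabulary, d, rng) for _ in range(k)]

# Hypothetical distance table standing in for the inter-word distance information.
_table = {
    ("cat", "kitty"): 0.1, ("cat", "rice"): 0.8, ("cat", "feed"): 0.7,
    ("kitty", "rice"): 0.8, ("kitty", "feed"): 0.7, ("rice", "feed"): 0.2,
}

def d(v, w):
    return 0.0 if v == w else _table.get((v, w), _table.get((w, v)))

hash_functions = generate_hash_functions({"cat", "kitty", "rice", "feed"}, d, k=3)
print(hash_functions[0])  # e.g. {"kitty": 0, "cat": 1, "feed": 2, "rice": 3}
```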
The index generation unit 132 generates an index structure based on the search target documents stored in the search target document database 121 and the hash functions stored in the hash function storage unit 123, and stores the generated index structure in the index storage unit 124.
Specifically, the index generation unit 132 extracts a word set for each search target document and inputs the extracted word set to each of the plurality of hash functions to calculate a vector of hash values, that is, the summary information of the search target document. Next, the index generation unit 132 generates an index structure so that the plurality of vectors corresponding to the plurality of search target documents can be searched efficiently. For example, the index generation unit 132 generates a search tree by repeatedly focusing on one dimension of the vectors and determining a threshold for the hash value of that dimension so that the set of vectors is split in two. At this time, the index generation unit 132 generates intermediate nodes so that, as far as possible, each leaf node of the search tree is associated with a single vector.
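A minimal sketch of this calculation of the summary information is shown below: for each hash function, the value of a document is the minimum value assigned to any of its words (Min-hash). The hash functions are written directly as word-to-rank mappings, and the word sets are simplified examples; both are assumptions for illustration.

```python
def document_vector(words, hash_functions):
    """Summary information of a document: for each hash function, the minimum
    value assigned to any of the document's words (Min-hash)."""
    return [min(h[w] for w in words if w in h) for h in hash_functions]

# Hypothetical hash functions (word -> rank), e.g. as produced by the
# generation step sketched above.
hash_functions = [
    {"cat": 0, "kitty": 1, "rice": 5, "feed": 6},  # reference word: "cat"
    {"rice": 0, "feed": 1, "cat": 5, "kitty": 6},  # reference word: "rice"
]

print(document_vector({"cat", "rice"}, hash_functions))    # -> [0, 0]
print(document_vector({"kitty", "feed"}, hash_functions))  # -> [1, 1]
```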
The search module 112 includes an inquiry receiving unit 133, a hash value calculation unit 134, a search unit 135, and an output unit 136.
The inquiry receiving unit 133 receives the inquiry document 101. The inquiry receiving unit 133 may receive the inquiry document 101 input as a character string by the user, or may convert the audio signal of an inquiry spoken by the user into a character string. The inquiry receiving unit 133 may also receive a character string or an audio signal from another information processing device.
The hash value calculation unit 134 generates the vector corresponding to the inquiry document 101, that is, the summary information of the inquiry document 101, based on the plurality of hash functions stored in the hash function storage unit 123. Specifically, the hash value calculation unit 134 extracts a word set from the inquiry document 101 and inputs the extracted word set to each of the plurality of hash functions to calculate a vector of hash values.
The search unit 135 searches, by neighborhood search, for the search target document most similar to the inquiry document 101 based on the index structure stored in the index storage unit 124 and the vector of the inquiry document 101. Specifically, the search target document most similar to the inquiry document 101 is the one whose vector, when the vectors are compared, has the largest number of dimensions with matching hash values.
For example, the search unit 135 traces the search tree stored in the index storage unit 124 from the root node toward a leaf node, comparing the hash value of the relevant dimension of the vector of the inquiry document 101 with the threshold at each node, and reaches a specific leaf node. The search unit 135 selects the search target document corresponding to the reached leaf node.
If two or more search target documents are associated with the reached leaf node, that is, if the number of dimensions with matching hash values is the same and the search tree cannot narrow the candidates down to one, the search unit 135 compares the hash values with each other to obtain the Euclidean distances. The search unit 135 then takes the candidate with the smaller Euclidean distance as the most similar search target document.
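A minimal Python sketch of this tie-break is shown below; the candidate format (document ID plus hash-value vector) and the example values are assumptions for illustration.

```python
import math

def pick_most_similar(query_vector, candidates):
    """Among candidates tied on the number of matching dimensions, choose the
    one whose hash-value vector has the smallest Euclidean distance to the
    query vector."""
    return min(candidates, key=lambda c: math.dist(query_vector, c["vector"]))

candidates = [
    {"doc_id": "A", "vector": [0, 0, 1, 1, 3, 3]},
    {"doc_id": "C", "vector": [4, 2, 4, 2, 0, 0]},
]
print(pick_most_similar([1, 1, 0, 0, 2, 2], candidates)["doc_id"])  # -> "A"
```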
The output unit 136 outputs the retrieved search target document as the similar document 102. For example, the output unit 136 may display the character string of the similar document 102 on a display or the like, or may convert the similar document 102 into an audio signal and reproduce it through a speaker. The output unit 136 may also transmit the character string or audio signal of the similar document 102 to another information processing device.
Instead of outputting the similar document 102, the output unit 136 may execute a process associated in advance, in the search target document database 121, with the retrieved search target document. Specifically, assume that a predetermined process (for example, schedule registration or e-mail transmission) is registered in the search target document database 121 for each search target document. By reading the process associated with the retrieved search target document from the search target document database 121 and executing it, the output unit 136 can perform the process corresponding to the inquiry document 101.
[Operation example]
FIG. 2 is an explanatory diagram showing an example of the processing flow of the information processing device according to the embodiment. As shown in FIG. 2, the information processing device 100 performs preprocessing (S1) for creating the index structure and search processing (S2) for retrieving and outputting a similar document 102 for the inquiry document 101.
First, the preprocessing (S1) will be described. In the preprocessing (S1), the search target documents are first read from the search target document database 121 and input to the hash function generation unit 131 (S11).
The hash function generation unit 131 receives the input of the inter-word distances from the inter-word distance information storage unit 122 (S13), and generates a plurality of hash functions 123a based on the input search target documents and the inter-word distances (S12).
FIG. 3 is a flowchart showing an example of the hash function generation process. As shown in FIG. 3, the hash function generation unit 131 receives the input of the search target documents (S31) and extracts the words (occurring words) contained in them (S32), thereby obtaining a set of words (word set) (S33).
Next, the hash function generation unit 131 generates a plurality of hash functions 123a by repeating the processing of S34 to S39 k times, where k is the number of hash functions to be generated.
Specifically, the hash function generation unit 131 randomly selects one word from the word set (S35), receives the input of the inter-word distance information indicating the distances between words (S36), and looks up the distances between the selected word and the other words in the inter-word distance information (S37). Next, the hash function generation unit 131 sorts the words of the word set in ascending order of distance to the selected word (S38), and assigns to each sorted word, as its hash value, an integer value that increases in sort order (S39).
Next, the hash function generation unit 131 outputs the plurality of hash functions 123a obtained by repeating the processing of S34 to S39 k times and stores them for use by the index generation unit 132 (S40).
FIG. 4 is an explanatory diagram illustrating the hash functions 123a. As shown in FIG. 4, each of h1, h2, … in the hash functions 123a is a single hash function.
For example, h1 uses "cat" as its reference word, and each word in the word set is assigned an integer value according to its inter-word distance to "cat". Likewise, h2 uses "rice" as its reference word, h3 uses "kitty", h4 uses "feed", h5 uses "flower", and h6 uses "water"; in each case, every word in the word set is assigned an integer value according to its inter-word distance to that reference word.
Here, consider obtaining the hash values of the following documents A to C with the hash functions 123a illustrated in FIG. 4.
Document A: "Give the cat some rice"
Document B: "Give the kitty some feed"
Document C: "Water the flowers"
The word set of document A is {cat, rice}, that of document B is {kitty, feed}, and that of document C is {flower, water}. Accordingly, with the hash functions 123a illustrated in FIG. 4, the hash value HA of document A is HA = 001133. Similarly, the hash value HB of document B is HB = 110022, and the hash value HC of document C is HC = 424200.
Comparing the hash values of documents A to C, the Hamming distances are HA, HB: 6; HA, HC: 6; HB, HC: 6. Comparing the hash values and calculating the Euclidean distances gives HA, HB: 1; HA, HC: 6.9; HB, HC: 6.2. Thus, even in a case where similarity is difficult to verify with the Hamming distance (when no common word is contained), documents that are similar to each other (documents A and B in the illustrated example) can be identified with the Euclidean distance.
Returning to FIG. 2, following S12, the index generation unit 132 generates an index structure 124a based on the input search target documents and the generated hash functions 123a (S14), and stores the generated index structure 124a in the index storage unit 124.
In the search processing (S2), the inquiry document 101 received by the inquiry receiving unit 133 is input to the hash value calculation unit 134 (S21). The hash value calculation unit 134 generates a plurality of hash values for the input inquiry document 101 based on the plurality of hash functions stored in the hash function storage unit 123 (S22), and obtains the vector corresponding to the inquiry document 101.
Next, the search unit 135 collates the hash values of the vector of the inquiry document 101 with the vector of each search target document in the index structure stored in the index storage unit 124 (S23), and retrieves the search target document most similar to the inquiry document 101. The output unit 136 outputs the similar document 102 retrieved by the search unit 135 (S24).
FIG. 5 is an explanatory diagram illustrating similarity between words. Specifically, FIG. 5 is an overview of the arrangement of the words W1 to W6 in a high-dimensional space representing similarity between words. Each of the words W1 to W6 in FIG. 5 represents a word contained in a document. Here, the words W1 to W3 are similar to one another, for example words related to "cat", and form the cluster indicated by the dotted line. Similarly, the words W4 to W6 are similar to one another, for example words related to "dog", and form a cluster separate from that of the words W1 to W3.
As shown in FIG. 5, in a high-dimensional space representing similarity between words, it is difficult to evaluate similarity well with a simple projection. For example, with an orthogonal projection onto the axis A1, similar words (the words W1 to W3, or the words W4 to W6) are assigned close values. However, with an orthogonal projection onto a different axis A2, words that are not similar to each other (for example, the words W1 and W4) may become closer than words that are similar to each other (for example, the words W1 and W3).
In the present embodiment, since a reference word is selected at random and Min-hash values are used, a projection based on the distance from the reference word (the reference point) is used.
FIG. 6 is an explanatory diagram illustrating hash values based on Min-hash. Here, the word W1 is 猫 ("cat"), the word W2 is にゃんこ ("kitty"), the word W3 is キャット (the loanword "cat"), the word W4 is 犬 ("dog"), the word W5 is 鼠 ("mouse"), the word W6 is ドッグ (the loanword "dog"), and the word W7 is わんこ ("doggy"). The reference point is the word W1.
As shown in FIG. 6, based on the word W1 as the reference point, the hash function 123a assigns the values 1: 猫, 2: にゃんこ, 3: キャット, 4: 犬, 5: 鼠, 6: ドッグ, 7: わんこ. With a distance from the reference point, the ordering of similarity is preserved near the reference point (for example, 1: 猫, 2: にゃんこ, 3: キャット). Far from the reference point, however (for example, 5: 鼠, 6: ドッグ), dissimilar words may be assigned close values.
In the present embodiment, since the minimum of the hash values over the word set is used (Min-hash), the projection is an appropriate one that can express word similarity while preserving the ordering of similarities.
FIG. 7 is an explanatory diagram outlining the operation of the information processing device 100 according to the embodiment. As shown in FIG. 7, in the preprocessing (S1), the information processing device 100 generates, for the set of words contained in the search target documents 121a of the search target document database 121, a plurality of hash functions that assign values in order of semantic closeness to a predetermined word, based on the inter-word similarity given by the inter-word distance information. Next, the information processing device 100 transforms each search target document 121a by Min-hash using the generated plurality of hash functions, generates an index structure for searching the calculated vectors, and stores it in the index storage unit 124.
In the search processing (S2), the information processing device 100 transforms the input inquiry document 101 by Min-hash using the same plurality of hash functions generated in S1. Next, the information processing device 100 compares the vector of hash values calculated from the inquiry document 101 with the vectors of hash values stored in the index storage unit 124, thereby retrieving the search target document 121a most similar to the inquiry document 101.
In the illustrated example, for the inquiry document 101 「会議を調整したい」 ("I want to arrange a meeting"), the search target documents 121a 「スケジュールの調整」 ("arranging a schedule") and 「打ち合わせの調整」 ("arranging a meeting") both have a matching hash value for the word 「調整」 ("arrange"), so their Hamming distances are small. Here, for the word 「会議」 ("meeting") contained in the inquiry document 101, 「打ち合わせ」 ("meeting") is semantically closer than 「スケジュール」 ("schedule"), so its Euclidean distance is smaller. Therefore, for the inquiry document 101 「会議を調整したい」, the similar document 102 「打ち合わせを調整したい」 ("I want to arrange a meeting") is obtained.
The output unit 136 may output the single most similar document 102 among the retrieved search target documents, or may output a plurality of similar documents 102 whose similarity obtained from the Hamming distance and the Euclidean distance is equal to or higher than a predetermined threshold.
FIG. 8 is an explanatory diagram illustrating the narrowing down of search target documents. Specifically, FIG. 8 is a graph showing an example of the relationship between the similarity threshold and the number of hits among the search target documents 121a.
The graph 10 shows the relationship between the threshold on the similarity between the inquiry document 101 and the search target documents 121a and the number of search target documents 121a whose similarity exceeds the threshold (the number of hits). The similarity combines the number of dimensions in which the hash values match between the vector of the inquiry utterance and the vector of the search target utterance with the Euclidean distance between the hash values.
(A) With a method that calculates the vectors without taking related words into account, the similarity between the inquiry document 101 and each search target document 121a is calculated to be low overall. The relationship between the similarity threshold and the number of hits therefore follows the curve 11. That is, even if the similarity threshold is set low, the number of hits is small, and many similar search target documents 121a are missed.
(B) With a method that treats related words as identical when calculating the vectors, the similarity between the inquiry document 101 and each search target document 121a is calculated to be high overall. The relationship between the similarity threshold and the number of hits therefore follows the curve 12. That is, even if the similarity threshold is set high, the number of hits is large, and it is difficult to narrow down the similar search target documents 121a efficiently.
(C) With the method of the present embodiment, the similarity between the inquiry document 101 and each search target document 121a takes the Euclidean distance into account. The relationship between the similarity threshold and the number of hits is therefore continuous, as shown by the curve 13. As a result, the search target documents 121a similar to the inquiry document 101 can be narrowed down efficiently.
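The description of the graph 10 above does not fix an exact formula for combining the number of matching dimensions with the Euclidean distance. The following sketch shows one possible combination, chosen purely as an illustrative assumption: the match count is the primary term, and a normalized Euclidean distance (always less than 1) refines it so that the score varies continuously.

```python
import math

def combined_similarity(u, v):
    """Illustrative combined score (larger is more similar): the number of
    matching dimensions, refined by a normalized Euclidean distance that
    always stays below one match."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    distance = math.dist(u, v)
    return matches - distance / (1.0 + distance)

# The pair of hypothetical vectors that is semantically closer scores higher.
print(combined_similarity([0, 0, 1, 1, 3, 3], [1, 1, 0, 0, 2, 2]))
print(combined_similarity([0, 0, 1, 1, 3, 3], [4, 2, 4, 2, 0, 0]))
```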
FIG. 9 is an explanatory diagram showing a display example of the operation screen. The operation screen 20 is a screen that is presented to the user of the information processing device 100 in a dialogue interface such as a chatbot and accepts various operations from the user. On the operation screen 20, the display area 21 is an area for displaying processing results and the like, and the input area 22 is an area for inputting documents and the like. For example, the information processing device 100 searches for a similar document 102 for the inquiry document 101 input in the input area 22, and displays output corresponding to the search result in the display area 21.
For example, the information processing device 100 searches for a similar document 102 using, as the inquiry document 101, the input document 21a "I want to adjust my schedule", and displays the search result 21b in the display area 21. Specifically, for this inquiry document 101, the schedule registration process corresponding to "adjusting the schedule", obtained as the similar document 102 from among the search target documents 121a, is executed.
With the information processing device 100, even when the inquiry documents 101 are mainly spoken language, with short sentences and diverse expressions, as in a dialogue interface such as the chatbot in the example of FIG. 9, similar documents 102 can be retrieved appropriately.
[Effects]
As described above, the information processing device 100 includes the hash function generation unit 131, the index generation unit 132, the hash value calculation unit 134, and the search unit 135. The hash function generation unit 131 generates, based on the set of words contained in the search target documents 121a and the inter-word distance information indicating the closeness of word meanings, a hash function that assigns, to each word in the set, a value such that words semantically closer to a predetermined word receive closer values. The index generation unit 132 calculates the summary information of each of the plurality of search target documents 121a based on the generated hash functions. The hash value calculation unit 134 calculates the summary information of the input document (the inquiry document 101) based on the generated hash functions. The search unit 135 searches the plurality of search target documents 121a for a document similar to the input document based on a comparison between the calculated summary information of the search target documents 121a and the summary information of the inquiry document 101.
Consequently, in the information processing device 100, the hash values contained in the summary information of the search target documents 121a and in the summary information of the inquiry document 101, both calculated with the generated hash functions, are values that reflect the semantic distance to a predetermined word. Therefore, when comparing the summary information of a search target document 121a with the summary information of the inquiry document 101, the information processing device 100 can verify the degree of similarity using not only the Hamming distance between the pieces of summary information but also the Euclidean distance, which improves the accuracy of retrieving similar documents.
Each of the index generation unit 132 and the hash value calculation unit 134 calculates, as a hash value, the minimum of the values assigned by a generated hash function to the words of the word set contained in the search target document 121a or the inquiry document 101. In this way, the information processing device 100 can calculate the summary information of the search target documents 121a and the inquiry document 101 with Min-hash functions.
The hash function generation unit 131 generates a plurality of hash functions by repeating the process of reselecting the predetermined word and generating a hash function. Each of the index generation unit 132 and the hash value calculation unit 134 then calculates, as the summary information, a vector containing the plurality of hash values calculated by the generated plurality of hash functions from the word set contained in the search target document 121a or the inquiry document 101. Thus, when comparing the summary information of a search target document 121a with the summary information of the inquiry document 101, the information processing device 100 can verify the degree of similarity by comparing the vectors of hash values, for example by computing the Hamming distance or the Euclidean distance.
The index generation unit 132 of the information processing device 100 also generates index information that associates the calculated summary information of each of the plurality of search target documents 121a with the corresponding search target document 121a. The search unit 135 compares the summary information of the inquiry document 101 with the summary information associated with the search target documents 121a in the index information. By generating the index information in advance, the information processing device 100 can quickly retrieve documents similar to the inquiry document 101 from among the plurality of search target documents 121a.
[Others]
The components of the illustrated devices do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
For example, in the present embodiment, the information processing device 100 having the index generation module 111 and the search module 112 has been illustrated, but the index generation module 111 and the search module 112 may be provided in different information processing devices. That is, the preprocessing (S1) and the search processing (S2) may be performed by different information processing devices.
The various processing functions performed by the information processing device 100 may be executed, in whole or in arbitrary part, on a CPU (or a microcomputer such as an MPU or an MCU (Micro Controller Unit)). It goes without saying that the various processing functions may also be executed, in whole or in arbitrary part, on a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU) or on hardware based on wired logic. The various processing functions performed by the information processing device 100 may also be executed by a plurality of computers cooperating through cloud computing.
The various processes described in the above embodiment can be realized by executing a program prepared in advance on a computer. In the following, an example of a computer (hardware) that executes a program having the same functions as in the above embodiment will be described. FIG. 10 is a diagram showing an example of a computer that executes the program.
As shown in FIG. 10, the computer 1 has a CPU 201 that executes various arithmetic processes, an input device 202 that accepts data input, a monitor 203, and a speaker 204. The computer 1 also has a medium reading device 205 that reads a program and the like from a storage medium, an interface device 206 for connecting to various devices, and a communication device 207 for communicating with external devices by wire or wirelessly. The computer 1 further has a RAM 208 that temporarily stores various information, and a hard disk device 209. The units (201 to 209) in the computer 1 are connected to a bus 210.
The hard disk device 209 stores a program 211 for executing the various processes described in the above embodiment. The hard disk device 209 also stores various data 212 referred to by the program 211 (for example, the information of the inter-word distance information storage unit 122, the search target document database 121, the hash function storage unit 123, and the index storage unit 124). The input device 202 receives, for example, input of operation information from an operator of the computer 1. The monitor 203 displays, for example, various screens operated by the operator. A printing device or the like, for example, is connected to the interface device 206. The communication device 207 is connected to a communication network such as a LAN (Local Area Network) and exchanges various information with external devices via the communication network.
The CPU 201 reads the program 211 stored in the hard disk device 209, loads it into the RAM 208, and executes it, thereby performing the various processes of the hash function generation unit 131, the index generation unit 132, the inquiry receiving unit 133, the hash value calculation unit 134, the search unit 135, and the output unit 136. The program 211 does not have to be stored in the hard disk device 209. For example, the computer 1 may read and execute the program 211 stored in a storage medium readable by the computer 1. Storage media readable by the computer 1 include, for example, portable recording media such as CD-ROMs, DVD disks, and USB (Universal Serial Bus) memories, semiconductor memories such as flash memories, and hard disk drives. The program 211 may also be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 1 may read the program from such a device and execute it.
1 … Computer
10 … Graph
11 to 13 … Curves
20 … Operation screen
21 … Display area
21a … Input document
21b … Search result
22 … Input area
100 … Information processing device
101 … Inquiry document
102 … Similar document
111 … Index generation module
112 … Search module
121 … Search target document database
121a … Search target document
122 … Inter-word distance information storage unit
123 … Hash function storage unit
123a … Hash function
124 … Index storage unit
124a … Index structure
131 … Hash function generation unit
132 … Index generation unit
133 … Inquiry receiving unit
134 … Hash value calculation unit
135 … Search unit
136 … Output unit
201 … CPU
202 … Input device
203 … Monitor
204 … Speaker
205 … Medium reading device
206 … Interface device
207 … Communication device
208 … RAM
209 … Hard disk device
210 … Bus
211 … Program
212 … Various data
A1, A2 … Axes
W1 to W7 … Words

Claims (21)

1. A similar document retrieval method, wherein a computer executes a process comprising:
 generating, based on a set of words contained in search target documents and inter-word information indicating closeness of word meanings, a hash function that assigns, to each of the words contained in the set of words, a value such that words whose meanings are closer to a predetermined word are assigned closer values;
 calculating summary information of each of a plurality of the search target documents based on the generated hash function;
 calculating summary information of an input document based on the generated hash function; and
 searching the plurality of search target documents for a document similar to the input document based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

2. The similar document retrieval method according to claim 1, wherein each of the calculating processes calculates, as a hash value, a minimum value among the values assigned by the generated hash function to the words of the set of words contained in the search target document or the input document.

3. The similar document retrieval method according to claim 2, wherein the generating process generates a plurality of the hash functions by repeating a process of reselecting the predetermined word and generating a hash function, and each of the calculating processes calculates, as the summary information, a vector containing a plurality of hash values calculated by the generated plurality of hash functions from the set of words contained in the search target document or the input document.

4. The similar document retrieval method according to claim 1, wherein the computer further executes a process of generating index information that associates the calculated summary information of each of the plurality of search target documents with the corresponding search target document, and the searching process performs a comparison between the summary information of the input document and the summary information associated with the search target documents in the index information.
5. A similar document retrieval program that causes a computer to execute a process comprising:
 generating, based on a set of words contained in search target documents and inter-word information indicating closeness of word meanings, a hash function that assigns, to each of the words contained in the set of words, a value such that words whose meanings are closer to a predetermined word are assigned closer values;
 calculating summary information of each of a plurality of the search target documents based on the generated hash function;
 calculating summary information of an input document based on the generated hash function; and
 searching the plurality of search target documents for a document similar to the input document based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

6. The similar document retrieval program according to claim 5, wherein each of the calculating processes calculates, as a hash value, a minimum value among the values assigned by the generated hash function to the words of the set of words contained in the search target document or the input document.

7. The similar document retrieval program according to claim 6, wherein the generating process generates a plurality of the hash functions by repeating a process of reselecting the predetermined word and generating a hash function, and each of the calculating processes calculates, as the summary information, a vector containing a plurality of hash values calculated by the generated plurality of hash functions from the set of words contained in the search target document or the input document.

8. The similar document retrieval program according to claim 5, wherein the program further causes the computer to execute a process of generating index information that associates the calculated summary information of each of the plurality of search target documents with the corresponding search target document, and the searching process performs a comparison between the summary information of the input document and the summary information associated with the search target documents in the index information.
9. A similar document retrieval device comprising:
 a hash function generation unit that generates, based on a set of words contained in search target documents and inter-word information indicating closeness of word meanings, a hash function that assigns, to each of the words contained in the set of words, a value such that words whose meanings are closer to a predetermined word are assigned closer values;
 a first calculation unit that calculates summary information of each of a plurality of the search target documents based on the generated hash function;
 a second calculation unit that calculates summary information of an input document based on the generated hash function; and
 a search unit that searches the plurality of search target documents for a document similar to the input document based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

10. The similar document retrieval device according to claim 9, wherein each of the first calculation unit and the second calculation unit calculates, as a hash value, a minimum value among the values assigned by the generated hash function to the words of the set of words contained in the search target document or the input document.

11. The similar document retrieval device according to claim 10, wherein the hash function generation unit generates a plurality of the hash functions by repeating a process of reselecting the predetermined word and generating a hash function, and each of the first calculation unit and the second calculation unit calculates, as the summary information, a vector containing a plurality of hash values calculated by the generated plurality of hash functions from the set of words contained in the search target document or the input document.

12. The similar document retrieval device according to claim 9, further comprising an index information generation unit that generates index information that associates the calculated summary information of each of the plurality of search target documents with the corresponding search target document, wherein the search unit performs a comparison between the summary information of the input document and the summary information associated with the search target documents in the index information.
  13.  Based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, generating a hash function that assigns a value to each word in the set of words, taking a predetermined word as a reference, such that words closer in meaning to the predetermined word receive closer values,
       calculating summary information for each of a plurality of the search target documents based on the generated hash function, and
       generating index information for searching the calculated summary information of each of the plurality of search target documents,
       An index information creation method characterized in that a computer executes the above processing.
  14.  In the calculating process, the smallest of the values assigned by the generated hash function to the respective words in the set of words included in the search target document is calculated as a hash value,
       The index information creation method according to claim 13.
  15.  In the generating process, a plurality of the hash functions are generated by repeating a process of reselecting the predetermined word and generating the hash function, and
       in the calculating process, a vector containing a plurality of hash values calculated from the set of words included in the search target document by the plurality of generated hash functions is calculated as the summary information,
       The index information creation method according to claim 14.
  16.  Based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, generating a hash function that assigns a value to each word in the set of words, taking a predetermined word as a reference, such that words closer in meaning to the predetermined word receive closer values,
       calculating summary information for each of a plurality of the search target documents based on the generated hash function, and
       generating index information for searching the calculated summary information of each of the plurality of search target documents,
       An index information creation program characterized by causing a computer to execute the above processing.
  17.  In the calculating process, the smallest of the values assigned by the generated hash function to the respective words in the set of words included in the search target document is calculated as a hash value,
       The index information creation program according to claim 16.
  18.  In the generating process, a plurality of the hash functions are generated by repeating a process of reselecting the predetermined word and generating the hash function, and
       in the calculating process, a vector containing a plurality of hash values calculated from the set of words included in the search target document by the plurality of generated hash functions is calculated as the summary information,
       The index information creation program according to claim 17.
  19.  A hash function generation unit that, based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, generates a hash function that assigns a value to each word in the set of words, taking a predetermined word as a reference, such that words closer in meaning to the predetermined word receive closer values,
       a calculation unit that calculates summary information for each of a plurality of the search target documents based on the generated hash function, and
       an index information generation unit that generates index information for searching the calculated summary information of each of the plurality of search target documents,
       An index information creation device characterized by comprising these units.
  20.  The calculation unit calculates, as a hash value, the smallest of the values assigned by the generated hash function to the respective words in the set of words included in the search target document,
       The index information creation device according to claim 19.
  21.  The hash function generation unit generates a plurality of the hash functions by repeating a process of reselecting the predetermined word and generating the hash function, and
       the calculation unit calculates, as the summary information, a vector containing a plurality of hash values calculated from the set of words included in the search target document by the plurality of generated hash functions,
       The index information creation device according to claim 20.
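
For illustration only (not part of the claims), the hashing scheme recited in claims 9 to 11 and in the corresponding method, program, and device claims can be sketched in Python as follows: a hash function is obtained by choosing a reference word and ranking the vocabulary by closeness in meaning to that word, a document's hash value is the minimum rank over its words, and repeating the construction with a reselected reference word yields a vector of hash values as the summary information. The sketch assumes that a similarity(a, b) score (for example, cosine similarity of word embeddings) stands in for the inter-word information; all names are hypothetical.

    import random

    def build_hash_functions(vocabulary, similarity, k, seed=0):
        """Build k hash functions. Each function picks a reference word at
        random and assigns a rank to every vocabulary word so that words
        closer in meaning to the reference word receive closer (smaller)
        values. similarity(a, b) is assumed to return a larger score for
        closer meanings."""
        rng = random.Random(seed)
        vocab = list(vocabulary)
        hash_functions = []
        for _ in range(k):
            reference = rng.choice(vocab)
            # Sort the vocabulary by descending similarity to the reference
            # word; a word's position in this ordering is the value the hash
            # function assigns to it (the reference word itself gets 0).
            ranked = sorted(vocab, key=lambda w: -similarity(reference, w))
            hash_functions.append({word: rank for rank, word in enumerate(ranked)})
        return hash_functions

    def summary(words, hash_functions):
        """Summary information of a document: for each hash function, the
        minimum value assigned to any word of the document."""
        return [min((h[w] for w in words if w in h), default=len(h))
                for h in hash_functions]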
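
Likewise for illustration, the index information generation and the comparison of summary information recited in claims 12, 13, 16, and 19 might look like the sketch below, reusing summary() from the previous sketch; the match-ratio scoring and the top_n parameter are assumptions made here and are not prescribed by the disclosure.

    def build_index(target_documents, hash_functions):
        """Index information: associate each search target document with its
        summary information. target_documents maps a document id to the set
        of words contained in that document."""
        return {doc_id: summary(words, hash_functions)
                for doc_id, words in target_documents.items()}

    def search(input_words, index, hash_functions, top_n=5):
        """Compare the input document's summary information with the summary
        information associated with each search target document in the index
        and return the documents whose vectors agree in the most positions."""
        query = summary(input_words, hash_functions)
        scored = [(doc_id, sum(a == b for a, b in zip(query, sig)) / len(query))
                  for doc_id, sig in index.items()]
        scored.sort(key=lambda item: -item[1])
        return scored[:top_n]

Under these assumptions, an application would call build_hash_functions() once over the vocabulary of the search target documents, build_index() over those documents, and search() with the word set of each input document; the fraction of matching positions approximates how much the two word sets overlap in meaning.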
PCT/JP2019/034306 2019-08-30 2019-08-30 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device WO2021038887A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2019/034306 WO2021038887A1 (en) 2019-08-30 2019-08-30 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device
JP2021541969A JP7193000B2 (en) 2019-08-30 2019-08-30 Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/034306 WO2021038887A1 (en) 2019-08-30 2019-08-30 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device

Publications (1)

Publication Number Publication Date
WO2021038887A1 (en) 2021-03-04

Family

ID=74683387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/034306 WO2021038887A1 (en) 2019-08-30 2019-08-30 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device

Country Status (2)

Country Link
JP (1) JP7193000B2 (en)
WO (1) WO2021038887A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010267108A (en) * 2009-05-15 2010-11-25 Nippon Telegr & Teleph Corp <Ntt> Device, method and program for generating document signature for detecting similar document
US20110087669A1 (en) * 2009-10-09 2011-04-14 Stratify, Inc. Composite locality sensitive hash based processing of documents
JP2015201042A (en) * 2014-04-08 2015-11-12 日本電信電話株式会社 Hash function generation method, hash value generation method, device, and program
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof
CN107784110A (en) * 2017-11-03 2018-03-09 北京锐安科技有限公司 A kind of index establishing method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022239174A1 (en) * 2021-05-13 2022-11-17 日本電気株式会社 Similarity degree derivation system and similarity degree derivation method
JP7464193B2 (en) 2021-05-13 2024-04-09 日本電気株式会社 Similarity derivation system and similarity derivation method
WO2024089910A1 (en) * 2022-10-26 2024-05-02 株式会社LegalOn Technologies Information processing method, information processing program, information processing system
CN116302074A (en) * 2023-05-12 2023-06-23 卓望数码技术(深圳)有限公司 Third party component identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
JP7193000B2 (en) 2022-12-20
JPWO2021038887A1 (en) 2021-03-04

Similar Documents

Publication Publication Date Title
WO2021038887A1 (en) Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device
CN1139911C (en) Dynamically configurable acoustic model for speech recognition systems
US11521110B2 (en) Learning apparatus, learning method, and non-transitory computer readable storage medium
JP6019604B2 (en) Speech recognition apparatus, speech recognition method, and program
US8204837B2 (en) Information processing apparatus and method, and program for providing information suitable for a predetermined mood of a user
CN1331467A (en) Method and device for producing acoustics model
JP6004015B2 (en) Learning method, information processing apparatus, and learning program
JP7058574B2 (en) Information processing equipment, information processing methods, and programs
US20190354533A1 (en) Information processing device, information processing method, and non-transitory computer-readable recording medium
WO2008062822A1 (en) Text mining device, text mining method and text mining program
US20240070726A1 (en) Automated generation of creative parameters based on approval feedback
US6944329B2 (en) Information processing method and apparatus
US20230259761A1 (en) Transfer learning system and method for deep neural network
JP2018180459A (en) Speech synthesis system, speech synthesis method, and speech synthesis program
KR102072238B1 (en) question answering system and method based on reliability
JP6976178B2 (en) Extractor, extraction method, and extraction program
JP2018045657A (en) Learning device, program parameter and learning method
KR102011099B1 (en) Method, apparatus, and computer program for selecting music based on image
US7240050B2 (en) Methods and apparatus for generating automated graphics using stored graphics examples
JP2022185799A (en) Information processing program, information processing method and information processing device
JPWO2018066083A1 (en) Learning program, information processing apparatus and learning method
KR20220041336A (en) Graph generation system of recommending significant keywords and extracting core documents and method thereof
CN116049414B (en) Topic description-based text clustering method, electronic equipment and storage medium
JP4705430B2 (en) Language processing device based on the concept of distance
JP2001265788A (en) Method and device for sorting document and recording medium storing document sorting program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943706

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021541969

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943706

Country of ref document: EP

Kind code of ref document: A1