JP6340351B2

JP6340351B2 - Information search device, dictionary creation device, method, and program

Info

Publication number: JP6340351B2
Application number: JP2015197647A
Authority: JP
Inventors: 淳史大塚; 克人別所; 孝中村; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-10-05
Filing date: 2015-10-05
Publication date: 2018-06-06
Anticipated expiration: 2035-10-05
Also published as: JP2017072885A

Description

The present invention relates to an information search device, a dictionary creation device, a method, and a program that take speech or text as input.

Conventional information search systems retrieve documents that match a query entered by a user through processing such as keyword matching. In a keyword-match search, a keyword in the query must match a keyword in the document exactly, which lowers the recall of the search. Query expansion is known as a technique that automatically increases the number of keywords in the query so that the query matches a wider range of documents (Patent Document 1).

Concept search is known as a search method other than the keyword-match type. In concept search, each keyword is represented as an n-dimensional vector of continuous values, and the centroid of the keyword vectors is regarded as the query vector. A document vector is likewise represented by the centroid of the vectors of the keywords in the document, and the similarity between the query vector and the document vector is calculated. The search is carried out by outputting results in descending order of similarity. Unlike keyword matching, concept search has the advantage that documents on topics close to the query can be retrieved even when the keywords do not match exactly.
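The centroid-based concept search described above can be illustrated with a minimal sketch. The text only says that a "similarity" between the query vector and the document vector is calculated; cosine similarity is assumed here, and the function names and the toy 2-dimensional vectors are hypothetical values invented for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def concept_search(query_keywords, documents, keyword_vectors):
    """Rank documents by centroid-to-centroid similarity, highest first."""
    q_vec = centroid([keyword_vectors[k] for k in query_keywords])
    scored = []
    for doc_id, keywords in documents.items():
        d_vec = centroid([keyword_vectors[k] for k in keywords])
        scored.append((doc_id, cosine(q_vec, d_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
keyword_vectors = {
    "card": [1.0, 0.0],
    "payment": [0.9, 0.1],
    "mail": [0.0, 1.0],
}
documents = {"A": ["payment"], "B": ["mail"]}
ranking = concept_search(["card"], documents, keyword_vectors)
```

Document "A" ranks first even though it does not contain the query keyword "card", because the two keyword vectors point in nearly the same direction.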

JP 2010-123036 A
JP 2010-182041 A

However, in a conventional search system based on query expansion, the number of keywords to add must be determined manually. If too many keywords are added, the risk increases that keywords with little relevance to the original query keywords will match. Conversely, if too few keywords are added, it becomes more likely that no keyword matches even with the expanded query. It is therefore difficult to perform query expansion with an appropriate number of added keywords.
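The trade-off around the expansion count can be sketched as follows. This is only an illustration of generic keyword matching with query expansion, not the method of Patent Document 1; the function names, the `related_terms` table, and the `max_expansions` parameter are hypothetical.

```python
def keyword_match(query_keywords, documents):
    """Return IDs of documents sharing at least one exact keyword with the query."""
    query = set(query_keywords)
    return [doc_id for doc_id, keywords in documents.items() if query & set(keywords)]

def expand_query(query_keywords, related_terms, max_expansions):
    """Append up to max_expansions related terms per query keyword.

    max_expansions is the hand-tuned parameter whose choice is difficult.
    """
    expanded = list(query_keywords)
    for keyword in query_keywords:
        expanded.extend(related_terms.get(keyword, [])[:max_expansions])
    return expanded

# Toy data, invented for illustration; related terms are ordered by relevance.
documents = {"A": ["online payment"], "B": ["mail"]}
related_terms = {"credit card": ["card", "online payment", "mail"]}

no_expansion = keyword_match(["credit card"], documents)
two_terms = keyword_match(expand_query(["credit card"], related_terms, 2), documents)
three_terms = keyword_match(expand_query(["credit card"], related_terms, 3), documents)
```

With no expansion nothing matches, with two added terms only document "A" matches, and with three added terms the weakly related document "B" matches as well.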

A concept-search type of search using concept vectors can retrieve documents whose content is conceptually close without expanding the query. However, concept search does not take the weight of each word in the document into account, so a document may receive a high score when its important words differ from those of the query but the remaining parts (function words and the like) match. In addition, because a search using concept vectors computes the centroid vector of all the words in a document, search accuracy drops when, for example, the document is long.

The present invention has been made to solve the above problems, and its object is to provide an information search device, a dictionary creation device, a method, and a program that can accurately retrieve documents related to a query.

To achieve the above object, an information search device according to a first invention includes: a search index that, for each search target document included in a search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document; a document database that, for each of the search target documents, is a combination of the document content of the search target document and the document ID of the search target document; a concept similarity dictionary that is created for each of the search target documents based on concept vectors, the search index, and the document database, and that records, for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity, the concept vectors being n-dimensional vectors that are created based on a concept document set and that each represent one of the concept document keywords included in the concept documents; and a score calculation unit that, based on an input query, the search index, and the concept similarity dictionary, calculates, for each search target document included in the document database, a relevance score with respect to the search target document, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

An information search method according to a second invention is performed in an information search device that includes: a search index that, for each search target document included in a search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document; a document database that, for each of the search target documents, is a combination of the document content of the search target document and the document ID of the search target document; a concept similarity dictionary that is created for each of the search target documents based on concept vectors and the search index, and that records, for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity, the concept vectors being n-dimensional vectors that are created based on a concept document set and that each represent one of the concept document keywords included in the concept documents; and a score calculation unit. In this method, the score calculation unit calculates, based on an input query, the search index, and the concept similarity dictionary, for each search target document included in the document database, a relevance score with respect to the search target document, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

According to the first and second inventions, the score calculation unit calculates, based on the input query, the search index, and the concept similarity dictionary, a relevance score for each search target document included in the document database, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

Thus, by calculating, based on the input query, the search index, and the concept similarity dictionary, a relevance score for each search target document included in the document database from the similarity between a keyword included in the query and the search target document keyword similar to it, and from the weight of the search target document keyword, documents related to the query can be retrieved accurately.

An information search device according to a third invention includes: a search index creation unit that creates a search index storing, for each search target document included in a search target document set, a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document; a document database that, for each of the search target documents, is a combination of the document content of the search target document and the document ID of the search target document; a concept vector model creation unit that creates, based on a concept document set, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents; a concept similarity dictionary creation unit that creates, based on the concept vectors, the search index, and the document database, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity; and a score calculation unit that, based on an input query, the search index, and the concept similarity dictionary, calculates, for each search target document included in the document database, a relevance score with respect to the search target document, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

An information search method according to a fourth invention is performed in an information search device that includes: a document database that, for each search target document included in a search target document set, is a combination of the document content of the search target document and the document ID of the search target document; a search index creation unit; a concept vector model creation unit; a concept similarity dictionary creation unit; and a score calculation unit. In this method, the search index creation unit creates a search index that, for each search target document included in the search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document. The concept vector model creation unit creates, based on a concept document set, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents. The concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity. The score calculation unit calculates, based on an input query, the search index, and the concept similarity dictionary, for each search target document included in the document database, a relevance score with respect to the search target document, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

According to the third and fourth inventions, the search index creation unit creates a search index that, for each search target document included in the search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document. The concept vector model creation unit creates, based on the concept document set, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents. The concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity. The score calculation unit then calculates, based on the input query, the search index, and the concept similarity dictionary, a relevance score for each search target document included in the document database, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

Thus, documents related to the query can be retrieved accurately by creating a search index for each search target document included in the search target document set, creating concept vectors based on the concept document set, creating a concept similarity dictionary for each of the search target documents based on the concept vectors and the search index, and calculating, based on the input query, the search index, and the concept similarity dictionary, a relevance score for each search target document included in the document database.

A dictionary creation device according to a fifth invention includes: a search index creation unit that creates a search index storing, for each search target document included in a search target document set, a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document; a document database that, for each of the search target documents, is a combination of the document content of the search target document and the document ID of the search target document; a concept vector model creation unit that creates, based on a concept document set, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents; and a concept similarity dictionary creation unit that creates, based on the concept vectors, the search index, and the document database, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity.

A dictionary creation method according to a sixth invention is performed in a dictionary creation device that includes: a document database that, for each search target document included in a search target document set, is a combination of the document content of the search target document and the document ID of the search target document; a search index creation unit; a concept vector model creation unit; and a concept similarity dictionary creation unit. In this method, the search index creation unit creates a search index that, for each search target document included in the search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document. The concept vector model creation unit creates, based on a concept document set provided for creating concept vectors that represent words as n-dimensional vectors, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents. The concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity.

According to the fifth and sixth inventions, the search index creation unit creates a search index that, for each search target document included in the search target document set, is a combination of the weight of a search target document keyword included in the search target document, the search target document keyword, and a document ID representing the search target document. The concept vector model creation unit creates, based on a concept document set provided for creating concept vectors that represent words as n-dimensional vectors, concept vectors that are n-dimensional vectors each representing one of the concept document keywords included in the concept documents. The concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary that records, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword with the highest similarity together with that similarity.

Thus, by creating a search index for each search target document included in the search target document set, creating concept vectors based on the concept document set, and creating a concept similarity dictionary for each of the search target documents based on the concept vectors and the search index, a concept similarity dictionary for accurately retrieving documents related to a query can be created.

The program of the present invention causes a computer to function as each unit of the information search device or the dictionary creation device described above, or causes a computer to execute each step of the information search method or the dictionary creation method described above.

As described above, according to the information search device, method, and program of the present invention, documents related to a query can be retrieved accurately by calculating, based on the input query, the search index, and the concept similarity dictionary, a relevance score for each search target document included in the document database, using the similarity between a keyword included in the query and the search target document keyword similar to that keyword, and the weight of the search target document keyword.

Further, according to the information search device, dictionary creation device, method, and program, a concept similarity dictionary for accurately retrieving documents related to a query can be created by creating a search index for each search target document included in the search target document set, creating concept vectors based on the concept document set, and creating a concept similarity dictionary for each of the search target documents based on the concept vectors and the search index.

A block diagram showing the functional configuration of the information search device according to the embodiment of the present invention.
A diagram showing an example of the search index.
A diagram showing an example of the concept vector model.
A diagram showing an example of the concept similarity dictionary.
A diagram showing an example of a calculation performed with the information search device according to the embodiment.
A diagram showing an example of the data used in the example calculation performed with the information search device according to the embodiment.
A flowchart of the data creation processing routine in the information search device according to the embodiment of the present invention.
A flowchart of the information search processing routine in the information search device according to the embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
First, an outline of the embodiment of the present invention is described.

The key point of this embodiment is that concept search is performed keyword by keyword. The search index stores combinations of the weight of a search target document keyword included in a search target document, the search target document keyword, and a document ID representing the search target document. For the keywords of the search target documents contained in this search index, the similarity relationships with the keywords of the concept vector model are calculated in advance.
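A minimal sketch of such a search index is shown below, assuming that documents have already been split into keywords and that the weight is a TF-IDF value. The function name, the tuple layout `(doc_id, keyword, weight)`, and the `+ 1.0` smoothing term in the IDF are assumptions made for illustration.

```python
import math
from collections import Counter

def build_search_index(documents):
    """Build a list of (doc_id, keyword, weight) entries.

    documents maps a document ID to its list of keywords.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for keywords in documents.values():
        doc_freq.update(set(keywords))  # count each keyword once per document

    index = []
    for doc_id, keywords in documents.items():
        term_freq = Counter(keywords)
        for keyword, tf in term_freq.items():
            # The +1.0 keeps keywords that occur in every document from
            # getting weight zero; this smoothing is an assumption.
            idf = math.log(n_docs / doc_freq[keyword]) + 1.0
            index.append((doc_id, keyword, tf * idf))
    return index

# Toy documents, invented for illustration only.
documents = {
    "A": ["online", "payment", "service"],
    "B": ["shopping", "service", "mail"],
}
search_index = build_search_index(documents)
```

The keyword "service" appears in both documents and therefore receives a lower weight than keywords that are specific to one document.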

For a keyword w in a document d in the search index of the search target document set D, the similarity in the concept space between w and every keyword registered in the concept vector model is calculated and recorded. This is applied to every keyword in document d and to every document in the document set D.
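The precomputation described above might look like the following sketch. Cosine similarity, the nested-dictionary layout, and the decision to skip keywords that have no concept vector are all assumptions; the text does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_similarity_dictionary(index_keywords, concept_vectors):
    """Map each indexed keyword w to {concept keyword: similarity to w}."""
    dictionary = {}
    for w in index_keywords:
        if w not in concept_vectors:
            continue  # skipping out-of-vocabulary keywords is an assumption
        w_vec = concept_vectors[w]
        dictionary[w] = {
            c: cosine(w_vec, c_vec) for c, c_vec in concept_vectors.items()
        }
    return dictionary

# Toy concept vector model, invented for illustration only.
concept_vectors = {
    "payment": [0.9, 0.1],
    "card": [1.0, 0.0],
    "mail": [0.0, 1.0],
}
index_keywords = ["payment", "mail", "coupon"]
similarity_dictionary = build_similarity_dictionary(index_keywords, concept_vectors)
```

Because the dictionary is built once per document collection, no vector arithmetic is needed at query time; the search only performs lookups.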

When a query Q is input to a search system that implements the information search device according to this embodiment, the relevance score between the query Q and a document d is calculated by using the word similarities calculated and recorded in advance to find, within document d, the keyword with the highest similarity to a keyword q in the query Q.

If the keyword in document d with the highest similarity to keyword q is w, a score is calculated using the weight of w (TF-IDF or the like) and the similarity between keywords q and w. This is calculated for every keyword in the query Q, and the sum of the scores over all keywords in the query Q is the relevance score between the query Q and document d. The relevance score is calculated for every document in the search target document set D, and the documents are finally sorted by relevance score to retrieve the documents that match the query Q.
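The scoring procedure above can be sketched as follows. The text says only that the score uses the weight of w and the similarity between q and w; multiplying the two is an assumption, as are the function names, the data layout, and the toy numbers.

```python
def relevance_score(query_keywords, doc_keyword_weights, similarity):
    """Sum, over query keywords q, of weight(w) * sim(q, w) for the best w in d.

    doc_keyword_weights maps each keyword of one document to its weight.
    similarity maps a document keyword to {other keyword: similarity}.
    """
    total = 0.0
    for q in query_keywords:
        best_w, best_sim = None, 0.0
        for w in doc_keyword_weights:
            sim = similarity.get(w, {}).get(q, 0.0)
            if sim > best_sim:
                best_w, best_sim = w, sim
        if best_w is not None:
            # Combining weight and similarity by product is an assumption.
            total += doc_keyword_weights[best_w] * best_sim
    return total

def search(query_keywords, index, similarity):
    """Score every document and return (doc_id, score) pairs, best first."""
    scores = {
        doc_id: relevance_score(query_keywords, weights, similarity)
        for doc_id, weights in index.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy weights and similarities, invented for illustration only.
index = {
    "A": {"online payment": 2.0},
    "B": {"shopping service": 0.5, "mail": 1.0},
}
similarity = {
    "online payment": {"credit card": 0.8, "shopping": 0.3},
    "shopping service": {"credit card": 0.2, "shopping": 0.9},
    "mail": {"credit card": 0.1, "shopping": 0.1},
}
results = search(["shopping", "credit card"], index, similarity)
```

With these invented numbers, document "A" scores 0.3 × 2.0 + 0.8 × 2.0 = 2.2 and document "B" scores 0.9 × 0.5 + 0.2 × 0.5 = 0.55, so "A" is ranked first.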

As a result, the information search device according to this embodiment can perform a search that combines the advantages of keyword-match search using query expansion and concept search using concept vectors. Because a query keyword is always matched one-to-one with a keyword of the search target document, there is no need to set the number of keywords to add, as there is in query expansion, and the possibility of expansion terms matching more than necessary is eliminated. Because keyword weights such as TF-IDF can be used in the score calculation, keywords are not all treated equally as they are in concept search, and the search can take important keywords into account.

For example, consider searching a set of Internet-related documents with the following query and documents:

Query: "Credit card cannot be used for shopping"
Document A: "About online payment for our services"
Document B: "How to use e-mail in the shopping service"

Given this query and documents A and B, concept search tends to give document B the higher score for the query. This is because it judges by overall sentence similarity, such as "shopping" and "shopping service", or "use" and "how to use".

In the information search device according to this embodiment, however, document A is given the high score. This is because, in addition to the similarity between "credit card" and "online payment", the weight of the keyword "online payment" itself can be taken into account.

The operation of the information search device according to this embodiment is equivalent to using the concept vector model to replace each keyword in the query with a keyword that appears in the search target document. It is therefore effective for matching abbreviations and synonyms, such as "kureka" (a Japanese abbreviation of "credit card") and "credit card", or "net" and "Internet". This is particularly useful for natural-language search and voice search, where wording varies widely.

<Configuration of Information Search Apparatus According to Embodiment of the Present Invention>
Next, the configuration of the information search apparatus according to the embodiment of the present invention will be described. As shown in FIG. 1, the information search apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores programs for executing a data creation processing routine and an information search processing routine, both described later, together with various data. Functionally, as shown in FIG. 1, the information search apparatus 100 includes an input unit 10, a calculation unit 20, and a result output unit 90.

The input unit 10 receives a document set to be searched (hereinafter, the search target document set) and a document set for creating the concept vector model 42 (hereinafter, the concept document set), and outputs them to the similarity calculation unit 30. The method of collecting the concept document set is not particularly restricted. For example, a set of Wikipedia (registered trademark) pages whose content matches the search target document set may be used, or the set of Web pages returned by a Web search that uses keywords extracted from the search target document set as queries. Each document in the search target document set is referred to as a search target document, and each document in the concept document set as a concept document.

また、入力部１０は、ユーザにより入力されたクエリ（以後、入力クエリ）を受け付け、類似度一致検索部５０に出力する。 Further, the input unit 10 accepts a query (hereinafter referred to as an input query) input by the user and outputs it to the similarity matching search unit 50.

演算部２０は、類似度計算部３０と、記憶部４０と、類似度一致検索部５０とを含んで構成されている。 The calculation unit 20 includes a similarity calculation unit 30, a storage unit 40, and a similarity match search unit 50.

類似度計算部３０は、入力部１０から受け付けた検索対象文書集合に基づいて、検索インデックス４４、及び文書データベース４６を作成し、記憶部４０に記憶する。また、類似度計算部３０は、入力部１０から受け付けた概念文書集合に基づいて、概念ベクトルモデル４２を作成し、記憶部４０に記憶する。また、類似度計算部３０は、入力部１０から受け付けた検索対象文書集合、及び概念文書集合に基づいて、概念類似度辞書４８を作成し、記憶部４０に記憶する。また、類似度計算部３０は、キーワード抽出部３２と、検索インデックス作成部３４と、概念ベクトルモデル作成部３６と、概念類似度辞書作成部３８とを含んで構成されている。なお、類似度計算部が、本発明に係る辞書作成装置の一例である。 The similarity calculation unit 30 creates a search index 44 and a document database 46 based on the search target document set received from the input unit 10 and stores the search index 44 and the document database 46 in the storage unit 40. The similarity calculation unit 30 creates a concept vector model 42 based on the concept document set received from the input unit 10 and stores the concept vector model 42 in the storage unit 40. Further, the similarity calculation unit 30 creates a concept similarity dictionary 48 based on the search target document set and the concept document set received from the input unit 10 and stores them in the storage unit 40. In addition, the similarity calculation unit 30 includes a keyword extraction unit 32, a search index creation unit 34, a concept vector model creation unit 36, and a concept similarity dictionary creation unit 38. The similarity calculation unit is an example of a dictionary creation device according to the present invention.

The keyword extraction unit 32 divides each search target document in the search target document set and each concept document in the concept document set, received from the input unit 10, into keyword units. The keyword extraction unit 32 outputs each search target document divided into keyword units to the search index creation unit 34, and each concept document divided into keyword units to the concept vector model creation unit 36. A keyword unit obtained by dividing a search target document is referred to as a search target document keyword, and a keyword unit obtained by dividing a concept document as a concept document keyword.

Here, keywords are created as follows: for English, the text is split at word boundaries; for Japanese, keywords are derived from the result of morphological analysis by processing such as joining consecutive nouns or extracting only nouns and verbs. The processing method and rules for keyword creation can be set freely according to the search target and other conditions. However, keywords are extracted from the search target document set and the concept document set by the same processing procedure and rules (including the morphological analysis dictionary).
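The English case of the keyword creation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_keywords` and the stop-word list are assumptions introduced here, and the Japanese case (morphological analysis) is not covered. The point it shows is that one and the same function is applied to search target documents, concept documents, and queries.

```python
import re

# Hypothetical stop-word list; the source does not prescribe one.
STOP_WORDS = {"a", "an", "the", "is", "for", "of", "to", "our"}

def extract_keywords(text: str) -> list[str]:
    """Split English text at word boundaries and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# The same rule is applied to every kind of input.
query_keywords = extract_keywords("Credit card is not available for shopping")
document_keywords = extract_keywords("Online payment for our services")
```

Reusing one function for all inputs is what keeps query keywords and document keywords comparable later in the pipeline.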

検索インデックス作成部３４は、キーワード抽出部３２から入力された、キーワード単位に分割した検索対象文書の各々に基づいて、例えば、図２に示すような検索用のインデックスである検索インデックス４４を作成し、記憶部４０に記憶すると共に、文書データベース４６を作成し、記憶部４０に記憶する。なお、検索インデックス４４は、一般に情報検索システムで使用している転置インデックスと同様のものとなる。 The search index creation unit 34 creates, for example, a search index 44 that is a search index as shown in FIG. 2 based on each of the search target documents input from the keyword extraction unit 32 and divided into keyword units. The document database 46 is created and stored in the storage unit 40. The search index 44 is generally the same as an inverted index used in an information search system.

Here, the search index 44 shown in FIG. 2 consists of the document ID serving as the key of a search target document, the search target document keywords in that document, and the weight of each search target document keyword. TF-IDF is used to calculate the weights, although any weighting algorithm such as BM25 may be used instead. The search index 44 is also used for the score calculation in the similarity match search unit 50.

また、文書データベース４６には、文書ＩＤがＫｅｙとなり、検索対象文書本文が記録されている。 In the document database 46, the document ID is “Key” and the text of the search target document is recorded.

Specifically, for each search target document divided into keyword units, the search index creation unit 34 sets a document ID serving as the key, calculates the weight of each search target document keyword contained in the document, and adds to the search index one index data entry per search target document keyword, each entry consisting of the document ID, the search target document keyword, and the weight of that keyword.
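A minimal sketch of this index construction is given below. The function name `build_search_index`, the row layout, and the particular TF-IDF variant (relative term frequency times a smoothed logarithmic IDF) are assumptions made for illustration; the source only states that TF-IDF or another weighting such as BM25 is used.

```python
import math
from collections import Counter

def build_search_index(docs: dict[str, list[str]]) -> list[tuple[str, str, float]]:
    """Return one (document ID, keyword, weight) row per keyword of each document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each keyword occurs.
    df = Counter(kw for kws in docs.values() for kw in set(kws))
    index = []
    for doc_id, kws in docs.items():
        for kw, count in Counter(kws).items():
            tf = count / len(kws)
            # Assumed smoothing (+1) so keywords shared by all documents keep a non-zero weight.
            idf = math.log(n_docs / df[kw]) + 1.0
            index.append((doc_id, kw, tf * idf))
    return index

index = build_search_index({
    "A": ["online", "payment", "service"],
    "B": ["shopping", "service", "mail"],
})
```

In this toy input, "service" occurs in both documents and therefore receives a lower weight than "payment", which occurs in only one.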

また、検索インデックス作成部３４は、取得したキーワード単位に分割した検索対象文書の各々について、検索インデックスを作成する際に設定された当該検索対象文書の文書ＩＤと、当該検索対象文書の検索対象文書本文（文書内容）とを組み合わせて、文書データベース４６に追加する。 The search index creation unit 34 also sets the document ID of the search target document set when creating the search index and the search target document of the search target document for each of the search target documents divided into the obtained keyword units. The text (document content) is combined and added to the document database 46.

The concept vector model creation unit 36 creates the concept vector model 42 based on each of the concept documents divided into keyword units, input from the keyword extraction unit 32, and stores it in the storage unit 40. Here, the concept vector model 42 is, for example, as shown in FIG. 3, a model consisting of a concept vector for each concept document keyword, in which the word is expressed as an n-dimensional vector of continuous values. In the present embodiment, LSI (latent semantic indexing) using singular value decomposition is used to create the concept vector model 42. Instead of LSI using singular value decomposition, any other model, such as a topic model or a model using a neural network, may be adopted.
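The shape of such a model, a mapping from each concept document keyword to a numeric vector, can be illustrated with a deliberately simplified stand-in. The sketch below does not perform LSI: it uses the raw rows of the keyword-by-document count matrix that LSI would factorize, and omits the singular value decomposition and dimensionality reduction. The function name `build_concept_vectors` is an assumption introduced here.

```python
def build_concept_vectors(concept_docs: list[list[str]]) -> dict[str, list[float]]:
    """Map each keyword to its row of the keyword-by-document count matrix.

    Stand-in only: a real LSI model would apply SVD to this matrix and keep
    the leading n dimensions as the concept vectors.
    """
    vocabulary = sorted({kw for doc in concept_docs for kw in doc})
    return {kw: [float(doc.count(kw)) for doc in concept_docs] for kw in vocabulary}

vectors = build_concept_vectors([
    ["credit", "card", "payment"],
    ["credit", "payment", "online"],
    ["mail", "shopping"],
])
```

Keywords that occur in the same concept documents end up with similar vectors, which is the property the later similarity calculation relies on.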

The concept similarity dictionary creation unit 38 calculates, based on the search index 44 and the concept vector model 42 stored in the storage unit 40, the word similarity between the concept document keywords of the concept vector model 42 and the search target document keywords of each search target document, creates the concept similarity dictionary 48 that summarizes the calculation results, and stores it in the storage unit 40.

図４に、概念類似度辞書４８の一例を示す。概念類似度辞書４８は概念ベクトルモデル４２の概念文書キーワード、文書ＩＤ、検索対象文書の検索対象文書キーワード、キーワード間の類似度から構成される。ここで、概念ベクトルモデル４２中にある概念文書キーワードに対し、検索対象文書集合の各検索対象文書において最も類似度が高い検索対象文書キーワードを抽出し記録している。 FIG. 4 shows an example of the concept similarity dictionary 48. The concept similarity dictionary 48 includes the concept document keyword of the concept vector model 42, the document ID, the search target document keyword of the search target document, and the similarity between the keywords. Here, with respect to the concept document keyword in the concept vector model 42, the search target document keyword having the highest similarity in each search target document of the search target document set is extracted and recorded.

With this processing, when a concept document keyword in the concept vector model 42 is input as a query, the search target document keyword in a search target document to which it should be mapped can be looked up immediately. Because a search target document keyword with a higher similarity is more closely related, the keyword with the highest similarity must be chosen for the mapping. Search target document keywords that appear in a search target document but do not have the highest similarity are not used in the present embodiment, so the information search apparatus 100 does not record them, which reduces the disk and memory capacity required.

This processing is performed for every combination of a concept document keyword of the concept vector model 42 and a document of the search target document set. When the concept vector model 42 contains 100,000 keywords and there are 500 search target documents, a dictionary covering 100,000 × 500 combinations is created.

The similarity is calculated with a measure, such as cosine similarity, whose range can be normalized to between 0 and 1. To reduce the amount of calculation and the memory capacity of the dictionary, a policy may be adopted in which, when a keyword of the concept vector model 42 and a keyword of the search target document match exactly, the similarity is not calculated and nothing is recorded in the concept similarity dictionary 48 (because it is self-evident that the similarity is the maximum). It is also possible not to record an entry in the concept similarity dictionary 48 when the calculated similarity is at or below a set threshold (the similarity is then regarded as 0). The created concept similarity dictionary 48 is used by the similarity match search unit 50. The calculation for creating the similarity dictionary can also be performed by distributed processing or the like.
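The dictionary creation described above can be sketched as follows, assuming concept vectors are already available as plain lists of floats. The function names, the dictionary layout keyed by (concept keyword, document ID), and the clamping of cosine similarity to the range 0 to 1 are assumptions made for this illustration. The sketch applies both optional policies from the text: exact matches are skipped, and similarities at or below the threshold are not recorded.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, clamped to [0, 1] (the clamping is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return min(1.0, max(0.0, dot / norm)) if norm else 0.0

def build_similarity_dictionary(concept_vectors, doc_keywords, threshold=0.0):
    """Return {(concept keyword, document ID): (most similar document keyword, similarity)}."""
    dictionary = {}
    for concept_kw, concept_vec in concept_vectors.items():
        for doc_id, keywords in doc_keywords.items():
            if concept_kw in keywords:
                continue  # exact match: similarity is trivially maximal, nothing is stored
            best_kw, best_sim = None, threshold
            for kw in keywords:
                if kw not in concept_vectors:
                    continue  # no vector available for this document keyword
                sim = cosine(concept_vec, concept_vectors[kw])
                if sim > best_sim:  # only similarities above the threshold are kept
                    best_kw, best_sim = kw, sim
            if best_kw is not None:
                dictionary[(concept_kw, doc_id)] = (best_kw, best_sim)
    return dictionary
```

Only the single best keyword per (concept keyword, document) pair is stored, which is what keeps the dictionary at one entry per combination.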

記憶部４０には、概念ベクトルモデル４２、検索インデックス４４、文書データベース４６、及び概念類似度辞書４８が記憶されている。 The storage unit 40 stores a concept vector model 42, a search index 44, a document database 46, and a concept similarity dictionary 48.

The similarity match search unit 50 calculates a score between the input query and each of the search target documents, based on the input query input from the input unit 10 and on the search index 44, the concept similarity dictionary 48, and the document database 46 stored in the storage unit 40, and outputs a result based on the scores from the result output unit 90. The score represents the degree of relevance between the input query and the search target document in question.

また、類似度一致検索部５０は、クエリキーワード抽出部５２と、スコア計算部６０とを含んで構成されている。 The similarity match search unit 50 includes a query keyword extraction unit 52 and a score calculation unit 60.

クエリキーワード抽出部５２は、入力部１０から入力された入力クエリについて、キーワード抽出部３２と同様（処理手順、及びルール）の処理に従って、キーワード単位に分割し、スコア計算部６０に送信する。なお、ここで、入力クエリが自然文、又は音声入力文の場合には、クエリキーワード抽出部５２における処理を行うが、入力クエリが、既にキーワード単位になっている場合には、クエリキーワード抽出部５２における処理を実行しない。 The query keyword extraction unit 52 divides the input query input from the input unit 10 into keyword units according to the same processing (processing procedure and rule) as the keyword extraction unit 32 and transmits the keyword to the score calculation unit 60. Here, when the input query is a natural sentence or a voice input sentence, the processing in the query keyword extraction unit 52 is performed. However, when the input query is already in units of keywords, the query keyword extraction unit The process in 52 is not executed.

The score calculation unit 60 calculates a relevance score between the input query and each of the search target documents, based on the input query divided into keyword units acquired from the query keyword extraction unit 52, the search index 44, the document database 46, and the concept similarity dictionary 48, and outputs a result based on the relevance scores from the result output unit 90.

また、スコア計算部６０は、概念類似度参照部６２と、計算部６４とを含んで構成されている。 The score calculation unit 60 includes a concept similarity reference unit 62 and a calculation unit 64.

The concept similarity reference unit 62 selects one search target document included in the document database 46 and acquires its document ID. It also selects one keyword from the input query divided into keyword units, acquired from the query keyword extraction unit 52. It then acquires, as the reference keyword, the content of the “keyword (search document)” field of the entry in the concept similarity dictionary 48 whose “keyword (concept)” field matches the selected keyword and whose “document ID” field matches the acquired document ID. This reference keyword acquisition is performed for every keyword included in the input query.

If, for a keyword included in the input query, the concept similarity dictionary 48 contains no entry whose “keyword (concept)” field matches the keyword and whose “document ID” field matches the acquired document ID, that keyword is excluded from subsequent processing.
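The lookup and exclusion rule can be sketched as below. The dictionary layout, keyed by a (concept keyword, document ID) pair and holding the reference keyword with its similarity, and the function name `lookup_reference_keywords` are assumptions made for illustration.

```python
def lookup_reference_keywords(query_keywords, doc_id, similarity_dictionary):
    """Return {query keyword: (reference keyword, similarity)} for one document.

    Query keywords without a matching dictionary entry are left out,
    mirroring their exclusion from subsequent processing.
    """
    references = {}
    for keyword in query_keywords:
        entry = similarity_dictionary.get((keyword, doc_id))
        if entry is None:
            continue  # no entry for this keyword and document ID
        references[keyword] = entry
    return references

# Hypothetical dictionary content for illustration.
dictionary = {("kureka", "A"): ("credit card", 0.9)}
found = lookup_reference_keywords(["kureka", "weather"], "A", dictionary)
```

Here "weather" has no entry for document A and is dropped, while "kureka" is mapped to the reference keyword "credit card".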

また、処理対象から除外すると判定されたキーワードを、特定のデータベースに記憶し、情報検索装置１００の一連の処理が終了した後に、当該データベースに含まれるキーワードに基づいて、当該キーワードに関連する検索対象文書集合と、概念文書集合とを、当該キーワードに基づいてインターネット等を検索することによって、受け付け、上述の類似度計算部３０の処理を行ってもよい。 In addition, the keyword determined to be excluded from the processing target is stored in a specific database, and after a series of processing of the information search apparatus 100 is completed, the search target related to the keyword is based on the keyword included in the database. The document set and the concept document set may be received by searching the Internet or the like based on the keyword, and the above-described similarity calculation unit 30 may perform the process.

計算部６４は、概念類似度参照部６２において選択された検索対象文書と、入力クエリとの関連度スコアを、下記（１）式に従って、算出する。なお、関連度スコアの計算には、検索インデックス４４に記憶されている検索対象文書キーワードの重みと、概念類似度辞書４８のキーワード間の類似度とを用いる。また、下記（１）式において、選択した検索対象文書をｄ、入力クエリをＱとする。また、本実施形態においては、関連度スコアは、値が大きい程関連度が高いことを表わすものとする。 The calculation unit 64 calculates the relevance score between the search target document selected by the concept similarity reference unit 62 and the input query according to the following equation (1). In calculating the relevance score, the weight of the search target document keyword stored in the search index 44 and the similarity between the keywords in the concept similarity dictionary 48 are used. In the following formula (1), the selected search target document is d and the input query is Q. Further, in the present embodiment, the relevance score represents that the relevance is higher as the value is larger.

ここで、ｑは入力クエリＱ中に含まれるキーワード、ｗは概念類似度参照部６２において取得した参照キーワード（キーワードｑと類似度最大でマッチする検索対象文書ｄ中の検索対象文書キーワード）、ｗｅｉｇｈｔ（ｗ）は、参照キーワードｗの重み、ｓｉｍ（ｑ,ｗ）はキーワードｑと参照キーワードｗの類似度である。 Here, q is a keyword included in the input query Q, w is a reference keyword acquired in the concept similarity reference unit 62 (a search target document keyword in the search target document d that matches the keyword q with the maximum similarity), and weight (W) is the weight of the reference keyword w, and sim (q, w) is the similarity between the keyword q and the reference keyword w.
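Equation (1) itself is not reproduced in this text. One form consistent with the symbol definitions above and with the worked description elsewhere in this embodiment (the weight of the matched keyword multiplied by the similarity) is a sum over the query keywords. This is an assumed reconstruction, not the published formula, and the original may differ, for example in normalization:

```latex
\mathrm{score}(Q, d) = \sum_{q \in Q} \mathrm{weight}(w) \cdot \mathrm{sim}(q, w),
\qquad w = \operatorname*{arg\,max}_{w' \in d} \mathrm{sim}(q, w')
```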

また、類似度は０．０が最小値とし、１．０が最大値（キーワード完全一致検索と同様の扱い）となるようにする。 The similarity is set to 0.0 as the minimum value and 1.0 as the maximum value (similar to the keyword exact match search).

When relevance scores have been calculated for all search target documents included in the document database 46, the calculation unit 64 sorts the document texts of the search target documents in the document database 46 in descending order of relevance score and outputs the top N from the result output unit 90 as the search results for the user's query.

また、計算部６４は、文書データベース４６に含まれる検索対象文書の全てについて、関連度スコアを算出していない場合には、概念類似度参照部６２の処理と計算部６４との処理を繰り返す。このようにすることにより、検索対象文書集合の全ての検索対象文書で関連度スコアを計算することができる。 In addition, when the relevance score is not calculated for all the search target documents included in the document database 46, the calculation unit 64 repeats the process of the concept similarity reference unit 62 and the process of the calculation unit 64. In this way, the relevance score can be calculated for all the search target documents in the search target document set.

On the concept vector model, the similarity between related keywords is high. In particular, the similarity between keywords in a synonymous relationship, such as spelling variants and abbreviations like “kureka” and “credit card” or “sumaho” and “smartphone”, is extremely high (for example, 0.9 or more).

In this case, the weight of the search target document keyword in the search target document is used almost unchanged in the relevance score (for example, the score of “kureka” = the score of “credit card” × 0.9). Conversely, the similarity between weakly related keywords is low. In the method used in the present embodiment, a query keyword is always matched to one of the search target document keywords in the search target document.

Therefore, when a weakly related keyword is matched, multiplying by the similarity reduces its influence (in the relevance score calculation, the weight of the search target document keyword is multiplied by a similarity such as 0.2 to 0.3).
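The damping effect described here can be seen in a small numeric sketch. It assumes that the relevance score is the sum, over the matched query keywords, of the reference keyword's weight multiplied by its similarity; the function name `relevance_score` and all weights and similarities below are hypothetical values chosen for illustration.

```python
def relevance_score(references, doc_weights):
    """Sum weight(w) * sim(q, w) over matched query keywords (assumed form of the score)."""
    return sum(doc_weights[w] * sim for w, sim in references.values())

# Hypothetical matches: a near-synonym (0.9) and a weakly related keyword (0.2).
references = {"kureka": ("credit card", 0.9), "weather": ("mail", 0.2)}
doc_weights = {"credit card": 2.0, "mail": 1.5}

score = relevance_score(references, doc_weights)
# Contribution of "kureka": 2.0 * 0.9 = 1.8; contribution of "weather": 1.5 * 0.2 = 0.3
```

The near-synonym contributes almost the full weight of its reference keyword, while the weak match contributes only a fifth of it.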

FIG. 5 shows an example of the relevance score calculation performed in the information search apparatus 100 according to the present embodiment. In this example, a concept vector model 42 created in advance is used. The search target document set, document database 46, search index 44, and concept similarity dictionary 48 used in the example are those shown in FIG. 6.

<Operation of Information Search Apparatus According to Embodiment of the Present Invention>
Next, the operation of the information search apparatus 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the search target document set and the concept document set, the information search apparatus 100 executes the data creation processing routine shown in FIG. 7. When the input unit 10 receives an input query after the data creation processing routine, the information search apparatus 100 executes the information search processing routine shown in FIG. 8. The data creation processing routine is an example of the dictionary creation method according to the present invention.

まず、図７に示すデータ作成処理ルーチンについて説明する。 First, the data creation processing routine shown in FIG. 7 will be described.

In step S100 of the data creation processing routine shown in FIG. 7, each search target document in the search target document set and each concept document in the concept document set received by the input unit 10 is divided into keyword units, and the search target document keywords and concept document keywords are extracted.

次に、ステップＳ１０２で、ステップＳ１００において取得したキーワード単位に分割された検索対象文書の各々に基づいて、検索インデックス４４を作成し、記憶部４０に記憶する。 Next, in step S102, a search index 44 is created based on each of the search target documents divided in keyword units acquired in step S100, and stored in the storage unit 40.

次に、ステップＳ１０４で、ステップＳ１００において取得したキーワード単位に分割された検索対象文書の各々と、ステップＳ１０２において取得した検索インデックス４４とに基づいて、文書データベース４６を作成し、記憶部４０に記憶する。 Next, in step S104, a document database 46 is created based on each of the search target documents divided in keyword units acquired in step S100 and the search index 44 acquired in step S102, and stored in the storage unit 40. To do.

次に、ステップＳ１０６で、ステップＳ１００において取得したキーワード単位に分割された概念文書の各々に基づいて、概念ベクトルモデル４２を作成し、記憶部４０に記憶する。 Next, in step S106, a concept vector model 42 is created based on each of the concept documents divided in keyword units acquired in step S100, and stored in the storage unit 40.

次に、ステップＳ１０８で、ステップＳ１０２において取得した検索インデックス４４と、ステップＳ１０６において取得した概念ベクトルモデル４２とに基づいて、概念類似度辞書４８を作成し、記憶部４０に記憶し、データ作成処理ルーチンを終了する。 Next, in step S108, based on the search index 44 acquired in step S102 and the concept vector model 42 acquired in step S106, a concept similarity dictionary 48 is created and stored in the storage unit 40, and data creation processing is performed. End the routine.

次に、図８に示す情報検索処理ルーチンについて説明する。 Next, the information search processing routine shown in FIG. 8 will be described.

まず、図８に示す情報検索処理ルーチンのステップＳ２００で、検索インデックス４４、文書データベース４６、及び概念類似度辞書４８を読み込む。 First, in step S200 of the information search processing routine shown in FIG. 8, the search index 44, the document database 46, and the concept similarity dictionary 48 are read.

次に、ステップＳ２０２で、上述のステップＳ１００と同様に、入力部１０において受け付けた入力クエリをキーワード単位に分割し、キーワードを抽出する。 Next, in step S202, as in step S100 described above, the input query received by the input unit 10 is divided into keyword units, and keywords are extracted.

次に、ステップＳ２０４で、ステップＳ２００において取得した文書データベース４６に含まれる検索対象文書のうち、処理対象となる検索対象文書を決定する。また、ステップＳ２０４で、処理対象となる検索対象文書の文書ＩＤを文書データベースから取得する。 Next, in step S204, a search target document to be processed is determined from the search target documents included in the document database 46 acquired in step S200. In step S204, the document ID of the search target document to be processed is acquired from the document database.

次に、ステップＳ２０８で、ステップＳ２０２において取得したキーワードの各々について、ステップＳ２００において取得した概念類似度辞書４８と、ステップＳ２０４において取得した処理対象となる検索対象文書の文書ＩＤとに基づいて、参照キーワードを取得する。 Next, in step S208, for each keyword acquired in step S202, reference is made based on the concept similarity dictionary 48 acquired in step S200 and the document ID of the search target document to be processed acquired in step S204. Get keywords.

次に、ステップＳ２１２で、ステップＳ２００において取得した、検索インデックス４４、及び概念類似度辞書４８と、ステップＳ２０８において取得した入力クエリのキーワードの各々の参照キーワードとに基づいて、上記（１）式に従って、処理対象となる検索対象文書の関連度スコアを計算する。 Next, in step S212, based on the search index 44 and the concept similarity dictionary 48 acquired in step S200, and each reference keyword of the keyword of the input query acquired in step S208, according to the above equation (1). The relevance score of the search target document to be processed is calculated.

次に、ステップＳ２１４で、ステップＳ２００において取得した文書データベース４６に含まれる、全ての検索対象文書についてステップＳ２０４〜ステップＳ２１２までの処理を終了したか否かを判定する。全ての検索対象文書について、ステップＳ２０４〜ステップＳ２１２までの処理を終了したと判定した場合には、情報検索処理は、ステップＳ２１６へ移行する。一方、全ての検索対象文書について、ステップＳ２０４〜ステップＳ２１２までの処理を終了していないと判定した場合には、情報検索処理は、ステップＳ２０４へ移行し、処理対象となる検索対象文書を変更し、ステップＳ２０８〜ステップＳ２１４までの処理を繰り返す。 Next, in step S214, it is determined whether or not the processing from step S204 to step S212 has been completed for all search target documents included in the document database 46 acquired in step S200. If it is determined that the processing from step S204 to step S212 has been completed for all search target documents, the information search processing proceeds to step S216. On the other hand, if it is determined that the processing from step S204 to step S212 has not been completed for all search target documents, the information search process proceeds to step S204, and the search target document to be processed is changed. The processes from step S208 to step S214 are repeated.

Next, in step S216, based on the relevance score of each search target document acquired in step S212 and on the document database 46, the document texts of the search target documents are arranged in descending order of relevance score, the top N are output from the result output unit 90, and the information search processing routine ends.
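The loop of steps S204 to S216 can be sketched end to end as follows. The data layouts (a search index keyed by (document ID, keyword), a similarity dictionary keyed by (query keyword, document ID), a document database keyed by document ID), the function name `search`, and the summed form of the score are assumptions for illustration; the numbers are hypothetical and are not taken from FIG. 5 or FIG. 6.

```python
def search(query_keywords, search_index, document_db, similarity_dictionary, top_n=10):
    """Score every document against the query and return the top N (text, score) pairs."""
    results = []
    for doc_id in document_db:                          # S204: choose the next document
        score = 0.0
        for q in query_keywords:                        # S208: look up the reference keyword
            entry = similarity_dictionary.get((q, doc_id))
            if entry is None:
                continue                                # keyword excluded for this document
            reference_kw, sim = entry
            weight = search_index.get((doc_id, reference_kw), 0.0)
            score += weight * sim                       # S212: accumulate weight * similarity
        results.append((score, doc_id))
    results.sort(key=lambda r: r[0], reverse=True)      # S216: descending relevance score
    return [(document_db[doc_id], score) for score, doc_id in results[:top_n]]

# Hypothetical data loosely following the document A / document B example.
document_db = {"A": "Online payment for our services",
               "B": "How to use e-mail in the shopping service"}
search_index = {("A", "online payment"): 2.0,
                ("B", "shopping service"): 1.0, ("B", "usage"): 0.5}
similarity_dictionary = {
    ("credit card", "A"): ("online payment", 0.8),
    ("credit card", "B"): ("usage", 0.2),
    ("shopping", "A"): ("online payment", 0.3),
    ("shopping", "B"): ("shopping service", 0.9),
}
ranking = search(["shopping", "credit card"], search_index, document_db, similarity_dictionary)
```

With these illustrative values, document A ranks above document B because the heavily weighted keyword "online payment" is matched with high similarity by "credit card".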

以上説明したように、本発明の本実施形態に係る情報検索装置によれば、入力されたクエリと、検索インデックスと、概念類似度辞書とに基づいて、文書データベースに含まれる検索対象文書の各々に対し、クエリに含まれるキーワードと類似する検索対象文書キーワードとの類似度、及び検索対象文書キーワードの重みを用いて、検索対象文書との関連度スコアを計算することにより、クエリに関連する文書を精度よく検索することができる。 As described above, according to the information search device according to the embodiment of the present invention, each of the search target documents included in the document database based on the input query, the search index, and the concept similarity dictionary. On the other hand, a document related to the query is calculated by calculating a relevance score with the search target document by using the similarity between the search target document keyword similar to the keyword included in the query and the weight of the search target document keyword. Can be searched with high accuracy.

なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、本実施形態においては、類似度計算部と、類似度一致検索部とを同一の情報検索装置に含むように構成する場合について説明したが、類似度計算部と、類似度一致検索部とを別々の装置として構成してもよい。この場合、類似度計算部を含む装置により作成された、検索インデックス、文書データベース、及び概念類似度辞書を、類似度一致検索部を含む装置で用いる。 For example, in the present embodiment, the case where the similarity calculation unit and the similarity match search unit are configured to be included in the same information search device has been described. However, the similarity calculation unit, the similarity match search unit, May be configured as separate devices. In this case, the search index, the document database, and the concept similarity dictionary created by the apparatus including the similarity calculation unit are used in the apparatus including the similarity match search unit.

また、本実施形態においては、類似度計算部による処理の後に、類似度一致検索部による処理を行う場合について説明したが、これに限定されるものではない。例えば、類似度計算部の処理をオフラインで事前に処理しておき、類似度一致検索部の処理をオンラインで実行してもよい。 In the present embodiment, the case where the process by the similarity matching search unit is performed after the process by the similarity calculation unit has been described, but the present invention is not limited to this. For example, the process of the similarity calculation unit may be processed offline in advance, and the process of the similarity match search unit may be executed online.

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能であるし、ネットワークを介して提供することも可能である。 Further, in the present specification, the embodiment has been described in which the program is installed in advance. However, the program can be provided by being stored in a computer-readable recording medium or provided via a network. It is also possible to do.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
30 Similarity calculation unit
32 Keyword extraction unit
34 Search index creation unit
36 Concept vector model creation unit
38 Concept similarity dictionary creation unit
40 Storage unit
42 Concept vector model
44 Search index
46 Document database
48 Concept similarity dictionary
50 Similarity match search unit
52 Query keyword extraction unit
60 Score calculation unit
62 Concept similarity reference unit
64 Calculation unit
90 Result output unit
100 Information search apparatus

Claims

An information search device including:
a search index that is created for each search target document included in a search target document set and that is a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and a document ID representing the search target document;
a document database that is created for each of the search target documents and that is a combination of the document content of the search target document and the document ID of the search target document;
a concept similarity dictionary that is created for each of the search target documents based on the search index, the document database, and concept vectors, each concept vector being an n-dimensional vector that represents a concept document keyword included in a concept document and being created based on a concept document set, the concept similarity dictionary recording, for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity; and
a score calculation unit that, based on an input query, the search index, and the concept similarity dictionary, calculates, for each of the search target documents included in the document database, a relevance score with respect to the search target document, using, for each search target document keyword that is similar to a keyword included in the query, the similarity to that search target document keyword and the weight of that search target document keyword.

An information search device including:
a search index creation unit that creates a search index storing, for each search target document included in a search target document set, a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and a document ID representing the search target document;
a document database that is created for each of the search target documents and that is a combination of the document content of the search target document and the document ID of the search target document;
a concept vector model creation unit that creates, based on a concept document set, concept vectors, each being an n-dimensional vector that represents a concept document keyword included in a concept document;
a concept similarity dictionary creation unit that creates, based on the concept vectors, the search index, and the document database, a concept similarity dictionary recording, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity; and
a score calculation unit that, based on an input query, the search index, and the concept similarity dictionary, calculates, for each of the search target documents included in the document database, a relevance score with respect to the search target document, using, for each search target document keyword that is similar to a keyword included in the query, the similarity to that search target document keyword and the weight of that search target document keyword.

A dictionary creation device including:
a search index creation unit that creates a search index storing, for each search target document included in a search target document set, a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and a document ID representing the search target document;
a document database that is created for each of the search target documents and that is a combination of the document content of the search target document and the document ID of the search target document;
a concept vector model creation unit that creates, based on a concept document set, concept vectors, each being an n-dimensional vector that represents a concept document keyword included in a concept document; and
a concept similarity dictionary creation unit that creates, based on the concept vectors, the search index, and the document database, a concept similarity dictionary recording, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity.

An information search method in an information search device that includes:
a search index that is created for each search target document included in a search target document set and that is a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and a document ID representing the search target document;
a document database that is created for each of the search target documents and that is a combination of the document content of the search target document and the document ID of the search target document;
a concept similarity dictionary that is created for each of the search target documents based on the search index and concept vectors, each concept vector being an n-dimensional vector that represents a concept document keyword included in a concept document and being created based on a concept document set, the concept similarity dictionary recording, for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity; and
a score calculation unit,
wherein the score calculation unit, based on an input query, the search index, and the concept similarity dictionary, calculates, for each of the search target documents included in the document database, a relevance score with respect to the search target document, using, for each search target document keyword that is similar to a keyword included in the query, the similarity to that search target document keyword and the weight of that search target document keyword.

An information search method in an information search device that includes a document database that is, for each search target document included in a search target document set, a combination of the document content of the search target document and the document ID of the search target document, a search index creation unit, a concept vector model creation unit, a concept similarity dictionary creation unit, and a score calculation unit, wherein:
the search index creation unit creates a search index that is, for each of the search target documents included in the search target document set, a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and the document ID representing the search target document;
the concept vector model creation unit creates, based on a concept document set, concept vectors, each being an n-dimensional vector that represents a concept document keyword included in a concept document;
the concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary recording, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity; and
the score calculation unit, based on an input query, the search index, and the concept similarity dictionary, calculates, for each of the search target documents included in the document database, a relevance score with respect to the search target document, using, for each search target document keyword that is similar to a keyword included in the query, the similarity to that search target document keyword and the weight of that search target document keyword.

A dictionary creation method in a dictionary creation device that includes a document database that is, for each search target document included in a search target document set, a combination of the document content of the search target document and the document ID of the search target document, a search index creation unit, a concept vector model creation unit, and a concept similarity dictionary creation unit, wherein:
the search index creation unit creates a search index that is, for each of the search target documents included in the search target document set, a combination of a search target document keyword included in the search target document, the weight of that search target document keyword, and the document ID representing the search target document;
the concept vector model creation unit creates, based on a concept document set for creating concept vectors in which words are represented as n-dimensional vectors, concept vectors, each being an n-dimensional vector that represents a concept document keyword included in a concept document; and
the concept similarity dictionary creation unit creates, based on the concept vectors and the search index, a concept similarity dictionary recording, for each of the search target documents and for each search target document keyword included in the search target document, the concept document keyword having the highest similarity together with that similarity.

A program for causing a computer to function as each unit of the information search device according to claim 1 or 2 or of the dictionary creation device according to claim 3, or for causing a computer to execute each step of the information search method according to claim 4 or 5 or of the dictionary creation method according to claim 6.