JP6334491B2

JP6334491B2 - Concept base generation device, concept search device, method, and program

Info

Publication number: JP6334491B2
Application number: JP2015197646A
Authority: JP
Inventors: 克人別所; 淳史大塚; 孝中村; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-10-05
Filing date: 2015-10-05
Publication date: 2018-05-30
Anticipated expiration: 2035-10-05
Also published as: JP2017072884A

Description

The present invention relates to a concept base generation device, a concept search device, a method, and a program for retrieving search target documents that conceptually match a search query entered by a user.

Concept search retrieves, from a set of search target documents (the documents to be searched), those that conceptually match a search query entered by a user.
In Non-Patent Document 1 below, a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of that word, is generated from a corpus. For each search target document, the word concept vectors that the word concept base holds for the words in the document are combined to generate a search target document concept vector, that is, the concept vector of the document. For a search query, the word concept vectors that the word concept base holds for the words in the query are combined in the same way to generate a search query concept vector, and the similarity between the search query concept vector and the concept vector of each search target document is calculated. As the search result, the search target documents are displayed ranked in descending order of similarity. Alternatively, the search target documents whose similarity is at or above a certain threshold are displayed.

Non-Patent Document 1: 別所克人, 内山俊郎, 内山匡, 片岡良治, 奥雅博, “単語・意味属性間共起に基づくコーパス概念ベースの生成方式,” 情報処理学会論文誌, Vol.49, No.12, pp.3997-4006, Dec. 2008.
(Katsuto Bessho, Toshiro Uchiyama, Kei Uchiyama, Ryoji Kataoka, Masahiro Oku, “Corpus Concept-Based Generation Based on Co-occurrence Between Words and Semantic Attributes,” Information Processing Society of Japan, Vol.49, No.12, pp.3997-4006, Dec. 2008.)

Suppose that a set of pairs is given, each pair consisting of a search query and a set of correct documents, that is, search target documents that conceptually match that query. This correct answer information is thought to have the potential to improve search accuracy, but conventional concept search techniques could not make use of it.

An object of the present invention is to provide a concept base generation device, a concept search device, a method, and a program that use this correct answer information to improve search accuracy.

To solve the above problem, a concept base generation device according to a first invention includes: learning means that takes as input a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, and a set B of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query, and that, for each correct document in the set B, updates that correct document in the set A by concatenating to it each of the search queries associated with it in the set B; and search target document concept base generation means that, for each search target document in the set A, generates a search target document concept vector, which is the concept vector of the search target document, by combining the word concept vectors that the word concept base holds for the words in the search target document, and that generates a search target document concept base storing the set of pairs of each search target document and its search target document concept vector.

A concept search device according to a second invention includes: a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of the word; a search target document concept base; and search means. The search target document concept base stores a set of pairs each consisting of a search target document and its search target document concept vector. The search target documents are those of a set A of search target documents, which are the documents to be searched, in which each correct document in a set B has been updated by concatenating to it each of the search queries associated with it in the set B, where the set B is a set of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query. Each search target document concept vector is the concept vector of a search target document in the updated set A, generated by combining the word concept vectors that the word concept base holds for the words in that search target document. The search means, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that the word concept base holds for the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A concept search device according to a third invention includes: learning means that takes as input a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, and a set B of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query, and that, for each correct document in the set B, updates that correct document in the set A by concatenating to it each of the search queries associated with it in the set B; search target document concept base generation means that, for each search target document in the set A, generates a search target document concept vector, which is the concept vector of the search target document, by combining the word concept vectors that the word concept base holds for the words in the search target document, and that generates a search target document concept base storing the set of pairs of each search target document and its search target document concept vector; and search means that, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that the word concept base holds for the words in the search query, and that calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A concept base generation method according to a fourth invention is a concept base generation method in a concept base generation device that includes a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, and search target document concept base generation means. The method includes: a step in which the learning means takes as input a set B of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query, and, for each correct document in the set B, updates that correct document in the set A by concatenating to it each of the search queries associated with it in the set B; and a step in which the search target document concept base generation means, for each search target document in the set A, generates a search target document concept vector, which is the concept vector of the search target document, by combining the word concept vectors that the word concept base holds for the words in the search target document, and generates a search target document concept base storing the set of pairs of each search target document and its search target document concept vector.

A concept search method according to a fifth invention is a concept search method in a concept search device that includes a word concept base, a search target document concept base, and search means. The word concept base is a set of pairs each consisting of a word and a word concept vector representing the concept of the word. The search target document concept base stores a set of pairs each consisting of a search target document and its search target document concept vector, where the search target documents are those of a set A of search target documents, which are the documents to be searched, in which each correct document in a set B has been updated by concatenating to it each of the search queries associated with it in the set B, the set B being a set of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query, and where each search target document concept vector is generated by combining the word concept vectors that the word concept base holds for the words in the search target document. The method includes a step in which the search means, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that the word concept base holds for the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A concept search method according to a sixth invention is a concept search method in a concept search device that includes a word concept base, which is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, search target document concept base generation means, and search means. The method includes: a step in which the learning means takes as input a set B of pairs each consisting of a search query and a set of correct documents, which are the search target documents in the set A that conceptually match the search query, and, for each correct document in the set B, updates that correct document in the set A by concatenating to it each of the search queries associated with it in the set B; a step in which the search target document concept base generation means, for each search target document in the set A, generates a search target document concept vector, which is the concept vector of the search target document, by combining the word concept vectors that the word concept base holds for the words in the search target document, and generates a search target document concept base storing the set of pairs of each search target document and its search target document concept vector; and a step in which the search means, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that the word concept base holds for the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

The program of the present invention is a program for causing a computer to function as each means of the above concept base generation device or the above concept search device, or for causing a computer to execute each step of the above concept base generation method or the above concept search method.

In the present invention, the processing by the learning means and the search target document concept base generation means constitutes pre-processing for search, and the processing by the search means constitutes the search processing itself.

According to the concept base generation device, concept search device, method, and program of the present invention, search accuracy can be improved by using correct answer information.

FIG. 1 is a block diagram showing the functional configuration of the concept search device according to the embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the search target document set.
FIG. 3 is a diagram showing a configuration example of the correct answer information.
FIG. 4 is a diagram showing a configuration example of the updated search target document set.
FIG. 5 is a diagram showing an example of the word concept base 26.
FIG. 6 is a diagram showing an example of the search target document concept base 30.
FIG. 7 is a flowchart showing the processing routine of the learning means and the search target document concept base generation means of the concept search device according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the processing routine of the search means of the concept search device according to the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

<Overview of the Embodiment of the Present Invention>
The learning means of the embodiment updates a search target document X so that it contains each of its corresponding search queries. The set of words in the updated search target document X therefore contains the set of words in a corresponding search query p. Consequently, the search target document concept vector that the search target document concept base generation means obtains by combining the concept vectors of the words in X moves closer, compared with before the update, to the search query concept vector that would be obtained by combining the concept vectors of the words in p (the generation means does not actually generate this query vector). When a new search query g that is conceptually close to p is input to the search means, the concept vector of g is close to the concept vector of p. The concept vector of X is therefore also closer to the concept vector of g than it was before the update. As a result, the similarity between the new search query g and the conceptually matching search target document X is higher than before the update.
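The geometric argument in this overview can be written out for the simplest case. This is only a sketch under stated assumptions: a single query p is appended to X, every word occurrence is added with weight 1, and the symbols $\tilde{x}$ and $\tilde{p}$ (introduced here for illustration, not taken from the original) denote the unnormalized sums of the word concept vectors of the original document text and of the query, with $\tilde{x}$, $\tilde{p}$, and $\tilde{x} + \tilde{p}$ all nonzero.

```latex
\[
v_X = \frac{\tilde{x}}{\lVert \tilde{x} \rVert}, \qquad
v_p = \frac{\tilde{p}}{\lVert \tilde{p} \rVert}, \qquad
v_X' = \frac{\tilde{x} + \tilde{p}}{\lVert \tilde{x} + \tilde{p} \rVert}
\]
% \tilde{x} + \tilde{p} is a positive combination of \tilde{x} and \tilde{p},
% so within the plane they span it lies in the angle between them.
\[
\angle(v_X', v_p) \le \angle(v_X, v_p)
\quad\Longrightarrow\quad
\cos(v_X', v_p) \ge \cos(v_X, v_p)
\]
```

When several queries are appended, the updated vector moves toward their combined sum, so this argument alone does not guarantee the inequality for each query individually.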

<Configuration of the Concept Search Device>
The configuration of the concept search device according to the embodiment of the present invention is described. FIG. 1 shows a configuration example of the concept search device of claim 3 of the present invention. As shown in FIG. 1, the concept search device 100 according to the embodiment can be implemented by a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the processing routines described later. Functionally, as shown in FIG. 1, the concept search device 100 includes input means 10, computing means 20, and output means 40.

The input means 10 receives as input a search target document set, which is a set of search target documents, and correct answer information.

FIG. 2 is a configuration example of the search target document set. Each record consists of a search target document ID, which uniquely identifies a search target document (a document to be searched), and the search target document text. The correct answer information is a set of pairs, each consisting of a search query and a set of correct documents, which are the search target documents in the search target document set that conceptually match the search query. Every correct document in a pair conceptually matches that pair's search query. FIG. 3 is a configuration example of the correct answer information. Each record consists of a search query text and the set of IDs of the correct documents that conceptually match it.
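The two record layouts of FIG. 2 and FIG. 3 could be represented as follows. This is a minimal sketch: the class names, field names, and the `validate` helper are hypothetical, and the sample associations for documents Y and Z are invented for illustration (the description only states that document X is associated with queries p, q, and s).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchTargetDocument:
    """One record of the search target document set (FIG. 2)."""
    doc_id: str   # uniquely identifies the search target document
    text: str     # search target document text


@dataclass(frozen=True)
class CorrectAnswerRecord:
    """One record of the correct answer information (FIG. 3)."""
    query_text: str
    correct_doc_ids: frozenset  # IDs of documents that conceptually match the query


def validate(documents, answers):
    """Return the correct document IDs that do not exist in the document set."""
    known = {d.doc_id for d in documents}
    missing = set()
    for record in answers:
        missing |= set(record.correct_doc_ids) - known
    return missing


# Illustrative data only: X is associated with p, q, and s as in the description;
# the entries for Y and Z are made up.
DOCUMENTS = [
    SearchTargetDocument("X", "x"),
    SearchTargetDocument("Y", "y"),
    SearchTargetDocument("Z", "z"),
]
ANSWERS = [
    CorrectAnswerRecord("p", frozenset({"X", "Y"})),
    CorrectAnswerRecord("q", frozenset({"X"})),
    CorrectAnswerRecord("s", frozenset({"X", "Z"})),
]
```

Checking that every correct document ID refers to a document in the set reflects the requirement that correct documents are drawn from the search target document set.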

The input means 10 also accepts a new search query.

The computing means 20 includes learning means 22, an updated search target document set database 24, a word concept base 26, search target document concept base generation means 28, a search target document concept base 30, and search means 32. The learning means 22, the word concept base 26, and the search target document concept base generation means 28 together are an example of the concept base generation device.

The learning means 22 updates each correct document in the search target document set by concatenating to it each of the search queries associated with it in the correct answer information. The details are as follows.

FIG. 4 is a configuration example of the updated search target document set. In the correct answer information of FIG. 3, the search queries associated with correct document X are text p, text q, and text s. Accordingly, as in the record for correct document X in FIG. 4, text p, text q, and text s are concatenated to text x of correct document X. When concatenating, a delimiter such as a line break or a space is inserted between the texts so that the word segmentation applied to the concatenated text can process each concatenated text separately. The same processing is performed for the other correct documents in FIG. 3 (Y, Z, ...).

If several of the texts to be concatenated consist of exactly the same character string, the second and subsequent such texts may be left out of the concatenation.
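The update performed by the learning means 22 can be sketched as below. The function name, parameter names, and the plain-dict data layout are assumptions made for illustration. A line break is used as the delimiter, and the optional duplicate check treats the original document text as one of the texts being concatenated, which is one possible reading of the description.

```python
def update_documents(documents, answers, separator="\n", skip_duplicates=True):
    """Concatenate each associated search query to its correct document.

    documents: dict mapping document ID -> text (the search target document set)
    answers:   list of (query_text, set_of_correct_doc_ids) pairs
    Returns a new dict; the input is left unchanged.
    """
    # Invert the correct answer information: document ID -> list of query texts.
    queries_by_doc = {}
    for query_text, doc_ids in answers:
        for doc_id in doc_ids:
            queries_by_doc.setdefault(doc_id, []).append(query_text)

    updated = dict(documents)
    for doc_id, queries in queries_by_doc.items():
        if doc_id not in documents:
            continue  # ignore IDs that are not in the document set
        parts = [documents[doc_id]]
        seen = {documents[doc_id]}
        for query_text in queries:
            # Optionally drop the second and later copies of an identical text.
            if skip_duplicates and query_text in seen:
                continue
            seen.add(query_text)
            parts.append(query_text)
        # The separator keeps the texts apart for later word segmentation.
        updated[doc_id] = separator.join(parts)
    return updated
```

Documents that never appear in the correct answer information pass through unchanged, so the updated set has the same IDs as the original set.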

In the correct answer information of FIG. 3, a set of correct document IDs is associated with each search query, but the information may instead be organized so that a set of search queries is associated with each correct document ID. In that case, the set of search query texts to be concatenated to each correct document is already available.

The learning means 22 also stores the updated search target document set in the updated search target document set database 24.

The search target document concept base generation means 28, for each search target document in the updated search target document set, generates a search target document concept vector, which is the concept vector of the search target document, by combining the word concept vectors that the word concept base 26 holds for the words in the search target document, and generates the search target document concept base 30, which stores the set of pairs of each search target document and its search target document concept vector. The details are as follows.

The word concept base 26 is a set of pairs each consisting of a word and a word concept vector representing the concept of the word. FIG. 5 is an example of the word concept base 26. The word concept base 26 is generated, for example, by the method of Non-Patent Document 1.

Only content words such as nouns, verbs, and adjectives may be registered in the word concept base 26. Words are registered in the word concept base 26 in their terminal (dictionary) form, and the word concept base 26 is looked up using the terminal form of a word.
The word concept vector of each word is a d-dimensional vector normalized to length 1, and the concept vectors of conceptually close words lie close to one another.

In its processing, the search target document concept base generation means 28 segments the updated text of each search target document into words. It looks up each word of the segmentation result in the word concept base 26, adds the word concept vectors obtained, and normalizes the resulting vector to length 1; this normalized vector is the concept vector of the search target document.

Here, the search target document concept vector may be generated using only the content words among the words in the segmentation result. When the same word occurs several times, the corresponding word concept vector may be added once per occurrence or added only once. In addition, each word concept vector obtained may be multiplied, before being added, by a weight that differs according to which concatenated text the corresponding word belongs to.
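The composition of a document concept vector, including the optional variations just described, might look like the sketch below. The function and parameter names are hypothetical. Word segmentation is assumed to have been done already, with one token list per concatenated text (the original document text first), and a word missing from the word concept base is simply skipped. When a repeated word is added only once, this sketch applies the weight of the text in which it first occurs, which is a choice the description does not specify.

```python
import math


def compose_document_vector(segments, word_vectors, weights=None, count_repeats=True):
    """Add word concept vectors and normalize the sum to length 1.

    segments:      list of token lists, one per concatenated text
    word_vectors:  non-empty dict mapping word -> d-dimensional vector
    weights:       optional per-segment weights (same length as segments)
    count_repeats: add a repeated word once per occurrence (True) or once only (False)
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    seen = set()
    for index, tokens in enumerate(segments):
        weight = 1.0 if weights is None else weights[index]
        for token in tokens:
            vector = word_vectors.get(token)
            if vector is None:
                continue  # word not registered in the word concept base
            if not count_repeats:
                if token in seen:
                    continue
                seen.add(token)
            for k in range(dim):
                total[k] += weight * vector[k]
    norm = math.sqrt(sum(c * c for c in total))
    if norm == 0.0:
        return total  # no known words: leave as the zero vector
    return [c / norm for c in total]
```

Giving the appended query texts a larger weight than the original text would pull the document vector more strongly toward the queries; how to set such weights is left open here.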

FIG. 6 is a configuration example of the search target document concept base 30. For each search target document, the pair of its ID and its search target document concept vector is registered as one record of the search target document concept base 30.

In its processing, the search means 32 segments the text of the new search query received by the input means 10 into words. It looks up each word of the segmentation result in the word concept base 26, adds the word concept vectors obtained, and normalizes the resulting vector to length 1; this normalized vector is the concept vector of the search query.

Here, the search query concept vector may be generated using only the content words among the words in the segmentation result. When the same word occurs several times, the corresponding word concept vector may be added once per occurrence or added only once.

For each search target document in the search target document concept base 30, the similarity between the search query concept vector and the concept vector of the search target document is calculated. Cosine similarity, for example, can be used as the similarity.
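A minimal sketch of the similarity calculation follows, with hypothetical function names and a plain dict standing in for the search target document concept base 30. Because both kinds of concept vector are normalized to length 1, cosine similarity reduces to the dot product; the general form is kept here so the function also behaves sensibly for unnormalized or zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_documents(query_vector, document_concept_base):
    """Map each document ID to its similarity with the query concept vector.

    document_concept_base: dict mapping document ID -> document concept vector
    """
    return {
        doc_id: cosine_similarity(query_vector, doc_vector)
        for doc_id, doc_vector in document_concept_base.items()
    }
```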

The output means 40 displays, as the search result, the search target documents ranked in descending order of similarity. Alternatively, it displays the search target documents whose similarity is at or above a certain threshold.
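The two output modes can be sketched with one hypothetical helper. The description presents ranking and thresholding as alternatives; combining them in a single function with an optional `threshold` argument is a design choice made here, as is breaking ties by document ID.

```python
def rank_results(scores, threshold=None):
    """Return (doc_id, similarity) pairs in descending order of similarity.

    scores:    dict mapping document ID -> similarity
    threshold: if given, keep only documents with similarity >= threshold
    """
    items = list(scores.items())
    if threshold is not None:
        items = [(doc_id, s) for doc_id, s in items if s >= threshold]
    # Sort by similarity (descending), then by ID so that ties are deterministic.
    return sorted(items, key=lambda pair: (-pair[1], pair[0]))
```

For example, `rank_results({"X": 0.9, "Y": 0.2, "Z": 0.6}, threshold=0.5)` keeps only X and Z, with X first.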

Note that, in the configuration of the present invention, it is of course also possible to skip the processing of the learning means 22, generate the search target document concept base 30 by running the search target document concept base generation means 28 on the search target document set before the update, and perform the processing of the search means 32 using that search target document concept base 30.

FIG. 7 is an example of the processing flow of the learning means 22 and the search target document concept base generation means 28. When the input means 10 receives the search target document set and the correct answer information, the search target document concept base generation processing routine shown in FIG. 7 is executed.

First, in step S100, the learning means 22 acquires the search target document set and the correct answer information received by the input means 10.

Then, in step S102, based on the search target document set and the correct answer information acquired in step S100, the learning means 22 updates each correct document in the search target document set by concatenating to it each of the search queries associated with it in the correct answer information, and stores the updated search target document set in the updated search target document set database 24.

In step S104, the search target document concept base generation means 28 segments into words the text of each updated search target document stored in the updated search target document set database 24 in step S102. For each updated search target document, the generation means 28 then looks up each word of the segmentation result in the word concept base 26, adds the word concept vectors obtained, and normalizes the resulting vector to length 1 to obtain the concept vector of the search target document. Finally, the generation means 28 stores the pair of the search target document's ID and its concept vector in the search target document concept base 30, and the routine ends.

FIG. 8 shows an example of the processing flow of the search means 32. When the input means 10 receives a new search query, the search processing routine shown in FIG. 8 is executed.

First, in step S200, the search means 32 acquires the new search query received by the input means 10.

Next, in step S202, the search means 32 splits the text of the new search query acquired in step S200 into words. The search means 32 then looks up each word of the word segmentation result in the word concept base 26, adds the word concept vectors obtained, and normalizes the resulting concept vector to length 1; the normalized vector is taken as the concept vector of the search query.

Next, in step S204, for each search target document in the search target document concept base 30, the search means 32 calculates the similarity between the concept vector of the new search query generated in step S202 and the concept vector of that search target document.
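This passage does not name the similarity measure. Assuming cosine similarity, which is a common choice and is an assumption here, the calculation simplifies because both concept vectors have been normalized to length 1. Writing \mathbf{v}_q for the search query concept vector and \mathbf{v}_d for a search target document concept vector (symbols introduced only for this illustration):

```latex
\mathrm{sim}(q, d)
  = \frac{\mathbf{v}_q \cdot \mathbf{v}_d}{\lVert \mathbf{v}_q \rVert \, \lVert \mathbf{v}_d \rVert}
  = \mathbf{v}_q \cdot \mathbf{v}_d
  = \sum_{i} v_{q,i} \, v_{d,i},
\qquad \text{since } \lVert \mathbf{v}_q \rVert = \lVert \mathbf{v}_d \rVert = 1 .
```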

Then, in step S206, the output means 40 displays, as the search result, the search target documents ranked in descending order of the similarity calculated in step S204. Alternatively, it displays the search target documents whose similarity is equal to or greater than a certain threshold.
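Steps S204 and S206 can be sketched together as below. The function name `search` and the data layout are hypothetical, and the inner product is used as the similarity on the assumption that cosine similarity is intended and that all vectors already have length 1.

```python
def search(query_vector, document_concept_base, threshold=None):
    """Score every document against the query and return results in ranked order.

    query_vector:          list of floats, assumed already normalized to length 1
    document_concept_base: dict mapping document ID -> unit-length concept vector
    threshold:             if given, keep only documents with similarity >= threshold
    Returns a list of (document_id, similarity) in descending order of similarity.
    """
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vector, vec)))
        for doc_id, vec in document_concept_base.items()
    ]
    if threshold is not None:
        scored = [(doc_id, s) for doc_id, s in scored if s >= threshold]
    # Descending order of similarity, as in step S206.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Calling `search(query_vector, base)` gives the full ranking, while passing `threshold=0.5`, for example, gives the alternative display that keeps only documents at or above that similarity.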

The processing described above can be implemented as a program, and the program can be installed from a communication line or a recording medium and executed by means such as a CPU.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to concept search technology that searches for search target documents conceptually matching a search query input by a user.

Description of Symbols
10 Input means
20 Calculation means
22 Learning means
24 Updated search target document set database
26 Word concept base
28 Search target document concept base generation means
30 Search target document concept base
32 Search means
40 Output means
100 Concept search device

Claims

A concept base generation device comprising:
a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word;
a set A of search target documents, which are the documents to be searched;
learning means for receiving as input a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, and for updating each correct document in the set A by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B; and
search target document concept base generation means for generating, for each search target document in the set A, a search target document concept vector, which is the concept vector of that search target document, by combining the word concept vectors in the word concept base that correspond to the words in the search target document, and for generating a search target document concept base that stores a set of pairs of the search target document and the search target document concept vector.

A concept search device comprising:
a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word;
a search target document concept base that stores a set of pairs of a search target document and a search target document concept vector, which is the concept vector of that search target document, wherein, given a set A of search target documents, which are the documents to be searched, and a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, each correct document in the set A has been updated by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B, and the search target document concept vector has been generated, for each search target document in the updated set A, by combining the word concept vectors in the word concept base that correspond to the words in the search target document; and
search means for generating, for a new search query, a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors in the word concept base that correspond to the words in the search query, and for calculating, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A concept search device comprising:
a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word;
a set A of search target documents, which are the documents to be searched;
learning means for receiving as input a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, and for updating each correct document in the set A by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B;
search target document concept base generation means for generating, for each search target document in the set A, a search target document concept vector, which is the concept vector of that search target document, by combining the word concept vectors in the word concept base that correspond to the words in the search target document, and for generating a search target document concept base that stores a set of pairs of the search target document and the search target document concept vector; and
search means for generating, for a new search query, a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors in the word concept base that correspond to the words in the search query, and for calculating, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A concept base generation method in a concept base generation device that includes a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, and search target document concept base generation means, the method comprising:
a step in which the learning means receives as input a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, and updates each correct document in the set A by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B; and
a step in which the search target document concept base generation means generates, for each search target document in the set A, a search target document concept vector, which is the concept vector of that search target document, by combining the word concept vectors in the word concept base that correspond to the words in the search target document, and generates a search target document concept base that stores a set of pairs of the search target document and the search target document concept vector.

A concept search method in a concept search device that includes: a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word; a search target document concept base that stores a set of pairs of a search target document and a search target document concept vector, which is the concept vector of that search target document, wherein, given a set A of search target documents, which are the documents to be searched, and a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, each correct document in the set A has been updated by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B, and the search target document concept vector has been generated, for each search target document in the updated set A, by combining the word concept vectors in the word concept base that correspond to the words in the search target document; and search means, the method comprising:
a step in which the search means generates, for a new search query, a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors in the word concept base that correspond to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A concept search method in a concept search device that includes a word concept base that is a set of pairs of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, search target document concept base generation means, and search means, the method comprising:
a step in which the learning means receives as input a set B of pairs of a search query and a set of correct documents, the correct documents being search target documents in the set A that conceptually match the search query, and updates each correct document in the set A by concatenating, to each correct document in the set B, each of the search queries associated with that correct document in the set B;
a step in which the search target document concept base generation means generates, for each search target document in the set A, a search target document concept vector, which is the concept vector of that search target document, by combining the word concept vectors in the word concept base that correspond to the words in the search target document, and generates a search target document concept base that stores a set of pairs of the search target document and the search target document concept vector; and
a step in which the search means generates, for a new search query, a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors in the word concept base that correspond to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A program for causing a computer to function as each means of the concept base generation device according to claim 1 or of the concept search device according to claim 2 or 3, or for causing a computer to execute each step of the concept base generation method according to claim 4 or of the concept search method according to claim 5 or 6.