JP2018180789A - Query clustering device, method, and program

Info

Publication number: JP2018180789A
Application number: JP2017077069A
Authority: JP
Inventors: Katsuto Bessho; Hisako Asano; Yoshihiro Matsuo
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2018-11-15
Anticipated expiration: 2037-04-07
Also published as: JP6722615B2

Abstract

PROBLEM TO BE SOLVED: To improve search accuracy.
SOLUTION: Search target document concept base generation means 22 receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document. For each element of list A, it generates, for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base holds for the words in the text. It then clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters, and generates a search target document concept base that stores a list of pairs of the search target document and list B.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a query clustering device, method, and program for retrieving search target documents that conceptually match a search query entered by a user.

Concept search retrieves, from a list of search target documents (the documents to be searched), those that semantically match a search query entered by a user.
In Non-Patent Document 1 below, a word concept base, which is a list of pairs of a word and a word concept vector representing the concept of that word, is generated from a corpus. For each search target document, a search target document concept vector (the concept vector of that document) is generated by combining the word concept vectors that the word concept base holds for the words in the document. For a search query, a search query concept vector is generated in the same way from the words in the query, and for each search target document the similarity between the search query concept vector and the concept vector of the document is calculated. As the search result, the search target documents are displayed ranked in descending order of similarity. Alternatively, only the search target documents whose similarity is at or above a certain threshold are displayed.

Non-Patent Document 1: Katsuto Bessho, Toshiro Uchiyama, Kei Uchiyama, Ryoji Kataoka, Masahiro Oku, "Generation Method of Corpus Concept Base Based on Co-occurrence of Words and Semantic Attributes," IPSJ Journal, Vol. 49, No. 12, pp. 3997-4006, Dec. 2008.

When multiple topics are mixed in a search target document, the single search target document concept vector generated by the conventional method above is an ambiguous vector that lies far from the concept vectors of the words related to each of those topics. Consequently, when a search query about one of the topics is entered, its similarity to the correct search target document is low, and search accuracy suffers.

An object of the present invention is to solve this problem and to provide a query clustering device, method, and program that improve search accuracy.

To solve the above problem, a query clustering device according to a first invention includes a word concept base, which is a list of pairs of a word and a word concept vector representing the concept of that word, and search target document concept base generation means. The search target document concept base generation means receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document. For each element of list A, it generates, for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base holds for the words in the text. It then clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters, and generates a search target document concept base that stores a list of pairs of the search target document and list B.

A query clustering device according to a second invention further includes search means. For a new search query, the search means generates a search query concept vector (the concept vector of the query) by combining the word concept vectors that the word concept base holds for the words in the query. For each search target document in the search target document concept base, it calculates the maximum of the similarities between the search query concept vector and each concept vector of that document, and takes this maximum as the similarity of the document.

A query clustering method according to a third invention is a query clustering method in a query clustering device that includes a word concept base, which is a list of pairs of a word and a word concept vector representing the concept of that word, and search target document concept base generation means. The method includes a step in which the search target document concept base generation means receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document; generates, for each element of list A and for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base holds for the words in the text; clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters; and generates a search target document concept base that stores a list of pairs of the search target document and list B.

A query clustering method according to a fourth invention is a query clustering method in which the query clustering device further includes search means. The method further includes a step in which the search means, for a new search query, generates a search query concept vector (the concept vector of the query) by combining the word concept vectors that the word concept base holds for the words in the query, and, for each search target document in the search target document concept base, calculates the maximum of the similarities between the search query concept vector and each concept vector of that document as the similarity of the document.

A program of the present invention is a program for causing a computer to function as each means of the query clustering device of the present invention.

In the present invention, the processing of the search target document concept base generation means is pre-processing performed before search, and the processing of the search means is the search processing itself.

The query clustering device, method, and program of the present invention can improve search accuracy.

FIG. 1 is an explanatory diagram for explaining the effect of the embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the query clustering device according to the embodiment of the present invention.
FIG. 3 is a diagram showing a configuration example of the search target document list.
FIG. 4 is a diagram showing a configuration example of list A, a list of pairs of a search target document and a list of search queries that semantically match that document.
FIG. 5 is a diagram showing an example of the word concept base 24.
FIG. 6 is a diagram showing a configuration example of the search target document concept base 26.
FIG. 7 is a flowchart showing the processing routine of the search target document concept base generation means of the query clustering device according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the processing routine of the search means of the query clustering device according to the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

<Overview of the embodiment of the present invention>
FIG. 1 is a diagram for explaining the effect of the present invention.

Suppose that two topics are mixed in a search target document G. In the embodiment of the present invention, the search target document concept base generation means clusters the list of text concept vectors corresponding to G. As a result, two clusters are formed and a concept vector is generated for each, giving concept vectors a and b. The conventional method instead generates a single search target document concept vector c located at or near the centroid of a and b.

Suppose that two topics are also mixed in another search target document H. In the embodiment of the present invention, the search target document concept base generation means clusters the list of text concept vectors corresponding to H. As a result, two clusters are formed and a concept vector is generated for each, giving concept vectors p and q. The conventional method instead generates a single search target document concept vector r located at or near the centroid of p and q.

When a search query about the topic corresponding to concept vector a is entered, search target document G should receive a higher similarity than search target document H. The concept vector x of this search query is plotted near concept vector a.

With the positional relationship shown in FIG. 1, concept vector r is closer to concept vector x than concept vector c is, so under the conventional method search target document H ends up with a higher similarity than search target document G.

In the embodiment of the present invention, for search target document G, concept vector a is more similar to concept vector x than concept vector b is, so the similarity to a becomes the similarity of G. For search target document H, concept vector p is more similar to x than concept vector q is, so the similarity to p becomes the similarity of H. Because a is more similar to x than p is, search target document G receives a higher similarity than search target document H.

In this way, the embodiment of the present invention generates, for each search target document and for each topic it contains, a cluster concept vector (the concept vector of the corresponding cluster), and takes the maximum similarity to these cluster concept vectors as the similarity of the document. A search target document that contains a topic semantically matching the search query therefore receives a high similarity, and search accuracy is higher than with the conventional method.
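The ranking reversal described in this overview can be checked with a small numerical sketch. The angles below are hypothetical 2-D positions chosen only to resemble the situation of FIG. 1; the centroid stands in for the single document vector of the conventional method, and cosine similarity is assumed as the similarity measure.

```python
import math

def unit(angle_deg):
    """Return the 2-D unit vector pointing at the given angle in degrees."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

# Hypothetical positions (not taken from the patent figures).
a, b = unit(0), unit(120)    # cluster concept vectors of document G
p, q = unit(30), unit(-60)   # cluster concept vectors of document H
x = unit(5)                  # query about the topic of a

# Conventional method: one vector per document (here, the centroid).
c, r = centroid([a, b]), centroid([p, q])
conventional = {"G": cosine(x, c), "H": cosine(x, r)}

# Embodiment: maximum similarity over the cluster concept vectors.
proposed = {"G": max(cosine(x, a), cosine(x, b)),
            "H": max(cosine(x, p), cosine(x, q))}

print("conventional:", conventional)
print("proposed:    ", proposed)
```

With these positions the single-vector scoring places H above G, while the maximum over cluster concept vectors places G above H, which is the behavior the overview argues for.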

<Configuration of the query clustering device>
The configuration of the query clustering device according to the embodiment of the present invention is described next. FIG. 2 shows a configuration example of the query clustering device of the present invention. As shown in FIG. 2, the query clustering device 100 according to the embodiment can be implemented as a computer that includes a CPU, a RAM, and a ROM storing various data and the programs for executing the processing routines described later. Functionally, as shown in FIG. 2, the query clustering device 100 includes input means 10, calculation means 20, and output means 30.

The input means 10 receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document. The input means 10 also receives new search queries.

The calculation means 20 includes search target document concept base generation means 22, a word concept base 24, a search target document concept base 26, and search means 28.

The search target document concept base generation means 22 receives as input the list A of pairs, each consisting of a search target document and a list of search queries that semantically match that document. For each element of list A, it generates, for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base 24 holds for the words in the text. It then clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters, and generates the search target document concept base 26, which stores a list of pairs of the search target document and list B. The details are described below.

FIG. 3 shows a configuration example of the search target document list. Each record consists of a search target document ID, which uniquely identifies a search target document (a document to be searched), and the search target document text.

FIG. 4 shows a configuration example of list A, the list of pairs of a search target document and a list of search queries that semantically match that document. Each record consists of a search target document ID and a list of corresponding texts. The list of corresponding texts consists of the text of the search target document and the search queries that semantically match it. As FIG. 4 illustrates, in the present embodiment each record is called a pair because it has two parts, the document text and the list of search queries, and the collection is called list A because it holds multiple such pairs. An element of list A is one record, and an element of the list of search queries is one search query.

The word concept base 24 is a list of pairs of a word and a word concept vector representing the concept of that word. FIG. 5 shows an example of the word concept base 24. The word concept base 24 is generated, for example, by the method of Non-Patent Document 1.

The word concept base 24 may register only content words such as nouns, verbs, and adjectives. Each word is registered in the word concept base 24 in its terminal (dictionary) form, and lookups in the word concept base 24 also use the terminal form of the word.

The word concept vector of each word is a d-dimensional vector, and the concept vectors of conceptually close words lie close to each other. Word concept vectors may be normalized to length 1.

In its processing, the search target document concept base generation means 22 first performs word segmentation, for each element of the list A received by the input means 10, on the search target document text and on each search query text in the element. For each text, it looks up each word of the segmentation result in the word concept base 24 and adds the retrieved word concept vectors. The sum is the text concept vector, that is, the concept vector of the text. Text concept vectors may be normalized to length 1.

Here, the text concept vector may be generated using only the content words among the words of the segmentation result. When the same word occurs more than once, its word concept vector may be added once per occurrence or only once.
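The summation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the flags, and the toy 2-D word concept base are hypothetical, the input is assumed to be already segmented into dictionary-form words (with any content-word filtering done by the caller), and words missing from the base are simply skipped, which the source does not specify.

```python
import math

def text_concept_vector(words, word_concept_base,
                        count_duplicates=True, normalize=True):
    """Sum the word concept vectors of `words` into a text concept vector.

    words: list of already-segmented words in dictionary form.
    word_concept_base: dict mapping a word to its d-dimensional vector.
    Words missing from the base are skipped (an assumption of this sketch).
    """
    if not count_duplicates:
        # Keep only the first occurrence of each word, preserving order.
        words = list(dict.fromkeys(words))
    dim = len(next(iter(word_concept_base.values())))
    total = [0.0] * dim
    for word in words:
        vector = word_concept_base.get(word)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
    if normalize:
        length = math.sqrt(sum(value * value for value in total))
        if length > 0.0:
            total = [value / length for value in total]
    return total

# Toy word concept base with made-up 2-D vectors.
toy_base = {"printer": [1.0, 0.0], "ink": [0.8, 0.6], "invoice": [0.0, 1.0]}
words = ["printer", "ink", "ink", "the"]

summed_all = text_concept_vector(words, toy_base, normalize=False)
summed_once = text_concept_vector(words, toy_base,
                                  count_duplicates=False, normalize=False)
unit_vector = text_concept_vector(words, toy_base)
```

The `count_duplicates` flag corresponds to the choice between adding a repeated word once per occurrence or only once, and `normalize` to the optional normalization to length 1.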

For the record with search target document ID X in FIG. 4, a text concept vector is generated for each of text x and search queries p, q, and s.

Next, for each element of list A, the search target document concept base generation means 22 clusters the list of generated text concept vectors. For the record with search target document ID X in FIG. 4, this means clustering the four text concept vectors generated from text x and search queries p, q, and s.

Various clustering methods can be used, such as Ward's method or the k-means method. Clustering produces clusters of text concept vectors that correspond to the topics contained in the element, and for each cluster a cluster concept vector (the concept vector of that cluster) is generated. Cluster concept vectors may be normalized to length 1. In this way, a list B of cluster concept vectors is generated for each element of list A.

Alternatively, the concept vector of the search target document text may be treated as a cluster concept vector on its own, with clustering applied only to the list of search query concept vectors. In that case, list B is the list of cluster concept vectors obtained by the clustering together with the concept vector of the search target document text.
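The two ways of building list B can be sketched with a small k-means routine. Several points here are assumptions rather than statements of the source: the cluster concept vector is taken to be the normalized mean of the cluster members, the number of clusters `k` is supplied by the caller, and a deterministic farthest-point initialization is used so that the example is reproducible. All function names are hypothetical.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v] if length > 0.0 else list(v)

def kmeans_centroids(vectors, k, max_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centers))
        centers.append(list(farthest))
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(v, centers[j]))
            for v in vectors]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are final
        assignment = new_assignment
        for j in range(k):
            members = [v for v, label in zip(vectors, assignment) if label == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return centers

def build_list_b(document_vector, query_vectors, k, keep_document_separate=False):
    """Return list B (normalized cluster concept vectors) for one element of list A."""
    if keep_document_separate:
        # Variant: the document vector is its own cluster; cluster only the queries.
        clustered = kmeans_centroids(query_vectors, min(k, len(query_vectors)))
        return [normalize(document_vector)] + [normalize(c) for c in clustered]
    vectors = [document_vector] + list(query_vectors)
    return [normalize(c) for c in kmeans_centroids(vectors, min(k, len(vectors)))]

# Toy text concept vectors: one document text and three search queries.
doc = [1.0, 0.0]
queries = [[0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
list_b = build_list_b(doc, queries, k=2)
list_b_separate = build_list_b(doc, queries, k=2, keep_document_separate=True)
```

In the first call the document vector is clustered together with the queries and two cluster concept vectors result; in the second the document vector is kept as its own entry and list B has three entries.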

For each element of list A, the search target document concept base generation means 22 stores the pair of the search target document ID and list B as one record in the search target document concept base 26. FIG. 6 shows a configuration example of the search target document concept base 26. For the record with search target document ID X in FIG. 4, clustering produces a list B consisting of three cluster concept vectors, and the pair of X and list B is stored as shown in FIG. 6.

For a new search query received by the input means 10, the search means 28 generates a search query concept vector (the concept vector of the query) by combining the word concept vectors that the word concept base 24 holds for the words in the query. For each search target document in the search target document concept base 26, it calculates the maximum of the similarities between the search query concept vector and each concept vector of that document as the similarity of the document. The details are described below.

In its processing, the search means 28 performs word segmentation on the new search query. It looks up each word of the segmentation result in the word concept base 24 and adds the retrieved word concept vectors. The sum is the search query concept vector, that is, the concept vector of the search query. The search query concept vector may be normalized to length 1.

Here, the search query concept vector may be generated using only the content words among the words of the segmentation result. When the same word occurs more than once, its word concept vector may be added once per occurrence or only once.

Next, for each search target document ID in the search target document concept base 26, the search means 28 calculates the similarity between the search query concept vector and each concept vector associated with that search target document ID. Cosine similarity, for example, can be used as the similarity. The maximum of the calculated similarities is taken as the similarity of the search target document ID.
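The document-level score described above can be written compactly. The notation is introduced here for illustration only: q is the search query concept vector, B_D is the list B stored for search target document D, and cosine similarity is assumed because the text names it as one possible measure.

```latex
\[
\operatorname{sim}(q, D) \;=\; \max_{v \in B_D} \cos(q, v),
\qquad
\cos(q, v) \;=\; \frac{q \cdot v}{\lVert q \rVert \, \lVert v \rVert}
\]
```

When the query vector and all cluster concept vectors are normalized to length 1, the cosine reduces to the dot product q · v.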

As the search result, the search means 28 displays the search target document IDs ranked in descending order of similarity. Alternatively, it displays only the search target document IDs whose similarity is at or above a certain threshold.
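The scoring and ranking performed by the search means can be sketched as below. This is an illustrative sketch under assumptions: the concept base is represented as a dictionary from document ID to list B, cosine similarity is used, every list B is non-empty, and the function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def search(query_vector, concept_base, threshold=None):
    """Rank document IDs by their best-matching cluster concept vector.

    concept_base: dict mapping a document ID to its list B of cluster vectors.
    threshold: if given, only documents scoring at or above it are returned.
    """
    scored = [(doc_id, max(cosine(query_vector, v) for v in list_b))
              for doc_id, list_b in concept_base.items()]
    if threshold is not None:
        scored = [(doc_id, score) for doc_id, score in scored
                  if score >= threshold]
    # Descending order of similarity.
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy concept base with made-up 2-D cluster concept vectors.
toy_concept_base = {
    "X": [[1.0, 0.0], [0.0, 1.0]],
    "Y": [[0.6, 0.8]],
    "Z": [[-1.0, 0.0]],
}
query = [0.0, 1.0]
ranked = search(query, toy_concept_base)
filtered = search(query, toy_concept_base, threshold=0.5)
```

Here `ranked` lists all three document IDs in descending order of similarity, and `filtered` keeps only those at or above the threshold, matching the two display options described above.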

The output means 30 then outputs the result obtained by the search means 28.

FIG. 7 shows an example of the processing flow of the search target document concept base generation means 22. When the input means 10 receives as input the list A of pairs of a search target document and a list of search queries that semantically match that document, the search target document concept base generation routine shown in FIG. 7 is executed.

First, in step S100, the search target document concept base generation means 22 performs word segmentation, for each element of list A, on the search target document text and on each search query text in the element.

Then, in step S102, the search target document concept base generation means 22 looks up, for each text, each word of the segmentation result in the word concept base 24 and adds the retrieved word concept vectors. The sum is the text concept vector, that is, the concept vector of the text.

In step S104, for each element of list A, the search target document concept base generation means 22 clusters the list of text concept vectors generated in step S102 to obtain list B.

In step S106, for each element of list A, the search target document concept base generation means 22 stores the pair of the search target document ID and list B as one record in the search target document concept base 26, and the search target document concept base generation routine ends.
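Steps S100 to S106 can be sketched as one loop over list A. The data layout and the callable interfaces below are hypothetical: list A is modeled as a list of (document ID, list of texts) pairs, the concept base as a dictionary, and the three helper functions are deliberately simple stand-ins (whitespace splitting instead of real word segmentation, and merging identical vectors instead of a real clustering method).

```python
def build_concept_base(list_a, segment, embed, cluster):
    """Run steps S100 to S106 over list A and return the document concept base.

    list_a: list of (document_id, [document_text, query_1, query_2, ...]) pairs.
    segment, embed, cluster: caller-supplied callables (hypothetical interfaces):
      segment(text)    -> list of words          (S100)
      embed(words)     -> text concept vector    (S102)
      cluster(vectors) -> list B of vectors      (S104)
    """
    concept_base = {}
    for document_id, texts in list_a:
        word_lists = [segment(text) for text in texts]          # S100
        text_vectors = [embed(words) for words in word_lists]   # S102
        list_b = cluster(text_vectors)                          # S104
        concept_base[document_id] = list_b                      # S106
    return concept_base

# Toy stand-ins so the sketch runs end to end.
TOY_BASE = {"printer": [1.0, 0.0], "ink": [1.0, 0.0], "invoice": [0.0, 1.0]}

def toy_segment(text):
    # Stand-in for word segmentation: split on whitespace.
    return text.lower().split()

def toy_embed(words):
    # Sum the vectors of known words; unknown words are skipped.
    vectors = [TOY_BASE[w] for w in words if w in TOY_BASE]
    return [sum(v[i] for v in vectors) for i in range(2)]

def toy_cluster(vectors):
    # Stand-in for real clustering: merge identical vectors only.
    unique = []
    for v in vectors:
        if v not in unique:
            unique.append(v)
    return unique

list_a = [("X", ["printer ink invoice", "printer ink", "ink printer", "invoice"])]
concept_base = build_concept_base(list_a, toy_segment, toy_embed, toy_cluster)
print(concept_base)
```

Passing the three steps in as callables keeps the loop independent of the particular segmenter, word concept base, and clustering method, all of which the description leaves open.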

FIG. 8 shows an example of the processing flow of the search means 28. When the input means 10 receives a new search query, the search routine shown in FIG. 8 is executed.

First, in step S200, the search means 28 performs word segmentation on the new search query.

Next, in step S202, the search means 28 looks up each word of the segmentation result obtained in step S200 in the word concept base 24 and adds the retrieved word concept vectors. The sum is the search query concept vector, that is, the concept vector of the search query.

Next, in step S204, for each search target document ID in the search target document concept base 26, the search means 28 calculates the similarity between the search query concept vector obtained in step S202 and each concept vector associated with that search target document ID. The maximum of the calculated similarities is taken as the similarity of the search target document ID.

Then, in step S206, the search means 28 displays, as the search result, the search target document IDs ranked in descending order of the similarity obtained in step S204. Alternatively, it displays only the search target document IDs whose similarity is at or above a certain threshold.

The output means 30 outputs the result obtained in step S206.

The processing described so far can be built as a program, installed from a communication line or a recording medium, and executed by means such as a CPU.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to concept search technology that retrieves search target documents conceptually matching a search query entered by a user.

Description of symbols
10 input means
20 calculation means
22 search target document concept base generation means
24 word concept base
26 search target document concept base
28 search means
30 output means
100 query clustering device

Claims

1. A query clustering device comprising:
a word concept base, which is a list of pairs of a word and a word concept vector representing the concept of that word; and
search target document concept base generation means that receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document; generates, for each element of list A and for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base holds for the words in the text; clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters; and generates a search target document concept base storing a list of pairs of the search target document and list B.

2. The query clustering device according to claim 1, further comprising search means that, for a new search query, generates a search query concept vector (the concept vector of the query) by combining the word concept vectors that the word concept base holds for the words in the query, and, for each search target document in the search target document concept base, calculates the maximum of the similarities between the search query concept vector and each concept vector of that document as the similarity of the document.

3. A query clustering method in a query clustering device that includes a word concept base, which is a list of pairs of a word and a word concept vector representing the concept of that word, and search target document concept base generation means, the method comprising:
a step in which the search target document concept base generation means receives as input a list A of pairs, each consisting of a search target document (a document to be searched) and a list of search queries that semantically match that document; generates, for each element of list A and for each text of the search target document and the search queries in the element, a text concept vector (the concept vector of the text) by combining the word concept vectors that the word concept base holds for the words in the text; clusters the list of text concept vectors to generate a list B of concept vectors of the resulting clusters; and generates a search target document concept base storing a list of pairs of the search target document and list B.

4. The query clustering method according to claim 3, wherein the query clustering device further includes search means, the method further comprising:
a step in which the search means, for a new search query, generates a search query concept vector (the concept vector of the query) by combining the word concept vectors that the word concept base holds for the words in the query, and, for each search target document in the search target document concept base, calculates the maximum of the similarities between the search query concept vector and each concept vector of that document as the similarity of the document.

5. A program for causing a computer to function as each means of the query clustering device according to claim 1 or claim 2.