JP2017068742A

JP2017068742A - Relevant document retrieval device, model creation device, method and program therefor

Info

Publication number: JP2017068742A
Application number: JP2015195860A
Authority: JP
Inventors: 孝中村; Takashi Nakamura; 克人別所; Katsuto Bessho; 淳史大塚; Atsushi Otsuka
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-10-01
Filing date: 2015-10-01
Publication date: 2017-04-06
Anticipated expiration: 2035-10-01
Also published as: JP6426074B2

Abstract

PROBLEM TO BE SOLVED: To provide a related document search technique that enables a more accurate search than the prior art.

SOLUTION: A related document search device comprises: a feature amount extraction unit 3 that extracts a similarity feature amount group for an input sentence and each search target document, where the similarity feature amount group for a given sentence and a given search target document is a plurality of different feature amounts each representing the similarity between that sentence and that search target document; a similarity score model storage unit 4 that stores a similarity score model representing the relationship between a similarity feature amount group and the similarity score corresponding to that group; and a similarity score calculation unit 5 that calculates a similarity score for the input sentence and each search target document using the similarity score model stored in the similarity score model storage unit 4 and the extracted similarity feature amount group.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for searching for documents related to an input sentence.

Techniques for searching for documents related to an input sentence fall broadly into two classes: (1) techniques that extract the words (keywords) to be used for the search from the input natural-language sentence and look for documents that match those keywords, and (2) techniques that calculate the relevance between the entire input sentence and each search target document and look for documents with high relevance. Hereinafter, (1) is referred to as the keyword matching method and (2) as the sentence similarity determination method.

In the keyword matching method, as in Non-Patent Document 1, an inverted index is prepared in advance, and documents containing the keywords associated with the input sentence are found by looking up that index. The keywords may also be expanded in terms of similarity, spelling variation, and the like, so that documents containing the expanded keywords are retrieved in addition to those containing the original keywords, which enables a search with higher recall.
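The keyword lookup described above can be sketched as follows. This is only an illustration under stated assumptions: whitespace tokenization, the function names `build_inverted_index` and `lookup`, and the `expansions` dictionary are all hypothetical and do not come from Non-Patent Document 1 or from this publication.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of IDs of the documents that contain it.

    `documents` is a dict {doc_id: text}. Lower-casing and whitespace
    tokenization are simplifying assumptions for this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, keywords, expansions=None):
    """Return IDs of documents containing any keyword or any of its expansions.

    `expansions` (hypothetical) maps a keyword to a list of related terms,
    standing in for the keyword expansion step.
    """
    expansions = expansions or {}
    hits = set()
    for keyword in keywords:
        for term in [keyword, *expansions.get(keyword, [])]:
            hits |= index.get(term.lower(), set())
    return hits
```

Passing an expansion such as `{"wireless": ["wifi"]}` lets a query for "wireless" also reach documents that only mention "wifi", which is the recall gain the paragraph refers to.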

In the sentence similarity determination method, as in Patent Document 1, the input sentence and the search target documents are converted into concept vectors, and the search is performed by taking the document similarity to be the closeness between the concept vectors (obtained as a cosine measure).
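As a minimal sketch of the cosine measure mentioned above, the function below compares two vectors of equal length. How the concept vectors are produced is outside this sketch; the function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (an assumption of this
    sketch, since the cosine is undefined in that case).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```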

[Non-Patent Document 1] 検索エンジンの仕組みと技術の発展 (Search engine mechanisms and technology development), 情報の科学と技術 (Information Science and Technology) 54(2), 66-71, 2004-02-01

[Patent Document 1] JP 2007-317132 A

However, although the keyword matching method uses the keywords contained in the input sentence, the expanded keywords, and so on, it attends only to keywords: it looks at local words without considering the meaning of the document as a whole. As a result, it may judge documents whose meaning is only weakly related to be similar.

The sentence similarity determination method, on the other hand, evaluates the closeness of whole documents. If a document as a whole covers multiple meanings or topics, or contains ancillary sentences (greetings, preambles, explanations of special terms, and the like), the concept of the document becomes blurred and its relevance may not be judged correctly.

An object of the present invention is to provide a related document search device, a model creation device, and methods and programs therefor that enable a more accurate search than before.

A related document search device according to one aspect of the present invention comprises: a feature amount extraction unit that extracts a similarity feature amount group for an input sentence and each search target document, where the similarity feature amount group for a given sentence and a given search target document is defined as a plurality of different feature amounts each representing the similarity between that sentence and that search target document; a similarity score model storage unit that stores a similarity score model representing the relationship between a similarity feature amount group and the similarity score corresponding to that group; and a similarity score calculation unit that calculates a similarity score for the input sentence and each search target document using the similarity score model stored in the similarity score model storage unit and the extracted similarity feature amount group.

A model creation device according to one aspect of the present invention comprises: a feature amount extraction unit that extracts a similarity feature amount group for an input sentence and each document, where the similarity feature amount group for a given sentence and a given document is defined as a plurality of different feature amounts each representing the similarity between that sentence and that document; and a creation unit that, given that the similarity score between the input sentence and each document is determined in advance, creates a similarity score model representing the relationship between a similarity feature amount group and the similarity score corresponding to that group by performing regression analysis with the extracted similarity feature amount group as the explanatory variables and the similarity score corresponding to the extracted similarity feature amount group as the objective variable.

A more accurate search than before becomes possible.

FIG. 1 is a block diagram illustrating an example of the related document search device.
FIG. 2 is a flowchart illustrating an example of the related document search method.
FIG. 3 is a block diagram illustrating an example of the model creation device.
FIG. 4 is a flowchart illustrating an example of the model creation method.
FIG. 5 is a diagram showing an example of the information about search target documents stored in the search target document storage unit 1.
FIG. 6 is a diagram showing an example of the first feature amount.
FIG. 7 is a diagram showing an example of the second feature amount.
FIG. 8 is a diagram showing an example of the similarity feature amount group extracted by the feature amount extraction unit 3.
FIG. 9 is a diagram showing an example of the word pairs stored in the output condition storage unit 6.
FIG. 10 is a diagram showing an example of the plurality of documents and similarity scores stored in the learning document storage unit 10.
FIG. 11 is a diagram showing an example of the similarity feature amount group extracted by the feature amount extraction unit 8.

[Related Document Search Device and Method]
An embodiment of the related document search device and method is described below with reference to the drawings. As shown in FIG. 1, the related document search device includes, for example, a search target document storage unit 1, a search target document narrowing unit 2, a feature amount extraction unit 3, a similarity score model storage unit 4, a similarity score calculation unit 5, an output condition storage unit 6, and an output unit 7. The related document search method is realized by the units of the related document search device performing the processing of the steps in FIG. 2.

<Search target document storage unit 1>
The search target document storage unit 1 stores a plurality of search target documents.

Each search target document is stored in the search target document storage unit 1 together with a search target document ID, which is an identifier for identifying that document.

<Search target document narrowing unit 2>
The search target document narrowing unit 2 receives the input sentence and the search target documents read from the search target document storage unit 1.

The search target document narrowing unit 2 determines the category of the input sentence and selects, from the plurality of search target documents stored in the search target document storage unit 1, the search target documents of the determined category (step S2).

First, the search target document narrowing unit 2 extracts the category of the input sentence using, for example, the text pattern extraction technique, topic estimation technique, and multi-class classification technique listed below. Of course, the search target document narrowing unit 2 may extract the category of the input sentence using other category determination techniques.

For an example of a text pattern extraction technique, see Reference 1.

[Reference 1] Nippon Telegraph and Telephone Corporation, "Text knowledge extraction technology 'Rich Indexer'", [online], [retrieved September 24, 2015], Internet <URL: http://www.ntt.co.jp/svlab/activity/category_2/product2_07.html>

For examples of topic estimation techniques (such as LDA), see Reference 2.

[Reference 2] David M. Blei and two others, "Latent Dirichlet Allocation", [online], Journal of Machine Learning Research 3 (2003) 993-1022, [retrieved September 24, 2015], Internet <URL: https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf>

For examples of multi-class classification techniques (multilayer perceptron, SVC (SVM), and the like), see Reference 3.

[Reference 3] Asa Ben-Hur and three others, "Support Vector Clustering", [online], Journal of Machine Learning Research 2 (2001) 125-137, [retrieved September 24, 2015], Internet <URL: http://www.jmlr.org/papers/volume2/horn01a/rev1/horn01ar1.pdf>

When the processing of the search target document narrowing unit 2 is performed, it is assumed that, as illustrated in FIG. 5, the category of each search target document is determined in advance and stored in the search target document storage unit 1 together with the search target document ID.

The search target document narrowing unit 2 selects, from the plurality of search target documents stored in the search target document storage unit 1, the search target documents whose category is the same as the category determined for the input sentence, and outputs the search target document IDs of the selected documents.

Narrowing down the search targets reduces search processing on unnecessary documents and so makes the search more efficient, and excluding documents on different topics from the search targets can be expected to improve accuracy.
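The selection step of the narrowing unit can be sketched as a simple filter. The category of the input is assumed to have been determined already by one of the techniques referenced above; the record layout (`"category"`, `"text"`) and the function name `narrow_by_category` are hypothetical.

```python
def narrow_by_category(input_category, stored_documents):
    """Return the IDs of stored documents whose category matches the input's.

    `stored_documents` is a dict {doc_id: {"category": str, "text": str}},
    a hypothetical layout for the storage contents (ID plus a category
    assigned in advance).
    """
    return [
        doc_id
        for doc_id, record in stored_documents.items()
        if record["category"] == input_category
    ]
```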

<Feature amount extraction unit 3>
The feature amount extraction unit 3 receives the input sentence and the search target documents selected by the search target document narrowing unit 2. The selected search target documents are identified by the search target document IDs output by the search target document narrowing unit 2.

The feature amount extraction unit 3 extracts a similarity feature amount group for the input sentence and each search target document selected by the search target document narrowing unit 2 (step S3). The extracted similarity feature amount group is output to the similarity score calculation unit 5 together with, for example, the corresponding search target document ID.

Here, the similarity feature amount group for a given sentence and a given search target document is defined as a plurality of different feature amounts each representing the similarity between that sentence and that search target document. In other words, the similarity feature amount group is a set of feature amounts obtained by a plurality of techniques with different properties.

For example, the similarity feature amount group can be the pair consisting of an inter-document feature amount obtained by a first technique (for example, the keyword matching method) and an inter-document feature amount obtained by a second technique whose properties differ from those of the first (for example, the sentence similarity determination method). The processing of the feature amount extraction unit 3 is described below using this example. This is, of course, only an example: a set of inter-document feature amounts obtained by each of three or more techniques may be used as the similarity feature amount group. Nor are the first and second techniques limited to the keyword matching method and the sentence similarity determination method; any other technique with different properties that can calculate information indicating the similarity between documents may be used.

The first calculation unit 31 of the feature amount extraction unit 3 calculates, based on the first technique, a first feature amount representing the similarity between the input sentence and each search target document selected by the search target document narrowing unit 2.

FIG. 6 shows an example of the first feature amount, namely the first feature amounts between the input sentence and the search target document of each search target document ID. In the example of FIG. 6, the first feature amount is a vector of three elements. As in this example, the feature amount, the first feature amount, and the second feature amount may each be a vector composed of a plurality of elements.

Likewise, the second calculation unit 32 of the feature amount extraction unit 3 calculates, based on the second technique, a second feature amount representing the similarity between the input sentence and each search target document selected by the search target document narrowing unit 2.

FIG. 7 shows an example of the second feature amount, namely the second feature amounts between the input sentence and the search target document of each search target document ID.

The feature amount extraction unit 3 then combines the first feature amount and the second feature amount to form the similarity feature amount group.

When combining them, the feature amount extraction unit 3 may normalize each of the first and second feature amounts and use the normalized first and second feature amounts as the similarity feature amount group. The normalization is performed, for example, on each element of the first feature amount and the second feature amount.

FIG. 8 shows an example of the similarity feature amount group obtained by combining the feature amount obtained by normalizing the first feature amount of FIG. 6 with the feature amount obtained by normalizing the second feature amount of FIG. 7.

An example of the normalization processing is described below. Normalization is performed using, for example, one of the normalization functions f(x) listed below. Which normalization is applied to each feature amount (or whether none is applied) is determined in advance. When a feature amount is composed of one or more elements, the normalization function is applied to each element of the feature amount. In the following formulas, x denotes the value of an element of the feature amount before normalization, and a, σ₁, σ₂, and σ₃ are constants.
f(x) = tanh(x)
f(x) = 1/(1 + e^(-ax))
f(x) = σ₁/(σ₂ + σ₃|x|)
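The three normalization functions and the element-wise normalize-then-combine step can be sketched as below. The function names, the default constant values of 1.0, and the helper `build_feature_group` are assumptions made for illustration; the publication only states that the constants and the choice of function per feature amount are fixed in advance.

```python
import math


def norm_tanh(x):
    """f(x) = tanh(x)."""
    return math.tanh(x)


def norm_sigmoid(x, a=1.0):
    """f(x) = 1 / (1 + e^(-a x)). The default a=1.0 is an assumed value.

    math.exp raises OverflowError for very large arguments, so this
    sketch assumes moderate values of -a * x.
    """
    return 1.0 / (1.0 + math.exp(-a * x))


def norm_reciprocal(x, s1=1.0, s2=1.0, s3=1.0):
    """f(x) = s1 / (s2 + s3 |x|). The defaults of 1.0 are assumed values."""
    return s1 / (s2 + s3 * abs(x))


def build_feature_group(first, second, first_norm=None, second_norm=None):
    """Normalize each element of both feature vectors, then concatenate.

    Passing None for a normalizer leaves that feature amount unchanged,
    mirroring the option of applying no normalization.
    """
    f1 = [first_norm(x) if first_norm else x for x in first]
    f2 = [second_norm(x) if second_norm else x for x in second]
    return f1 + f2
```

For instance, `build_feature_group([0.0, 2.0], [3.0], norm_tanh, norm_reciprocal)` yields a three-element group whose first two elements come from the first technique and whose last element comes from the second.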

<Similarity score model storage unit 4>
The similarity score model storage unit 4 stores a similarity score model representing the relationship between a similarity feature amount group and the similarity score corresponding to that group.

The similarity score model is, for example, a linear regression model. Of course, it may be a regression model other than a linear regression model.

The similarity score model is created by the model creation device and method described later with reference to FIGS. 3 and 4.

<Similarity score calculation unit 5>
The similarity score calculation unit 5 receives the similarity feature amount group corresponding to each search target document, extracted by the feature amount extraction unit 3, and the similarity score model read from the similarity score model storage unit 4. Each search target document is identified by its search target document ID.

The similarity score calculation unit 5 calculates a similarity score for the input sentence and each search target document using the similarity score model stored in the similarity score model storage unit 4 and the similarity feature amount group extracted by the feature amount extraction unit 3 (step S5). The similarity score is an index representing how high the similarity is between the documents for which similarity is being calculated (including the case of a sentence and a document). The similarity score can also be regarded as an integration of the similarity feature amount group calculated by a plurality of techniques with different properties.

The similarity score model consists of, for example, regression coefficients obtained by regression analysis. In this case, the similarity score calculation unit 5 calculates the output value obtained when the similarity feature amount group is input to the expression specified by those regression coefficients, and uses the result as the similarity score. When the similarity score model is a linear regression model, the similarity score calculation unit 5 calculates the inner product of the similarity feature amount group, which is a vector, and the regression coefficients, which are also a vector, and uses the result as the similarity score.
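For the linear regression case, the score computation reduces to an inner product, as in the sketch below. The function name is hypothetical, and the optional `intercept` argument is an assumption added for generality; with its default of 0.0 the function is exactly the inner product described above.

```python
def similarity_score(feature_group, coefficients, intercept=0.0):
    """Inner product of the feature group and the regression coefficients.

    `intercept` is not mentioned in the text; it defaults to 0.0 so that
    the result is the plain inner product.
    """
    if len(feature_group) != len(coefficients):
        raise ValueError("feature group and coefficients must have the same length")
    return intercept + sum(x * w for x, w in zip(feature_group, coefficients))
```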

The similarity score calculation unit 5 outputs pairs of a similarity score and the corresponding search target document ID to the output unit 7. In doing so, it may sort the pairs in descending or ascending order of similarity score and output them in the sorted order.

Alternatively, a ranking learner, which is an existing technique, may be trained in advance, and the similarity score calculation unit 5 may input the similarity feature amount group to the learner and use the similarity it outputs as the similarity score. In this case, the ranking learner corresponds to the similarity score model.

The weights of the feature amounts constituting the similarity feature amount group may also be determined manually, and the similarity score calculation unit 5 may use the inner product of that weight vector and the similarity feature amount group as the similarity score. In this case, the manually determined feature amount weights correspond to the similarity score model.

<Output condition storage unit 6>
The output condition storage unit 6 stores the output conditions used by the output unit 7.

<Output unit 7>
The output unit 7 receives the pairs of a similarity score and the corresponding search target document ID output by the similarity score calculation unit 5.

The output unit 7 outputs information about search target documents with high similarity scores (step S7). The information about a search target document is, for example, its search target document ID; it may also be the search target document itself.

The output unit 7 may read the output conditions from the output condition storage unit 6 and output information about the search target documents that satisfy them.

For example, the output unit 7 may output information about the α search target documents with the highest similarity scores. In this case, the information "top α by similarity score" is the output condition (hereinafter, output condition (1)). α is an integer of 1 or more.

The output unit 7 may also output information about search target documents whose similarity score calculated by the similarity score calculation unit 5 is equal to or greater than a predetermined threshold. In this case, the information "equal to or greater than the predetermined threshold" is the output condition (hereinafter, output condition (2)). When the scores are defined so that a smaller similarity score means higher similarity, the output unit 7 may instead output information about search target documents whose similarity score is equal to or less than a predetermined threshold.
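Output conditions (1) and (2) can be sketched as a sort followed by optional filtering. The function name, the `(doc_id, score)` tuple layout, and the `higher_is_better` flag are assumptions for illustration; the flag covers the case where a smaller score means higher similarity.

```python
def select_by_score(scored, top_alpha=None, threshold=None, higher_is_better=True):
    """Apply a threshold and/or a top-alpha cut to (doc_id, score) pairs.

    Pairs are returned best first. With higher_is_better=False the
    threshold test is reversed (score <= threshold).
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)
    if threshold is not None:
        if higher_is_better:
            ranked = [pair for pair in ranked if pair[1] >= threshold]
        else:
            ranked = [pair for pair in ranked if pair[1] <= threshold]
    if top_alpha is not None:
        ranked = ranked[:top_alpha]
    return ranked
```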

Furthermore, suppose that a plurality of word pairs are stored in the output condition storage unit 6 as so-called NG words. FIG. 9 shows an example of the word pairs stored in the output condition storage unit 6.

In this case, the output unit 7 may refrain from outputting a search target document when the word pairs stored in the output condition storage unit 6 include a pair such that one word of the pair is contained in the input sentence and the other word of the pair is contained in that search target document. In other words, when a pair consisting of a word contained in the input sentence and a word contained in the search target document is stored in the output condition storage unit 6, that search target document may be withheld from output.

In this case, the condition "a search target document is not output when the word pairs stored in the output condition storage unit 6 include a pair such that one word of the pair is contained in the input sentence and the other word of the pair is contained in that search target document" is the output condition (hereinafter, output condition (3)).

Output condition (3) can be combined with output condition (1) or (2). That is, among the search target documents that satisfy output condition (1) or (2), only those that do not match a stored word pair under output condition (3) may be output, and those that do match are not output.
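The word-pair exclusion of output condition (3), applied to candidates that already passed condition (1) or (2), can be sketched as follows. Representing the input and each document as a set of words, and the names `matches_ng_pair` and `apply_ng_filter`, are assumptions for illustration; how words are extracted from text is not covered here.

```python
def matches_ng_pair(input_words, doc_words, ng_pairs):
    """True if some stored pair has one word in the input and the other in the document."""
    for first, second in ng_pairs:
        if (first in input_words and second in doc_words) or (
            second in input_words and first in doc_words
        ):
            return True
    return False


def apply_ng_filter(candidate_ids, input_words, doc_words_by_id, ng_pairs):
    """Keep only candidates that do not match any stored word pair.

    `candidate_ids` is assumed to hold the documents that already satisfy
    output condition (1) or (2); their order is preserved.
    """
    return [
        doc_id
        for doc_id in candidate_ids
        if not matches_ng_pair(input_words, doc_words_by_id[doc_id], ng_pairs)
    ]
```

With a hypothetical stored pair `("android", "ios")`, an input mentioning "android" would suppress a candidate document that mentions "ios", even if that document scored highly.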

In this way, obtaining the final similarity score by taking into account two or more kinds of feature amounts extracted by a plurality of techniques with different properties enables a more accurate search than before.

[Model Creation Device and Method]
An embodiment of the model creation device and method is described below with reference to the drawings. As shown in FIG. 3, the model creation device includes, for example, a learning document storage unit 10, a feature amount extraction unit 8, and a creation unit 9. The model creation method is realized by the units of the model creation device performing the processing of the steps in FIG. 4.

<Learning document storage unit 10>
The learning document storage unit 10 stores a plurality of documents. A similarity score is associated with each pair of two different documents among them. This similarity score is determined in advance, for example manually. The plurality of documents may be the same as or different from the search target documents stored in the search target document storage unit 1. Documents with high mutual similarity may be stored as the plurality of documents.

FIG. 10 shows an example of the plurality of documents and similarity scores stored in the learning document storage unit 10. In the example of FIG. 10, each document is given a document ID as an identifier. The upper rows of FIG. 10 indicate that the similarity score for the documents with document IDs qid00001 and pid00001 is 3, and that the similarity score for the documents with document IDs qid00001 and pid00003 is 1. In this example, the scores are set so that a larger similarity score means higher similarity.

＜特徴量抽出部８＞ <Feature quantity extraction unit 8>
特徴量抽出部８は、異なる２個の文書についての類似度特徴量群を抽出する（ステップ８）。抽出された類似度特徴量群は、作成部９に出力される。
The feature quantity extraction unit 8 extracts a similarity feature quantity group for two different documents (step S8). The extracted similarity feature quantity group is output to the creation unit 9.

特徴量抽出部８の処理は、上記説明した特徴量抽出部３の処理と同様である。すなわち、特徴量抽出部３が、入力された文章と各検索対象文書とについての類似度特徴量群を抽出した処理と同様の処理により、異なる２個の文書のそれぞれについての類似度特徴量群を抽出する。言い換えれば、特徴量抽出部８は、特徴量抽出部３が抽出した類似度特徴量群と同じ類似度特徴量群を抽出する。特徴量抽出部８の第一計算部８１の処理は特徴量抽出部３の第一計算部３１の処理と同様であり、特徴量抽出部８の第二計算部８２の処理は特徴量抽出部３の第二計算部３２の処理と同様である。 The processing of the feature quantity extraction unit 8 is the same as the processing of the feature quantity extraction unit 3 described above. That is, the feature quantity extraction unit 8 extracts a similarity feature quantity group for each pair of two different documents by the same processing as that by which the feature quantity extraction unit 3 extracts the similarity feature quantity group for the input sentence and each search target document. In other words, the feature quantity extraction unit 8 extracts the same similarity feature quantity group as that extracted by the feature quantity extraction unit 3. The processing of the first calculation unit 81 of the feature quantity extraction unit 8 is the same as that of the first calculation unit 31 of the feature quantity extraction unit 3, and the processing of the second calculation unit 82 of the feature quantity extraction unit 8 is the same as that of the second calculation unit 32 of the feature quantity extraction unit 3.

図１１に、特徴量抽出部８により抽出された類似度特徴量群の例を示す。 FIG. 11 shows an example of the similarity feature quantity group extracted by the feature quantity extraction unit 8.

＜作成部９＞ <Creation unit 9>
作成部９は、抽出された類似度特徴量群を説明変数とし、特徴量抽出部８で抽出された類似度特徴量群に対応する類似度スコアを目的変数として回帰分析をすることにより、類似度特徴量群とその類似度特徴量群に対応する類似度スコアとの関係を表す類似度スコアモデルを作成する（ステップＳ９）。
The creation unit 9 creates a similarity score model representing the relationship between a similarity feature quantity group and the similarity score corresponding to that group (step S9). It does so by performing regression analysis with the extracted similarity feature quantity group as the explanatory variables and the similarity score corresponding to the similarity feature quantity group extracted by the feature quantity extraction unit 8 as the objective variable.

作成部９は、例えば線形回帰分析を行う。この場合、作成部９は、線形回帰モデルのパラメタを、例えばSVMを用いて学習し求める。 The creation unit 9 performs, for example, linear regression analysis. In this case, the creation unit 9 obtains the parameters of the linear regression model by learning, for example with a support vector machine (SVM).

作成された類似度スコアモデルは、図１の類似度スコアモデル記憶部４に記憶される。 The created similarity score model is stored in the similarity score model storage unit 4 of FIG.
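The regression step can be illustrated with a minimal sketch. The description names an SVM as one example learner. To stay dependency-free, the sketch below swaps in plain least-squares fitting by batch gradient descent, so it is not the learner of the embodiment. The function names `fit_linear_model` and `predict`, the two-dimensional feature vectors, and their values are all hypothetical toy data, and the step size is chosen for features in the range 0 to 1.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of a linear model


def fit_linear_model(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
    lr: float = 0.5,
    epochs: int = 5000,
) -> Model:
    """Fit score ~ w.x + b by batch gradient descent on mean squared error."""
    n = len(features)
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model: Model, x: Sequence[float]) -> float:
    """Similarity score for one similarity feature quantity group."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy similarity feature quantity groups (explanatory variables) and
# manually assigned similarity scores (objective variable). Hypothetical values.
train_features = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.6], [0.1, 0.3]]
train_scores = [3.0, 1.0, 2.0, 1.0]

model = fit_linear_model(train_features, train_scores)
```

The returned weights and bias play the role of the stored similarity score model: at search time, `predict` is applied to the feature group extracted for the input sentence and each search target document.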

[プログラム及び記録媒体] [Program and recording medium]
関連文書検索装置及び方法並びにモデル作成装置及び方法において説明した処理は、記載の順にしたがって時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。
The processing described for the related document search device and method and for the model creation device and method need not be executed only in time series in the order described. It may also be executed in parallel or individually, according to the processing capability of the device that executes the processing or as necessary.

また、関連文書検索装置又はモデル作成装置おける各処理をコンピュータによって実現する場合、関連文書検索装置又はモデル作成装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、その各処理がコンピュータ上で実現される。 When each process in the related document search device or model creation device is realized by a computer, the processing contents of the functions that the related document search device or model creation device should have are described by a program. Then, by executing this program on a computer, each process is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、各処理手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each processing means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

［変形例］ [Modification]
検索対象文書絞込部２の処理は行われなくてもよい。この場合、特徴量抽出部３は、入力された文章と、検索対象文書記憶部１に記憶されている各検索対象文書とについての類似度特徴量群を抽出する。
The processing of the search target document narrowing unit 2 need not be performed. In this case, the feature quantity extraction unit 3 extracts a similarity feature quantity group for the input sentence and each search target document stored in the search target document storage unit 1.
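This modification amounts to making the narrowing step optional. The sketch below is one hypothetical way to express that: `select_candidates`, the `classify` callable, the record fields, and the category labels are all assumptions made for illustration. Only the document IDs echo the FIG. 10 example.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]  # hypothetical record with "id", "category", "text"


def select_candidates(
    input_text: str,
    stored_documents: List[Document],
    classify: Optional[Callable[[str], str]] = None,
) -> List[Document]:
    """Return the documents handed to feature quantity extraction.

    With classify=None the narrowing step is skipped and every stored
    search target document is returned. Otherwise only documents whose
    category matches the category determined for the input are kept.
    """
    if classify is None:
        return list(stored_documents)
    category = classify(input_text)
    return [d for d in stored_documents if d["category"] == category]


# Placeholder search target documents with made-up categories.
stored = [
    {"id": "pid00001", "category": "A", "text": "placeholder"},
    {"id": "pid00002", "category": "B", "text": "placeholder"},
    {"id": "pid00003", "category": "A", "text": "placeholder"},
]

all_docs = select_candidates("some input sentence", stored)
narrowed = select_candidates("some input sentence", stored, classify=lambda text: "A")
```

Skipping the narrowing step trades a larger number of feature extractions for the removal of any dependence on category determination.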

その他、この発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。 Needless to say, other modifications are possible without departing from the spirit of the present invention.

DESCRIPTION OF SYMBOLS
１検索対象文書記憶部 1 Search target document storage unit
２検索対象文書絞込部 2 Search target document narrowing unit
３特徴量抽出部 3 Feature quantity extraction unit
３１第一計算部 31 First calculation unit
３２第二計算部 32 Second calculation unit
４類似度スコアモデル記憶部 4 Similarity score model storage unit
５類似度スコア計算部 5 Similarity score calculation unit
６出力条件記憶部 6 Output condition storage unit
７出力部 7 Output unit
８特徴量抽出部 8 Feature quantity extraction unit
８１第一計算部 81 First calculation unit
８２第二計算部 82 Second calculation unit
９作成部 9 Creation unit
１０学習用文書記憶部 10 Learning document storage unit

Claims

A related document search device comprising:
a feature quantity extraction unit that extracts a similarity feature quantity group for an input sentence and each search target document, the similarity feature quantity group for a given sentence and a given search target document being a plurality of different feature quantities representing the similarity between that sentence and that search target document;
a similarity score model storage unit that stores a similarity score model representing the relationship between a similarity feature quantity group and the similarity score corresponding to that similarity feature quantity group; and
a similarity score calculation unit that calculates a similarity score for the input sentence and each search target document, using the similarity score model stored in the similarity score model storage unit and the extracted similarity feature quantity group.

The related document search device according to claim 1,
wherein the similarity score model is a linear regression model.

The related document search device according to claim 1 or 2,
further comprising an output unit that outputs information about a search target document whose calculated similarity score is larger than, or is smaller than, a predetermined threshold.

The related document search device according to claim 1 or 2, further comprising:
an output unit that outputs a search target document having a high similarity score; and
an output condition storage unit that stores a plurality of word pairs,
wherein the output unit does not output a search target document when one word of a word pair is contained in the input sentence, the other word of that word pair is contained in the search target document, and that word pair is among the plurality of word pairs stored in the output condition storage unit.

The related document search device according to any one of claims 1 to 4, further comprising:
a search target document storage unit that stores a plurality of search target documents; and
a search target document narrowing unit that determines the category of the input sentence and selects search target documents of the determined category from the plurality of search target documents stored in the search target document storage unit,
wherein the feature quantity extraction unit extracts a similarity feature quantity group for the input sentence and each of the selected search target documents.

A related document search method comprising:
a feature quantity extraction step in which a feature quantity extraction unit extracts a similarity feature quantity group for an input sentence and each search target document, the similarity feature quantity group for a given sentence and a given search target document being a plurality of different feature quantities representing the similarity between that sentence and that search target document; and
a similarity score calculation step in which a similarity score calculation unit calculates a similarity score for the input sentence and each search target document, using a similarity score model representing the relationship between a similarity feature quantity group and the similarity score corresponding to that similarity feature quantity group, together with the extracted similarity feature quantity group.

A program for causing a computer to function as each unit of the related document search device according to any one of claims 1 to 5.

A model creation device comprising:
a feature quantity extraction unit that extracts a similarity feature quantity group for an input sentence and each document, the similarity feature quantity group for a given sentence and a given document being a plurality of different feature quantities representing the similarity between that sentence and that document; and
a creation unit that, the similarity score for the input sentence and each document being determined in advance, creates a similarity score model representing the relationship between a similarity feature quantity group and the similarity score corresponding to that similarity feature quantity group, by performing regression analysis with the extracted similarity feature quantity group as the explanatory variables and the similarity score corresponding to the extracted similarity feature quantity group as the objective variable.

A model creation method comprising:
a feature quantity extraction step in which a feature quantity extraction unit extracts a similarity feature quantity group for two different documents, the similarity feature quantity group for a given sentence and a given document being a plurality of different feature quantities representing the similarity between that sentence and that document; and
a creation step in which a creation unit, the similarity score for the two different documents being determined in advance, creates a similarity score model representing the relationship between a similarity feature quantity group and the similarity score corresponding to that similarity feature quantity group, by performing regression analysis with the extracted similarity feature quantity group as the explanatory variables and the similarity score corresponding to the extracted similarity feature quantity group as the objective variable.

A program for causing a computer to function as each unit of the model creation device according to claim 8.