JP6433937B2

JP6433937B2 - Keyword evaluation device, similarity evaluation device, search device, evaluation method, search method, and program

Info

Publication number: JP6433937B2
Application number: JP2016093227A
Authority: JP
Inventors: 大塚 淳史; 別所 克人; 平野 徹; 浅野 久子; 松尾 義博
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-05-06
Filing date: 2016-05-06
Publication date: 2018-12-05
Anticipated expiration: 2036-05-06
Also published as: JP2017201478A

Description

The present invention relates to a keyword evaluation device, a similarity evaluation device, a search device, an evaluation method, a search method, and a program that take speech or text as input.

Information retrieval systems are known that apply a search process such as keyword matching to a query entered by a user and retrieve documents that match the query. With keyword matching, however, a keyword in the query must exactly match a keyword in the document, which can lower the recall of the search. To address this, query expansion is sometimes performed: the keywords contained in the query are expanded and increased in number so that the query entered by the user matches a wider range of documents (Patent Document 1).

Concept search is also known as a retrieval method other than keyword matching. In concept search, each keyword is represented as a continuous-valued N-dimensional vector, and the centroid of these N-dimensional vectors is treated as the query vector. Likewise, a document vector is represented as the centroid of the keyword vectors in the document. The similarity between the query vector and each document vector is calculated, and search results are output in descending order of similarity, thereby retrieving documents that match the query. Unlike keyword matching, concept search has the advantage that documents on topics close to the query can be retrieved even when the keywords do not match exactly (Patent Document 2).
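The concept search described above can be sketched roughly as follows. This is a minimal illustration, not the method of Patent Document 2: the two-dimensional keyword vectors, the document names, and the function names (`centroid`, `cosine`, `concept_search`) are all assumptions made for the example.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical keyword vectors, chosen only for illustration.
KEYWORD_VECTORS = {
    "mail": [1.0, 0.0],
    "send": [0.8, 0.2],
    "printer": [0.0, 1.0],
    "paper": [0.1, 0.9],
}

def text_vector(keywords):
    """Represent a query or document as the centroid of its keyword vectors."""
    return centroid([KEYWORD_VECTORS[k] for k in keywords])

def concept_search(query_keywords, documents):
    """Return document ids in descending order of similarity to the query."""
    q = text_vector(query_keywords)
    scored = [(cosine(q, text_vector(kws)), doc_id)
              for doc_id, kws in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

documents = {"doc_mail": ["mail", "send"], "doc_print": ["printer", "paper"]}
ranking = concept_search(["mail"], documents)
```

Here `doc_mail` ranks ahead of `doc_print`, because its centroid lies close to the query vector.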

Patent Document 1: JP 2010-123036 A
Patent Document 2: JP 2010-182041 A

However, both keyword matching and concept search presuppose that a query is treated as a set of keywords (bag-of-words). That is, when a query is entered in natural language, keywords are extracted from it by morphological analysis or keyword extraction, and documents and the like that contain topics close to the query are then retrieved.

Consequently, in the process of extracting keywords from natural language to generate a keyword set, the word order and syntactic information that the natural-language text originally carried may be lost.

For example, suppose that in an FAQ search two queries expressed in natural language are entered: "メールが送信できない" ("mail cannot be sent") and "送信できないメールがある" ("there is mail that cannot be sent"). The two queries mean different things, but when they are converted into keyword sets, both become a keyword set containing the same elements, "メール, 送信, できない" ("mail, send, cannot"), and the difference between the two queries can no longer be distinguished. Thus, when performing information retrieval in natural language, it is sometimes necessary to consider not only the keywords contained in a sentence but also the context around those keywords.
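The loss of information in this example can be reproduced in a few lines. The keyword lists below are assumed to be the output of some keyword extraction step that is not shown; only their contents follow the text.

```python
# Keywords as they might be extracted from each query, in sentence order.
# The extraction step itself is assumed and not shown here.
query_a = ["メール", "送信", "できない"]   # from "メールが送信できない"
query_b = ["送信", "できない", "メール"]   # from "送信できないメールがある"

bag_a = frozenset(query_a)
bag_b = frozenset(query_b)

same_order = query_a == query_b   # False: the word order differs
same_bag = bag_a == bag_b         # True: the keyword sets coincide
```

Once only the sets are kept, nothing remains that could tell the two queries apart.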

The present invention has been made in view of the above circumstances. One object is to provide a keyword evaluation device, a keyword evaluation method, and a program that can accurately evaluate the importance of keywords contained in an input sentence expressed in natural language. Another object is to provide a similarity evaluation device, a search device, a similarity evaluation method, a search method, and a program that can evaluate the similarity between an input sentence expressed in natural language and a sentence to be compared, and can accurately retrieve sentences similar to the input sentence.

To achieve the above object, a keyword evaluation device according to the present invention includes: a calculation unit that, based on a first keyword extracted from an input first sentence and a second keyword that is a keyword in an input second sentence similar to the first keyword, calculates at least two of the following similarities: the similarity between the keywords, the similarity between the phrases containing the keywords, and the similarity between the dependencies of the phrases containing the keywords; and an evaluation unit that evaluates the importance of the first keyword as higher the smaller the absolute value of the change value of the at least two similarities calculated by the calculation unit.

A similarity evaluation device according to the present invention includes: a calculation unit that, for the combination with the highest similarity among the combinations of a word contained in an input first sentence and a word contained in an input second sentence, takes the word contained in the first sentence as a first keyword and the word contained in the second sentence as a second keyword, and calculates at least two of the following similarities: the similarity between the keywords, the similarity between the phrases containing the keywords, and the similarity between the dependencies of the phrases containing the keywords; and an evaluation unit that evaluates the first sentence and the second sentence as more similar the smaller the absolute value of the change value of the at least two similarities calculated by the calculation unit.

A search device according to the present invention includes: a storage unit that stores, for each of a plurality of search target sentences prepared in advance, a keyword vector representing each keyword contained in the search target sentence, a phrase vector representing, for each keyword, the phrase containing the keyword, and a dependency vector representing, for each keyword, the dependency relation that includes the dependency destination of the phrase containing the keyword; a calculation unit that, for each of the plurality of search target sentences and for the combination with the highest similarity among the combinations of a keyword contained in an input query sentence and a keyword contained in the search target sentence, takes the keyword contained in the query sentence as a first keyword and the keyword contained in the search target sentence as a second keyword, and calculates at least two of the following similarities: the similarity between the keywords based on the keyword vectors, the similarity between the phrases containing the keywords based on the phrase vectors, and the similarity between the dependencies of the phrases containing the keywords based on the dependency vectors; an evaluation unit that, for each of the plurality of search target sentences, evaluates the query sentence and the search target sentence as more similar the smaller the absolute value of the change value of the at least two similarities calculated by the calculation unit; and a search unit that retrieves search target sentences similar to the query sentence based on the evaluation result of the evaluation unit.

In a keyword evaluation method according to the present invention, a computer executes: a step of calculating at least two of the following similarities: the similarity between keywords, the similarity between the phrases containing the keywords, and the similarity between the dependencies of the phrases containing the keywords, based on a first keyword extracted from a first sentence input via an input unit and a second keyword, which is a keyword similar to the first keyword in a second sentence that was input via the input unit together with the first sentence and stored in a storage device in association with the first sentence; and a step of displaying, on a display device, an evaluation result in which the importance of the first keyword is evaluated as higher the smaller the absolute value of the change value of the at least two similarities.

In a sentence similarity evaluation method according to the present invention, a computer executes: a step of, for the combination with the highest similarity among the combinations of a word contained in a first sentence input via an input unit and a word contained in a second sentence that was input via the input unit together with the first sentence and stored in a storage device in association with the first sentence, taking the word contained in the first sentence as a first keyword and the word contained in the second sentence as a second keyword, and calculating at least two of the following similarities: the similarity between the keywords, the similarity between the phrases containing the keywords, and the similarity between the dependencies of the phrases containing the keywords; and a step of evaluating the first sentence and the second sentence as more similar the smaller the absolute value of the change value of the at least two similarities, and displaying the evaluation result on a display device.

In a sentence search method according to the present invention, a computer executes: a step of generating, for each of a plurality of search target sentences prepared in advance, a keyword vector representing each keyword contained in the search target sentence, a phrase vector representing, for each keyword, the phrase containing the keyword, and a dependency vector representing, for each keyword, the dependency relation that includes the dependency destination of the phrase containing the keyword, and storing the keyword vectors, phrase vectors, and dependency vectors generated for each search target sentence in a storage device in association with one another; a step of, for each of the plurality of search target sentences and for the combination with the highest similarity among the combinations of a keyword contained in a query sentence input via an input unit and a keyword contained in the search target sentence, taking the keyword contained in the query sentence as a first keyword and the keyword contained in the search target sentence as a second keyword, calculating at least two of the following similarities: the similarity between the keywords based on the keyword vectors, the similarity between the phrases containing the keywords based on the phrase vectors, and the similarity between the dependencies of the phrases containing the keywords based on the dependency vectors, and storing the calculated similarities in the storage device in association with each search target sentence; a step of, for each of the plurality of search target sentences, evaluating the query sentence and the search target sentence as more similar the smaller the absolute value of the change value of the calculated at least two similarities; and a step of retrieving, based on the evaluation, search target sentences similar to the query sentence and displaying them on a display device.

A program for the keyword evaluation device according to the present invention is a program for causing a computer to function as each unit of the keyword evaluation device.

A program for the similarity evaluation device according to the present invention is a program for causing a computer to function as each unit of the similarity evaluation device.

A program for the search device according to the present invention is a program for causing a computer to function as each unit of the search device.

As described above, the keyword evaluation device, keyword evaluation method, and program of the present invention have the effect that the importance of keywords contained in an input sentence expressed in natural language can be evaluated accurately.
Further, the similarity evaluation device, search device, similarity evaluation method, search method, and program of the present invention have the effect that the similarity between an input sentence expressed in natural language and a sentence to be compared can be evaluated, and sentences similar to the input sentence can be retrieved accurately.

A schematic diagram showing a configuration example of the similarity evaluation device according to the first embodiment.
A diagram explaining an example of the result of dependency analysis.
A flowchart showing an example of the similarity evaluation processing routine in the similarity evaluation device.
A diagram showing an example of how the score is calculated when the similarity evaluation processing routine is executed.
A schematic diagram showing a configuration example of the keyword evaluation device according to the second embodiment.
A flowchart showing an example of the keyword evaluation processing routine in the keyword evaluation device.
A schematic diagram showing a configuration example of the search device according to the third embodiment.
A flowchart showing an example of the search processing routine in the search device.

Embodiments of the present invention are described in detail below with reference to the drawings. In the following, components or processes that serve the same function are given the same reference numerals throughout the drawings, and duplicate descriptions are omitted as appropriate.

<First Embodiment>
The first embodiment describes a similarity evaluation device 100 that takes two sentences written in natural language, a first sentence and a second sentence, as input, quantifies the similarity between the two sentences, and outputs it as a score.

<System Configuration>
FIG. 1 is a diagram showing a system configuration example of the similarity evaluation device 100. As shown in FIG. 1, the similarity evaluation device 100 is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the similarity evaluation processing routine described later. Functionally, it is configured as follows.

The similarity evaluation device 100 includes an input unit 10, a computation unit 20, a storage unit 30, and an output unit 40. The computation unit 20 includes a sentence analysis unit 21, a vector generation unit 22, a calculation unit 23, and an evaluation unit 24, and the evaluation unit 24 in turn includes a similarity change rate calculation unit 25 and a similarity evaluation unit 26.

The first and second sentences input to the input unit 10 are passed to the sentence analysis unit 21. The sentence analysis unit 21 applies language processing, such as keyword extraction and phrase dependency analysis, to each of the first and second sentences and analyzes their language structure.

The vector generation unit 22 takes as input the results of the language structure analysis of the first and second sentences performed by the sentence analysis unit 21 and, for each sentence, generates a concept vector for each of the different units of expression: keywords, phrases, and phrase dependencies.

The calculation unit 23 then takes as input the concept vectors generated for each unit of expression by the vector generation unit 22 and calculates the similarity between keywords, between phrases, and between phrase dependencies.

The evaluation unit 24 takes as input the similarities between keywords, between phrases, and between phrase dependencies calculated by the calculation unit 23. The similarity change rate calculation unit 25 calculates a similarity change rate, and based on that rate the similarity evaluation unit 26 quantifies the final similarity between the first and second sentences and evaluates it as a score.

The output unit 40 outputs the score evaluated by the evaluation unit 24, thereby reporting the similarity between the first sentence and the second sentence.

In what follows, the granularity of each unit of expression (keyword, phrase, and phrase dependency) is sometimes discussed. A phrase is a coarser unit than a keyword, and the span that also covers a phrase's dependency is coarser than the phrase itself. The keyword is therefore the finest-grained unit and the phrase dependency the coarsest.

Next, the processing performed by the computation unit 20 is described in detail.

<Sentence Analysis Unit>
The sentence analysis unit 21 applies language processing to each of the first and second sentences to perform dependency analysis and keyword extraction.

For each of the first and second sentences, the sentence analysis unit 21 uses, for example, a dependency parser to obtain information on language structure, such as the morphemes in the sentence, the part of speech of each morpheme, phrase information (the number of phrases and the morphemes contained in each phrase), and the dependency relations between phrases. There is no restriction on the dependency parser used by the sentence analysis unit 21; any parser may be used as long as it can obtain the language structure information listed above as examples.

FIG. 2 shows an example of the result of dependency analysis by the sentence analysis unit 21. For example, when "メールが送信できない" ("mail cannot be sent") is received as the second sentence, the result shows that "メールが" is phrase 1, "送信できない" is phrase 2, and phrase 1 depends on phrase 2. Each phrase is also divided into morphemes, and the part of speech of each morpheme is shown.
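One way to hold such an analysis result in memory is sketched below. The class names, the morpheme segmentation of "送信できない", and the part-of-speech labels are illustrative assumptions; the actual output depends on the dependency parser used and is not reproduced from FIG. 2.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Morpheme:
    surface: str
    pos: str  # part-of-speech label; the labels used below are illustrative

@dataclass
class Phrase:
    index: int
    morphemes: List[Morpheme]
    head: Optional[int]  # index of the phrase this one depends on, or None

    @property
    def text(self) -> str:
        """Surface string of the phrase, rebuilt from its morphemes."""
        return "".join(m.surface for m in self.morphemes)

# Assumed analysis of "メールが送信できない": phrase 1 depends on phrase 2.
parsed = [
    Phrase(1, [Morpheme("メール", "noun"), Morpheme("が", "particle")], head=2),
    Phrase(2, [Morpheme("送信", "noun"), Morpheme("でき", "verb"),
               Morpheme("ない", "auxiliary")], head=None),
]

dependencies = [(p.text, parsed[p.head - 1].text)
                for p in parsed if p.head is not None]
```

With this layout, `dependencies` lists each phrase together with the phrase it depends on, which is the information the later vector generation step needs.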

Further, based on the result of the dependency analysis, the sentence analysis unit 21 extracts keywords from each sentence according to a keyword extraction rule. There is no particular restriction on the keyword extraction rule used by the sentence analysis unit 21; it can be defined freely according to the purpose for which the similarity between sentences is evaluated. In general, it is preferable to extract independent words, such as nouns and verbs, from the sentence as keywords.

The sentence analysis unit 21 manages the language structure information obtained from the first and second sentences, together with the extracted keywords, in association with each sentence.

<Vector Generation Unit>
Based on the results of analyzing the first and second sentences in the sentence analysis unit 21, the vector generation unit 22 creates concept vectors for evaluating the similarity between the sentences.

The vector generation unit 22 first generates keyword vectors, which are the concept vectors of the keywords contained in each sentence. It then combines keyword vectors to generate phrase vectors and dependency vectors, which are the concept vectors of phrases and of dependencies, respectively.

To generate keyword vectors, a concept vector model must be generated in advance. A concept vector model can be generated by preparing a document set for concept vector generation and applying a concept vector generation method to it.

There is no particular restriction on the document set for concept vector generation or on the concept vector generation method used by the vector generation unit 22. Any document set can be used. For example, it may be a set of Wikipedia (registered trademark) pages whose content overlaps with the set of sentences that may be input as the first sentence (the input sentence set), or a set of Web pages contained in the results of a Web search performed with keywords extracted from the input sentence set.

Likewise, any concept vector generation model can be used as the concept vector generation method, for example Latent Semantic Indexing (LSI) using singular value decomposition, a topic model, or a model using a neural network. In doing so, the concept vector generation model used by the vector generation unit 22 is made to generate vectors not only for the keywords contained in a sentence but also for all other morphemes, such as particles.

The vector generation unit 22 then combines the keyword vector of the keyword contained in a phrase with the vectors of the morphemes other than the keyword contained in that phrase, and generates the phrase vector of the phrase containing the keyword for each of the first and second sentences.

There is no restriction on the method of combining vectors. Besides combining by centroid vector, a method that uses a neural network, such as a Recursive AutoEncoder (RAE), may be used.

Specifically, when combining by centroid vector, the vector generation unit 22 calculates the centroid of the vectors of all morphemes contained in the phrase, including the keyword, and takes the vector representing that centroid as the phrase vector of the phrase containing the keyword. When combining by RAE, the vector generation unit 22 first combines the vectors of the first and second morphemes counted from the beginning of the phrase containing the keyword, and then combines the resulting vector with the vector of the third morpheme. Thereafter, following the order of the morphemes, it repeatedly combines the vector obtained so far with the vector of the next morpheme until the vectors of all morphemes in the phrase have been combined, and takes the finally obtained vector as the phrase vector of the phrase containing the keyword.
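The two combination orders described above can be sketched as follows. The morpheme vectors are made up for illustration, and `average_pair` is only a stand-in for a trained RAE encoder, which would map two vectors to one through learned weights; the point of the sketch is the left-to-right folding order, not the encoder itself.

```python
from functools import reduce

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def fold_in_word_order(vectors, compose):
    """Combine vectors pairwise from the left: compose(compose(v1, v2), v3), ..."""
    return reduce(compose, vectors)

def average_pair(u, v):
    # Placeholder for a trained RAE encoder (an assumption of this sketch).
    return [(a + b) / 2 for a, b in zip(u, v)]

# Hypothetical vectors for the three morphemes of one phrase, in word order.
morpheme_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

phrase_by_centroid = centroid(morpheme_vectors)
phrase_by_fold = fold_in_word_order(morpheme_vectors, average_pair)
```

The centroid ignores morpheme order, whereas the sequential fold weights later morphemes differently from earlier ones, so the two results generally differ even with the same inputs.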

Next, the vector generation unit 22 generates dependency vectors representing the dependency relations of phrases. A phrase dependency relation here means the dependency between two phrases. The vector generation unit 22 therefore generates a dependency vector by combining the phrase vectors of the two phrases at the source and the destination of the dependency. As with the generation of phrase vectors, there is no restriction on how these phrase vectors are combined; any method can be used, including combining by centroid vector and combining with a neural network such as an RAE.

In Japanese, phrase dependencies are unidirectional, so every phrase other than the one corresponding to the predicate has exactly one phrase that it depends on.

For example, in the sentence "横浜で赤い帽子を買った" ("I bought a red hat in Yokohama"), the phrase "赤い" ("red") depends on the phrase "帽子を" ("hat"). The vector generation unit 22 therefore combines the phrase vector of "赤い" with the phrase vector of "帽子を" to generate the dependency vector corresponding to "赤い帽子を" ("red hat").

The phrases "横浜で" ("in Yokohama") and "帽子を" ("hat") both depend on the phrase "買った" ("bought"). The vector generation unit 22 therefore generates a dependency vector corresponding to "横浜で買った" ("bought in Yokohama") and a dependency vector corresponding to "帽子を買った" ("bought a hat").

The phrase "買った" ("bought"), which corresponds to the predicate, has no phrase to depend on, so the vector generation unit 22 uses the phrase vector of "買った" as its dependency vector without modification.
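The dependency vectors for this example sentence can be sketched as follows, assuming combination by centroid. The three-dimensional phrase vectors are invented for illustration; only the dependency structure (`heads`) follows the text.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical phrase vectors for "横浜で赤い帽子を買った".
phrase_vectors = {
    "横浜で": [1.0, 0.0, 0.0],
    "赤い": [0.0, 1.0, 0.0],
    "帽子を": [0.0, 0.0, 1.0],
    "買った": [1.0, 1.0, 1.0],
}

# Each phrase maps to the phrase it depends on; the predicate maps to None.
heads = {"横浜で": "買った", "赤い": "帽子を", "帽子を": "買った", "買った": None}

def dependency_vector(phrase):
    """Combine a phrase vector with that of the phrase it depends on."""
    head = heads[phrase]
    if head is None:
        # The predicate has no dependency destination: reuse its phrase vector.
        return list(phrase_vectors[phrase])
    return centroid([phrase_vectors[phrase], phrase_vectors[head]])

dependency_vectors = {p: dependency_vector(p) for p in phrase_vectors}
```

Each non-predicate phrase yields one dependency vector because it has exactly one phrase to depend on, and the predicate falls back to its own phrase vector.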

Through the above processing, the vector generation unit 22 generates, for each of the first and second sentences, a group of keyword vectors, a group of phrase vectors for the phrases containing the keywords, and a group of dependency vectors representing the dependency relations of the phrases containing the keywords, and associates them with the respective sentence.

<Calculation Unit>
The calculation unit 23 calculates the similarity between keywords, the similarity between clauses, and the similarity between clause dependencies, based on the keyword vectors, clause vectors, and dependency vectors generated by the vector generation unit 22. The results differ depending on which of the first and second sentences is taken as the reference sentence for the evaluation; as stated earlier, the first sentence is used as the reference here.

First, the calculation unit 23 selects one keyword contained in the first sentence, which serves as the reference. The keywords of the first sentence have already been extracted by the sentence analysis unit 21. The calculation unit 23 then calculates the similarity between the keyword selected from the first sentence (the keyword of interest) and every keyword in the second sentence, using the corresponding keyword vectors, and selects the keyword in the second sentence that is most similar to the keyword of interest (the corresponding keyword).

The calculation unit 23 may use any method to calculate the similarity between keywords, provided that the resulting value is normalized to the range from 0 to 1; the cosine distance, for example, can be used. A value of 0 indicates that the keywords are not similar, larger values indicate higher similarity, and a value of 1 indicates maximum similarity.
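As a rough sketch of this selection step, the snippet below scores keyword vectors with cosine similarity and picks the best match. Clipping negative cosine values to 0 is an assumption made here to satisfy the 0-to-1 requirement; the text does not say how normalization is done. The function names and toy vectors are likewise hypothetical.

```python
import math

def similarity(u, v):
    """Cosine similarity clipped to [0, 1] (the clipping is an assumption)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / norm
    return min(1.0, max(0.0, cos))

def corresponding_keyword(target_vec, candidates):
    """Return (keyword, similarity) for the candidate most similar to target_vec.

    candidates: dict mapping each second-sentence keyword to its keyword vector.
    """
    scored = ((kw, similarity(target_vec, vec)) for kw, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])

# Toy keyword vectors with made-up values, for illustration only.
second_sentence = {"パスワード": [0.9, 0.1], "ログイン": [0.1, 0.9]}
best = corresponding_keyword([1.0, 0.0], second_sentence)
```

Here `best` holds the corresponding keyword together with the keyword similarity that the later steps reuse.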

Next, the calculation unit 23 calculates the similarity between the clause of the first sentence that contains the keyword of interest and the clause of the second sentence that contains the corresponding keyword, using the respective clause vectors and, for example, the same method that was used for the similarity between keywords.

If the second sentence has more than one clause containing the corresponding keyword, the calculation unit 23 calculates the similarity between the clause containing the keyword of interest and each of those clauses, and selects the pair of clauses with the highest similarity.

The calculation unit 23 then calculates the similarity between the dependency relation of the clause containing the keyword of interest and the dependency relation of the clause containing the corresponding keyword, using the respective dependency vectors and, for example, the same method that was used for the similarity between keywords.

In other words, for the keyword of interest selected from the first sentence, the calculation unit 23 calculates three kinds of similarity: the keyword similarity with the corresponding keyword in the second sentence; the clause similarity between the clause containing the keyword of interest and the clause containing the corresponding keyword; and the dependency similarity between the dependency relations of those two clauses.

The calculation unit 23 selects each keyword in the first sentence in turn as the keyword of interest and calculates the three kinds of similarity described above for every keyword in the first sentence.
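The per-keyword loop can be sketched as below. The sentence layout (a dict with `"keywords"` and `"clauses"`) and all names are hypothetical, the similarity measure is an assumed clipped cosine, and for brevity the sketch takes only the first clause of the first sentence that contains the keyword of interest.

```python
import math

def sim(u, v):
    """Cosine similarity clipped to [0, 1]; the choice of measure is an assumption."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, sum(a * b for a, b in zip(u, v)) / norm))

def three_similarities(first, second):
    """Map each first-sentence keyword to
    (corresponding keyword, keyword sim, clause sim, dependency sim).

    Each sentence is a dict (hypothetical layout):
      "keywords": {keyword: keyword vector}
      "clauses":  list of {"keywords": set, "vec": clause vector, "dep": dependency vector}
    """
    results = {}
    for kw, kw_vec in first["keywords"].items():
        # Corresponding keyword: the most similar keyword of the second sentence.
        match, kw_sim = max(
            ((k, sim(kw_vec, v)) for k, v in second["keywords"].items()),
            key=lambda pair: pair[1],
        )
        src = next(c for c in first["clauses"] if kw in c["keywords"])
        # If several clauses contain the corresponding keyword, keep the most similar one.
        dst = max(
            (c for c in second["clauses"] if match in c["keywords"]),
            key=lambda c: sim(src["vec"], c["vec"]),
        )
        results[kw] = (match, kw_sim, sim(src["vec"], dst["vec"]), sim(src["dep"], dst["dep"]))
    return results

# Toy data with made-up values, for illustration only.
first = {
    "keywords": {"a": [1.0, 0.0]},
    "clauses": [{"keywords": {"a"}, "vec": [1.0, 0.0], "dep": [1.0, 1.0]}],
}
second = {
    "keywords": {"b": [1.0, 0.0], "c": [0.0, 1.0]},
    "clauses": [
        {"keywords": {"b"}, "vec": [0.0, 1.0], "dep": [0.0, 1.0]},
        {"keywords": {"b"}, "vec": [1.0, 0.0], "dep": [1.0, 1.0]},
        {"keywords": {"c"}, "vec": [0.0, 1.0], "dep": [0.0, 1.0]},
    ],
}
result = three_similarities(first, second)
```

In the toy data the corresponding keyword appears in two clauses of the second sentence, and the second of them is chosen because its clause vector is closer to that of the source clause.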

<Evaluation Unit>
The evaluation unit 24 evaluates the similarity between the first sentence and the second sentence based on the three kinds of similarity that the calculation unit 23 calculated for each keyword in the first sentence.

Specifically, the similarity change rate calculation unit 25 first calculates similarity change rates from the three kinds of similarity.

Here, a similarity change rate is a measure of how the similarity changes when the unit of comparison between the first and second sentences is made coarser, from keyword to clause and from clause to clause dependency.

For example, the further the clause similarity falls below the keyword similarity, the more it indicates that the first and second sentences are similar at the keyword level but not at the level of clauses, which include the keyword's surroundings. That is, viewed in units coarser than the keyword, the clauses are not as similar as the keyword similarity suggests.

Put differently, the smaller the drop from the keyword similarity to the similarity of the clauses containing those keywords, the better the first and second sentences preserve their keyword-level similarity at the clause level as well.

Similarly, the further the dependency similarity falls below the clause similarity, the more it indicates that the first and second sentences are similar at the clause level but not at the level that includes the clauses' dependency relations. That is, viewed in units coarser than the clause, the dependency relations are not as similar as the clause similarity suggests.

Put differently, the smaller the drop from the clause similarity to the dependency similarity, the better the first and second sentences preserve their clause-level similarity at the dependency level as well.

In summary, the less the similarity drops when two sentences are compared at coarser granularity, the more similar the sentences are.

The similarity change rate is therefore defined to match this behavior, for example so that its absolute value grows as the similarity between the first and second sentences drops when the unit of comparison is made coarser. Specifically, the similarity change rate calculation unit 25 calculates the similarity change rates for all keywords in the first sentence using Equations (1) and (2).

Here, d_ws(word₁) is the similarity change rate, for any keyword word₁ in the first sentence, when the unit of comparison is coarsened from keyword to clause. word₂ is the corresponding keyword in the second sentence for word₁, and sim(word₁, word₂) is the similarity between the two keywords. seg₁ is the clause containing word₁ and seg₂ is the clause containing word₂, so sim(seg₁, seg₂) is the similarity between the corresponding clauses.

Likewise, d_sd(word₁) is the similarity change rate, for any keyword word₁ in the first sentence, when the unit of comparison is coarsened from the clause containing word₁ to that clause's dependency relation. dep₁ is the dependency relation of the clause containing word₁ and dep₂ is the dependency relation of the clause containing word₂, so sim(dep₁, dep₂) is the similarity between the corresponding clause dependencies.
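Equations (1) and (2) themselves are not reproduced in this text. Based on the symbol definitions above, on the statement that the finer-unit similarity is subtracted from the coarser-unit similarity, and on the values given in the FIG. 4 example, they can plausibly be reconstructed as follows; the typeset form in the original publication may differ.

```latex
\begin{align}
d_{ws}(\mathrm{word}_1) &= \mathrm{sim}(\mathrm{seg}_1, \mathrm{seg}_2) - \mathrm{sim}(\mathrm{word}_1, \mathrm{word}_2) \tag{1} \\
d_{sd}(\mathrm{word}_1) &= \mathrm{sim}(\mathrm{dep}_1, \mathrm{dep}_2) - \mathrm{sim}(\mathrm{seg}_1, \mathrm{seg}_2) \tag{2}
\end{align}
```

Under this reading, both rates are negative or zero whenever the similarity drops as the unit of comparison becomes coarser, and a larger drop gives a larger absolute value.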

Equations (1) and (2) subtract the similarity at the finer unit from the similarity at the coarser unit, but d_ws(word₁) and d_sd(word₁) may instead be calculated by subtracting the similarity at the coarser unit from the similarity at the finer unit.

The similarity evaluation unit 26 expresses the degree of similarity between the first and second sentences as a score, using the keyword similarities calculated by the calculation unit 23 and the similarity change rates calculated by the similarity change rate calculation unit 25.

Specifically, the similarity evaluation unit 26 calculates the score SIM(S₁, S₂), which indicates the degree of similarity between the first and second sentences, using Equation (3).

Here, S₁ is the first sentence and S₂ is the second sentence. w is a keyword in the first sentence S₁, and N is the number of keywords in S₁. w'_s2 is the corresponding keyword in the second sentence S₂, that is, the keyword most similar to w. As Equation (3) shows, the score SIM(S₁, S₂) is normalized to the range from 0 to 1, and the closer it is to 1, the more similar S₁ and S₂ are.

In this way, the arithmetic unit 20 calculates the similarity between the first and second sentences.

<Operation of the Similarity Evaluation Device>
Next, the operation of the similarity evaluation device 100 according to the first embodiment is described.

When the similarity evaluation device 100 receives a first sentence and a second sentence written in natural language through the input unit 10, it stores them in, for example, the storage unit 30. The CPU of the similarity evaluation device 100 then executes the similarity evaluation processing routine shown in FIG. 3.

First, in step S100, a dependency parser is applied to the first and second sentences to obtain information on the linguistic structure of each sentence. Keywords are then extracted from each sentence based on that information, and the linguistic structure information and the keywords are stored in the storage unit 30 in association with the respective sentence.

In step S102, a keyword vector is generated for each keyword in each sentence, based on a predetermined concept vector model and with reference to the linguistic structure information and keywords obtained in step S100 for the first and second sentences.

Then, with reference to the clause information in the linguistic structure information, a clause vector is generated for each keyword-containing clause of the first and second sentences by combining the keyword vectors of the keywords in the clause with the vectors of the other morphemes in the clause.

Further, with reference to the clause dependency relations in the linguistic structure information, a dependency vector is generated for each clause dependency relation of the first and second sentences by combining the clause vectors of the clauses in that relation.

The generated keyword vectors, clause vectors, and dependency vectors are stored in the storage unit 30 in association with the respective sentence.

In step S104, one keyword that has not yet been selected is chosen from the storage unit 30 as the keyword of interest, from among the keywords associated with the first sentence in step S100.

In step S106, the keyword vector for the keyword of interest selected in step S104 is obtained from the storage unit 30. The cosine distance between that keyword vector and every keyword vector associated with the second sentence is calculated, and the keyword whose vector has the shortest cosine distance to the vector of the keyword of interest is obtained as the corresponding keyword. The cosine distance between the vector of the keyword of interest and the vector of the corresponding keyword is stored in the storage unit 30 as the similarity between the keywords.

In step S108, with reference to the linguistic structure information obtained in step S100, the clause of the first sentence containing the keyword of interest selected in step S104 and the clause of the second sentence containing the corresponding keyword obtained in step S106 are obtained.

The cosine distance between the clause vector, generated in step S102, of the first-sentence clause containing the keyword of interest and the clause vector of the second-sentence clause containing the corresponding keyword is then calculated as the similarity between the clauses, and the result is stored in the storage unit 30.

Needless to say, the similarity between clauses may be expressed by a measure other than the cosine distance.

In step S110, with reference to the linguistic structure information obtained in step S100, the dependency relation of the first-sentence clause containing the keyword of interest selected in step S104 and the dependency relation of the second-sentence clause containing the corresponding keyword obtained in step S106 are obtained.

The cosine distance between the dependency vector, generated in step S102, for the dependency relation of the first-sentence clause containing the keyword of interest and the dependency vector for the dependency relation of the second-sentence clause containing the corresponding keyword is then calculated as the similarity between the clause dependencies, and the result is stored in the storage unit 30.

Needless to say, the similarity between clause dependencies may also be expressed by a measure other than the cosine distance.

Through steps S104 to S110, three values are calculated: the keyword similarity between the keyword of interest and the most similar keyword in the second sentence (the corresponding keyword); the clause similarity between the clause containing the keyword of interest and the clause containing the corresponding keyword; and the dependency similarity between the dependency relations of those two clauses.

In step S112, it is determined whether all keywords in the first sentence have been selected in step S104. If any keyword has not yet been selected, the processing returns to step S104. Repeating the selection of an unselected keyword as the keyword of interest until none remains yields the keyword similarity, clause similarity, and dependency similarity for every keyword in the first sentence.

If, on the other hand, the determination in step S112 is affirmative, that is, all keywords in the first sentence have been selected in step S104, the processing proceeds to step S114.

In step S114, the similarity change rate d_ws(word₁) is calculated for each keyword word₁ in the first sentence according to Equation (1) above, based on the keyword similarity sim(word₁, word₂), calculated in step S106, between word₁ and its corresponding keyword word₂ in the second sentence, and the clause similarity sim(seg₁, seg₂), calculated in step S108, between the clauses containing those keywords.

Likewise, the similarity change rate d_sd(word₁) is calculated for each keyword word₁ in the first sentence according to Equation (2) above, based on the clause similarity sim(seg₁, seg₂) calculated in step S108 and the dependency similarity sim(dep₁, dep₂), calculated in step S110, between the dependency relations of the clauses seg₁ and seg₂.

Then, in step S116, the score SIM(S₁, S₂) indicating the degree of similarity between the first sentence S₁ and the second sentence S₂ is calculated according to Equation (3) above, based on the keyword similarities sim(word₁, word₂) calculated in step S106 and the similarity change rates d_ws(word₁) and d_sd(word₁) calculated in step S114. The calculated score is stored in the storage unit 30, and the output unit 40 outputs it to, for example, a display device such as a monitor.

Equation (3) uses both similarity change rates d_ws(word₁) and d_sd(word₁) to calculate the score SIM(S₁, S₂), but the score may instead be calculated using at least one of d_ws(word₁) and d_sd(word₁).

<Execution Results of the Similarity Evaluation Device>
FIG. 4 shows an example of how the similarity evaluation device 100 according to the first embodiment calculates the score SIM(S₁, S₂) when the first sentence is 「ＰＷの変更をしたい」 ("I want to change my PW") and the second sentence is 「パスワードを変えたらログインできない」 ("I cannot log in after changing my password").

In this case, 「ＰＷ」 ("PW") and 「変更」 ("change", a noun) are extracted as keywords of the first sentence, for example, and 「パスワード」 ("password"), 「変える」 ("to change"), and 「ログイン」 ("login") as keywords of the second sentence. 「変える」 is the standard form of 「変えたら」.

When the keyword similarities were calculated, the second-sentence keyword most similar to 「ＰＷ」 was 「パスワード」, with a similarity of 0.90, and the second-sentence keyword most similar to 「変更」 was 「変える」, with a similarity of 0.95.

When the clause similarities were calculated, the similarity between the first-sentence clause 「ＰＷの」, which contains the keyword 「ＰＷ」, and the second-sentence clause 「パスワードを」, which contains the keyword 「パスワード」, was 0.75. The similarity between the first-sentence clause 「変更を」, which contains 「変更」, and the second-sentence clause 「変えたら」, which contains 「変える」, was 0.32.

Further, when the dependency similarities were calculated, the similarity between the dependency relation 「ＰＷの変更を」, formed by the first-sentence clause 「ＰＷの」 and the clause it depends on, and the dependency relation 「パスワードを変えたら」, formed by the second-sentence clause 「パスワードを」 and the clause it depends on, was 0.15. The similarity between the dependency relation 「変更をしたい」, formed by the first-sentence clause 「変更を」 and the clause it depends on, and the dependency relation 「変えたらログインできない」, formed by the second-sentence clause 「変えたら」 and the clause it depends on, was 0.04.

For the keyword 「ＰＷ」, Equation (1) gives the similarity change rate d_ws(ＰＷ) = -0.15 (0.75 minus 0.90) and Equation (2) gives d_sd(ＰＷ) = -0.60 (0.15 minus 0.75). For the keyword 「変更」, Equation (1) gives d_ws(変更) = -0.63 (0.32 minus 0.95) and Equation (2) gives d_sd(変更) = -0.28 (0.04 minus 0.32).

Accordingly, Equation (3) gives a score of 0.59 for the similarity between the first sentence 「ＰＷの変更をしたい」 and the second sentence 「パスワードを変えたらログインできない」.
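The arithmetic of this example can be retraced with a short script. The change rates follow the coarser-minus-finer subtraction that the text describes for Equations (1) and (2). Equation (3) is not reproduced in this text, so `assumed_score` is only an illustrative guess: it damps each keyword similarity by the geometric mean of the magnitudes of its two change rates and averages over the keywords. It evaluates to 0.59 on these figures, but that agreement alone does not establish that it is the source's Equation (3). All names are hypothetical.

```python
import math

# Similarities reported for the FIG. 4 example: (keyword, clause, dependency).
example = {
    "PW": (0.90, 0.75, 0.15),    # first-sentence keyword written as 「ＰＷ」 in the text
    "変更": (0.95, 0.32, 0.04),
}

def change_rates(kw_sim, seg_sim, dep_sim):
    """Return (d_ws, d_sd): coarser-unit similarity minus finer-unit similarity."""
    return seg_sim - kw_sim, dep_sim - seg_sim

def assumed_score(rows):
    """ASSUMPTION, not the source's Equation (3): scale each keyword similarity by
    1 minus the geometric mean of |d_ws| and |d_sd|, then average over keywords."""
    rows = list(rows)
    total = 0.0
    for kw_sim, seg_sim, dep_sim in rows:
        d_ws, d_sd = change_rates(kw_sim, seg_sim, dep_sim)
        total += kw_sim * (1.0 - math.sqrt(abs(d_ws * d_sd)))
    return total / len(rows)

rates = {kw: change_rates(*sims) for kw, sims in example.items()}
score = assumed_score(example.values())

for kw, (d_ws, d_sd) in rates.items():
    print(kw, round(d_ws, 2), round(d_sd, 2))
print(round(score, 2))
```

The printed change rates match the four values stated for 「ＰＷ」 and 「変更」, which is the part of the example that can be checked directly from the text.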

Thus, when evaluating the similarity between two sentences written in natural language, the similarity evaluation device 100 according to the first embodiment does not merely compare the keywords in each sentence; it also takes into account similarity in word order and syntax, namely the meaning of the clauses containing the keywords and of their dependency expressions.

The similarity evaluation device 100 can therefore evaluate the similarity to a sentence under comparison more accurately than a conventional similarity evaluation device that evaluates sentence similarity only from the similarity between the keywords in the sentences.

For example, suppose the first sentence 「メールが送信できなくなった」 ("I can no longer send mail") is compared with second sentence A 「メールが送信できない」 ("I cannot send mail") and with second sentence B 「送信できないメールがある」 ("There is mail that I cannot send"). Looking only at the keyword 「メール」 ("mail"), both A and B contain it, so the similarity between the first sentence and A does not differ from that between the first sentence and B.

Even when the unit of comparison is widened to the clause, the first-sentence clause 「メールが」 appears in both A and B, so the two similarities still do not differ.

When the unit of comparison is widened to the clause dependency relation, however, the first sentence gives 「メールが送信できなくなった」, whereas A gives 「メールが送信できない」 and B gives 「メールがある」 ("there is mail"). The first sentence is thus found to be more similar to A than to B, and the evaluation value is output as the score.

In the above description of the similarity evaluation device 100 according to the first embodiment, the input unit 10 receives the first and second sentences as text, as an example. Alternatively, the input unit 10 may receive speech corresponding to the first and second sentences and apply known speech recognition, which converts speech to text, to obtain the sentences as text.

In that case, speech can be used as the input interface of the similarity evaluation device 100, so the content to be evaluated does not have to be converted to text beforehand. The operability of the similarity evaluation device 100 is therefore better than when text is received as input.

<Second Embodiment>
The first embodiment explained that the smaller the drop indicated by the similarity change rates, which are calculated from the keyword similarity, the similarity of the clauses containing the keywords, and the similarity of those clauses' dependencies, the more similar the two sentences are. Put another way, among the keywords in a sentence, a keyword whose similarity drops less when compared at coarser granularity is a more important keyword, one with a greater influence on the judgment of sentence similarity.

Accordingly, the second embodiment describes a keyword evaluation device 200 that receives two similar sentences, a first sentence and a second sentence, and evaluates the importance of each keyword contained in the first sentence.

<System configuration example>
FIG. 5 is a diagram showing a system configuration example of the keyword evaluation device 200. The system configuration of the keyword evaluation device 200 in FIG. 5 differs from the system configuration example of the similarity evaluation device 100 in FIG. 1 according to the first embodiment in that the similarity evaluation unit 26 is replaced with a keyword importance evaluation unit 26A and, accordingly, the evaluation unit 24 is replaced with an evaluation unit 24A.

The rest of the configuration of the keyword evaluation device 200 is the same as the system configuration example of the similarity evaluation device 100.

The keyword importance evaluation unit 26A calculates the importance of each keyword contained in the first sentence, based on the similarity between keywords calculated by the calculation unit 23 and the similarity change rates calculated by the similarity change rate calculation unit 25.

When similarity to the second sentence S₂ is determined, the importance SIM(S_{1,w}, S₂) of a keyword w contained in the first sentence S₁ is calculated, for example, by equation (4).

Here, w′_{s2} denotes, as explained for equation (3), the corresponding keyword in the second sentence S₂ that has the highest similarity to the keyword w contained in the first sentence S₁. As equation (4) shows, the importance SIM(S_{1,w}, S₂) is normalized to take a value in the range from 0 to 1 inclusive, and the closer it is to 1, the higher the importance of the keyword w.

<Operation of the keyword evaluation device>
When the keyword evaluation device 200 receives a similar first sentence and second sentence at the input unit 10, it stores the received sentences in, for example, the storage unit 30. The keyword evaluation device 200 then executes, on the CPU, the keyword evaluation processing routine shown in FIG. 6.

The keyword evaluation processing routine shown in FIG. 6 differs from the similarity evaluation processing routine of the similarity evaluation device 100 according to the first embodiment shown in FIG. 3 in that step S116 is replaced with step S118. The other processing is the same as in the similarity evaluation processing routine of the similarity evaluation device 100. Therefore, only the processing of step S118 is described below.

In step S118, the importance SIM(S_{1,w}, S₂) of the keyword w, which indicates how strongly the keyword influences the result of determining the similarity of the first sentence to the second sentence, is calculated for each keyword w according to equation (4) above, based on the keyword similarity sim(word₁, word₂) calculated in step S106 and the similarity change rates d_ws(word₁) and d_sd(word₁) calculated in step S114.

The calculated importance SIM(S_{1,w}, S₂) is stored in the storage unit 30, and the output unit 40 outputs the importance SIM(S_{1,w}, S₂) of each keyword w to, for example, a display device such as a display.

Equation (4) for calculating the importance SIM(S_{1,w}, S₂) of the keyword w is only an example. Any other evaluation formula may be used instead of equation (4), provided that it rates the importance SIM(S_{1,w}, S₂) higher as the absolute value of the change between the keyword similarity and the clause similarity becomes smaller, or as the absolute value of the change between the clause similarity and the similarity between clause dependency relations becomes smaller.
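As a purely illustrative sketch of an alternative evaluation formula of the kind permitted here, the following Python function shows one hypothetical choice. It is not equation (4): the function name `keyword_importance`, the clamping, and the product form are all assumptions made for illustration. It takes the keyword similarity and the two similarity change rates as given inputs, and its value rises as the absolute values of the change rates shrink.

```python
def keyword_importance(sim_word: float, d_ws: float, d_sd: float) -> float:
    """Hypothetical importance score in [0, 1]; NOT equation (4) of the source.

    sim_word: similarity between the keyword and its corresponding keyword.
    d_ws:     change rate from keyword similarity to clause similarity.
    d_sd:     change rate from clause similarity to dependency similarity.
    """
    # Clamp the inputs so that the result always stays within [0, 1].
    base = min(max(sim_word, 0.0), 1.0)
    stay_ws = 1.0 - min(abs(d_ws), 1.0)
    stay_sd = 1.0 - min(abs(d_sd), 1.0)
    # Smaller absolute change rates leave more of the base similarity intact.
    return base * stay_ws * stay_sd


if __name__ == "__main__":
    print(keyword_importance(0.8, 0.0, 0.0))
    print(keyword_importance(0.8, 0.5, 0.25))
```

Any other function with the same monotonic behavior in |d_ws| and |d_sd| would fit the condition stated above equally well.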

Thus, the keyword evaluation device 200 according to the second embodiment rates the importance SIM(S_{1,w}, S₂) of a keyword w higher as the absolute values of the at least two similarity change rates d_ws(w) and d_sd(w) become smaller.

Therefore, in a search system that retrieves sentences from input keywords, for example, preferentially entering the keywords that the keyword evaluation device 200 rated as more important allows the target sentence to be retrieved accurately.

As with the similarity evaluation device 100 according to the first embodiment, the keyword evaluation device 200 may receive speech corresponding to the first sentence and the second sentence and apply known speech recognition, which converts speech into text, to obtain the first sentence and the second sentence as text.

<Third Embodiment>
The similarity evaluation device 100 according to the first embodiment and the keyword evaluation device 200 according to the second embodiment calculate the similarity change rates d_ws(w) and d_sd(w) between sentences at different granularities and, based on the calculated similarity change rates, evaluate either the similarity between the sentences or the importance of the keywords used to determine that similarity.

The third embodiment describes a search device 300 that, like the similarity evaluation device 100 according to the first embodiment and the keyword evaluation device 200 according to the second embodiment, calculates the similarity change rates d_ws(w) and d_sd(w) and, based on the calculated similarity change rates, searches a plurality of sentences for the sentence most similar to the first sentence.

<System configuration example>
FIG. 7 is a diagram showing a system configuration example of the search device 300. The system configuration example of the search device 300 in FIG. 7 differs from the system configuration example of the similarity evaluation device 100 in FIG. 1 according to the first embodiment in that the similarity evaluation unit 26 is replaced with a query sentence similarity evaluation unit 26B and, accordingly, the evaluation unit 24 is replaced with an evaluation unit 24B. In addition, a search unit 27 is added to the search device 300, and a search target sentence DB 30A is built in the storage unit 30 in advance.

The rest of the configuration of the search device 300 is the same as the system configuration example of the similarity evaluation device 100.

The search target sentence DB 30A is assumed to store a plurality of search target sentences in advance, each associated with keyword vectors, clause vectors, and dependency vectors generated in the same way as in the first embodiment.

The keyword vectors, clause vectors, and dependency vectors associated with each search target sentence can be, for example, the output of the vector generation unit 22 obtained when that search target sentence is input to the search device 300.

The search device 300 searches the plurality of search target sentences stored in advance in the search target sentence DB 30A for the search target sentence most similar to the content of a query sentence written in natural language.

Because the plurality of search target sentences are stored in the search target sentence DB 30A in advance, only the query sentence is input to the input unit 10 of the search device 300, unlike the similarity evaluation device 100 according to the first embodiment and the keyword evaluation device 200 according to the second embodiment.

Accordingly, the sentence analysis unit 21 performs dependency analysis on the query sentence received from the input unit 10 by the same method as in the similarity evaluation device 100, and extracts keywords from the query sentence based on the result of the dependency analysis.

The vector generation unit 22 then generates keyword vectors, clause vectors, and dependency vectors by the same method as in the similarity evaluation device 100, based on the keywords that the sentence analysis unit 21 extracted from the query sentence.

The calculation unit 23 calculates the similarity between keywords, the similarity between clauses, and the similarity between clause dependency relations by the same method as in the similarity evaluation device 100, based on the keyword vectors, clause vectors, and dependency vectors that the vector generation unit 22 generated for the query sentence and on the keyword vectors, clause vectors, and dependency vectors of the search target sentences stored in the search target sentence DB 30A.

Like the similarity evaluation unit 26 of the similarity evaluation device 100 according to the first embodiment, the query sentence similarity evaluation unit 26B calculates, for the query sentence S₁, the score SIM(S₁, S₂) of equation (3) for each search target sentence S₂, based on the keyword similarity calculated by the calculation unit 23 and the similarity change rates calculated by the similarity change rate calculation unit 25. The query sentence similarity evaluation unit 26B then evaluates, for example, the search target sentence S₂ whose score SIM(S₁, S₂) is closest to 1 as the search target sentence similar to the content of the query sentence S₁.

However, in a device such as the search device 300 that retrieves a search target sentence corresponding to the content of a query sentence, high keyword similarity, clause similarity, and clause dependency similarity do not necessarily mean that an appropriate search target sentence corresponding to the content of the query sentence is retrieved.

For example, a keyword that appears frequently within a sentence tends to represent the topic of that sentence, whereas a keyword that appears frequently across many sentences tends not to be an important keyword.

It is therefore preferable to take the weight of each keyword in the sentence into account, in addition to the similarity between keywords, the similarity between clauses, and the similarity between clause dependency relations.

Known methods for calculating the weight of a keyword in a sentence include, for example, the Term Frequency-Inverse Document Frequency (TF-IDF) method and the BM25 method, and such known keyword weighting methods can be applied in the query sentence similarity evaluation unit 26B.
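A minimal sketch of a TF-IDF weight calculation is shown below, treating each sentence as a document. The function name `tf_idf_weights`, the length-normalized term frequency, and the smoothed inverse document frequency are assumptions for illustration; they are only one common variant, and the source does not specify which variant is used.

```python
import math
from collections import Counter


def tf_idf_weights(sentences: list[list[str]]) -> list[dict[str, float]]:
    """Return one {keyword: weight} dict per tokenized sentence.

    Uses tf = count / sentence length and the smoothed
    idf = ln((1 + N) / (1 + df)) + 1, which is only one common variant.
    """
    n = len(sentences)
    # Document frequency: the number of sentences in which each keyword occurs.
    df = Counter(word for sent in sentences for word in set(sent))
    weights = []
    for sent in sentences:
        counts = Counter(sent)
        weights.append({
            word: (c / len(sent)) * (math.log((1 + n) / (1 + df[word])) + 1.0)
            for word, c in counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [["printer", "error", "printer"], ["network", "error"], ["error", "battery"]]
    for weights in tf_idf_weights(corpus):
        print(weights)
```

In this toy corpus, "error" occurs in every sentence and so receives a lower weight than "printer", which is frequent within one sentence only. This matches the tendency described above.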

Accordingly, the query sentence similarity evaluation unit 26B calculates, for example by equation (5), a score Score(S₁, S₂) that indicates the degree of similarity between the query sentence S₁ and the search target sentence S₂ and takes into account the keyword weights calculated by a known keyword weighting method.

Here, SIM(S_{1,w}, S₂) is the importance, given by equation (4), of the keyword w contained in the query sentence S₁, that is, the similarity between the query sentence S₁ and the search target sentence S₂ when only the keyword w contained in the query sentence S₁ is considered. In addition, weight denotes a weight value. Accordingly, weight(argmax_{w_s2}(sim(w, w_s2))) is the weight value of the keyword w_s2 in the search target sentence S₂ that has the highest similarity to the keyword w contained in the query sentence S₁.

Using the score Score(S₁, S₂) calculated by the query sentence similarity evaluation unit 26B, the search unit 27 retrieves from the search target sentence DB 30A, for example, the search target sentences S₂ that satisfy a predetermined condition on the score Score(S₁, S₂), and outputs them to the output unit 40.

When the search device 300 is connected to a network such as the Internet, the search target sentence DB 30A may be stored in an external device, such as a storage device connected to the network, and the search device 300 may refer to the search target sentence DB 30A stored in that external device.

<Operation of the search device>
When the search device 300 receives a query sentence written in natural language at the input unit 10, it stores the received query sentence in, for example, the storage unit 30. The search device 300 then executes, on the CPU, the search processing routine shown in FIG. 8.

The search processing routine shown in FIG. 8 differs from the similarity evaluation processing routine of the similarity evaluation device 100 according to the first embodiment shown in FIG. 3 in that step S107 is provided in place of step S106. In addition, steps S103, S120, and S122 are newly added to the search processing routine. The other processing is the same as in the similarity evaluation processing routine of the similarity evaluation device 100. The search processing routine is therefore described below with a focus on the processing that differs from the similarity evaluation processing routine.

In step S103, one search target sentence that has not yet been selected is selected from the plurality of search target sentences stored in advance in the search target sentence DB 30A.

Then, in step S107, the corresponding keyword most similar to the keyword of interest selected in step S104 is extracted from the keywords associated with the search target sentence selected in step S103. As in step S106 of FIG. 3, the similarity between keywords may be calculated, for example, as the cosine distance between the keyword vectors of the keyword of interest and the corresponding keyword, and the calculated cosine distance is stored in the storage unit 30 as the similarity between the keywords.
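The extraction of the corresponding keyword in step S107 can be sketched as follows. The sketch assumes that the "cosine distance" used as a similarity here means the cosine of the angle between two keyword vectors (cosine similarity). The names `cosine` and `corresponding_keyword` and the plain-list vector representation are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def corresponding_keyword(
    target_vec: list[float], candidates: dict[str, list[float]]
) -> tuple[str, float]:
    """Return the candidate keyword most similar to target_vec and its similarity.

    candidates maps each keyword of the search target sentence to its keyword
    vector and must not be empty.
    """
    best = max(candidates, key=lambda w: cosine(target_vec, candidates[w]))
    return best, cosine(target_vec, candidates[best])


if __name__ == "__main__":
    vectors = {"printer": [0.9, 0.1], "battery": [0.1, 0.8]}
    print(corresponding_keyword([1.0, 0.2], vectors))
```

The returned similarity corresponds to the value that step S107 stores as the similarity between the keyword of interest and its corresponding keyword.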

Thereafter, in steps S108 and S110, the similarity between the clause containing the keyword of interest and the clause containing the corresponding keyword is calculated, as is the similarity between the dependency relation of the clause containing the keyword of interest and that of the clause containing the corresponding keyword. In step S112, it is determined whether the processing of steps S104 to S112 has been performed for all keywords contained in the query sentence.

Then, in step S114, the similarity change rate d_ws(word₁) is calculated for each keyword contained in the query sentence according to equation (1) above, based on the keyword similarity sim(word₁, word₂), calculated in step S107, between a keyword word₁ contained in the query sentence and the corresponding keyword word₂ of the search target sentence, and on the similarity sim(seg₁, seg₂), calculated in step S108, between the clauses corresponding to those keywords.

In addition, the similarity change rate d_sd(word₁) is calculated for each keyword contained in the query sentence according to equation (2) above, based on the clause similarity sim(seg₁, seg₂) calculated in step S108 and on the similarity sim(dep₁, dep₂), calculated in step S110, between the dependency relations of the clauses seg₁ and seg₂.

Next, in step S116, the similarity SIM(S_{1,w}, S₂) between the query sentence S₁ and the search target sentence S₂ when only the keyword w contained in the query sentence S₁ is considered is calculated for each keyword of the query sentence S₁ according to equation (4) above, based on the keyword similarity sim(word₁, word₂) calculated in step S107 and the similarity change rates d_ws(word₁) and d_sd(word₁) calculated in step S114.

Further, the score Score(S₁, S₂) is calculated according to equation (5) above, based on the similarity SIM(S_{1,w}, S₂) calculated in this step for each keyword w and on the weight value of the keyword w_s2 in the search target sentence S₂ that has the highest similarity to the keyword w contained in the query sentence S₁. The calculated score Score(S₁, S₂) is stored in association with the search target sentence S₂, for example in the search target sentence DB 30A.

As described above, the weight value of the keyword w_s2 may be calculated using a known weighting method such as the TF-IDF method.

Here, as an example, the similarity between the query sentence and the search target sentence is calculated according to equation (5), but the method of calculating this similarity is not limited to equation (5). For example, the similarity may be calculated according to equation (3), or the sum of the per-keyword importances SIM(S_{1,w}, S₂) of the query sentence calculated according to equation (4) may be used as the score indicating the similarity between the query sentence and the search target sentence.

In step S120, it is determined whether all search target sentences contained in the search target sentence DB 30A have been selected in step S103. If a search target sentence that has not yet been selected in step S103 remains, the processing returns to step S103.

By repeatedly selecting search target sentences in step S103 until no unselected search target sentence remains in the search target sentence DB 30A and the determination in step S120 becomes affirmative, the score Score(S₁, S₂) of each search target sentence S₂ for the query sentence S₁ is calculated in step S116.

When the determination in step S120 is affirmative, the processing proceeds to step S122.

In step S122, the score Score(S₁, S₂) associated with each search target sentence is referenced, and the search target sentences whose score Score(S₁, S₂) is equal to or greater than a predetermined threshold are retrieved from the search target sentence DB 30A. The output unit 40 then outputs the search target sentences acquired in step S122 to, for example, a display device such as a display.

The search target sentences acquired in step S122 are not limited to those whose score Score(S₁, S₂) is equal to or greater than a predetermined threshold. For example, a predetermined number of search target sentences may be retrieved from the search target sentence DB 30A in descending order of the score Score(S₁, S₂).
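The two selection strategies for step S122 can be sketched as follows, assuming the scores are held in a dictionary keyed by search target sentence. The function names and the descending output order of the threshold variant are assumptions for illustration.

```python
import heapq


def select_by_threshold(scores: dict[str, float], threshold: float) -> list[str]:
    """Sentences whose score is at least `threshold`, highest score first."""
    hits = [s for s, v in scores.items() if v >= threshold]
    return sorted(hits, key=lambda s: scores[s], reverse=True)


def select_top_k(scores: dict[str, float], k: int) -> list[str]:
    """The k highest-scoring sentences, highest score first."""
    return heapq.nlargest(k, scores, key=lambda s: scores[s])


if __name__ == "__main__":
    scores = {"sentence A": 0.9, "sentence B": 0.2, "sentence C": 0.6}
    print(select_by_threshold(scores, 0.5))
    print(select_top_k(scores, 2))
```

The threshold variant may return no sentence at all when every score is low, whereas the top-k variant always returns up to k sentences regardless of how low their scores are.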

The search device 300 may receive speech corresponding to the query sentence at the input unit 10 and apply known speech recognition, which converts speech into text, to the received speech, thereby obtaining the query sentence as text.

In this case, the speech received by the search device 300 can be used for the search as it is, so the operability of the search device 300 is improved compared with the case where text is received as input.

Thus, the search device 300 according to the third embodiment searches the search target sentence DB 30A, stored in advance in the storage unit 30, for search target sentences similar to the content of the query sentence. When the search device 300 receives a query sentence, it uses the dependency analyzer to obtain information on linguistic structure for the query sentence only, and generates the keyword vectors, clause vectors, and dependency vectors corresponding to the keywords contained in the query sentence. Because the keyword vectors, clause vectors, and dependency vectors of the search target sentences are stored in the search target sentence DB 30A in advance, the score Score(S₁, S₂) can be calculated faster than when the vectors of the search target sentences are generated each time a query sentence is received.
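The benefit of storing the vectors in advance can be sketched as a scoring loop over a precomputed index. In this sketch, `score_all`, the `Vectors` alias, and the pluggable `score_fn` are hypothetical names, and only keyword vectors are modeled for brevity, whereas the search target sentence DB 30A also holds clause vectors and dependency vectors. The scoring function stands in for equation (5) or one of its alternatives and is not defined here.

```python
from typing import Callable

Vectors = dict[str, list[float]]  # keyword -> precomputed keyword vector


def score_all(
    query_vectors: Vectors,
    index: dict[str, Vectors],
    score_fn: Callable[[Vectors, Vectors], float],
) -> dict[str, float]:
    """Score every stored sentence against the query without re-analysing it.

    index maps each search target sentence to vectors computed ahead of time,
    so only the query needs analysis and vector generation at search time.
    """
    return {sentence: score_fn(query_vectors, vecs) for sentence, vecs in index.items()}


if __name__ == "__main__":
    index = {"s1": {"printer": [1.0, 0.0]}, "s2": {"battery": [0.0, 1.0]}}
    # Toy score: number of keywords shared by the query and the stored sentence.
    shared = lambda q, t: float(len(set(q) & set(t)))
    print(score_all({"printer": [1.0, 0.0]}, index, shared))
```

The loop corresponds to the repetition of steps S103 to S120: each pass reads stored vectors instead of running dependency analysis on the search target sentence.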

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the similarity evaluation device 100, the keyword evaluation device 200, and the search device 300 described above each contain a computer system. When a WWW system is used, the "computer system" also includes the homepage providing environment (or display environment).

This specification has described embodiments in which the program is installed in ROM in advance, but the program can also be provided stored on a computer-readable recording medium.

Description of symbols
10 ... input unit
20 ... arithmetic unit
21 ... sentence analysis unit
22 ... vector generation unit
23 ... calculation unit
24 (24A, 24B) ... evaluation unit
25 ... similarity change rate calculation unit
26 ... similarity evaluation unit
26A ... keyword importance evaluation unit
26B ... query sentence similarity evaluation unit
27 ... search unit
30 ... storage unit
40 ... output unit
100 ... similarity evaluation device
200 ... keyword evaluation device
300 ... search device
d_sd, d_ws ... similarity change rate
30A ... search target sentence DB

Claims

A keyword evaluation device including:
a calculation unit that, based on a first keyword extracted from an input first sentence and a second keyword in an input second sentence that is similar to the first keyword, calculates at least two of the similarity between the keywords, the similarity between clauses containing the keywords, and the similarity between dependency relations of the clauses containing the keywords; and
an evaluation unit that rates the importance of the first keyword higher as the absolute value of the change between the at least two similarities calculated by the calculation unit becomes smaller.

The keyword evaluation device described above, wherein the evaluation unit rates the importance of the first keyword higher as the absolute value of the change between the similarity between the keywords and the similarity between the clauses calculated by the calculation unit becomes smaller, or rates the importance of the first keyword higher as the absolute value of the change between the similarity between the clauses and the similarity between the dependency relations of the clauses calculated by the calculation unit becomes smaller.

A similarity evaluation device including:
a calculation unit that, among the combinations of a word contained in an input first sentence and a word contained in an input second sentence, takes the word of the first sentence in the combination with the highest similarity as a first keyword and the word of the second sentence in that combination as a second keyword, and calculates at least two of the similarity between the keywords, the similarity between clauses containing the keywords, and the similarity between dependency relations of the clauses containing the keywords; and
an evaluation unit that evaluates the first sentence and the second sentence as more similar as the absolute value of the change between the at least two similarities calculated by the calculation unit becomes smaller.

The similarity evaluation device according to claim 3, wherein the evaluation unit evaluates the first sentence and the second sentence as more similar as the absolute value of the change between the similarity between the keywords and the similarity between the clauses calculated by the calculation unit becomes smaller, or evaluates the first sentence and the second sentence as more similar as the absolute value of the change between the similarity between the clauses and the similarity between the dependency relations of the clauses calculated by the calculation unit becomes smaller.

The similarity evaluation device according to claim 4, wherein the calculation unit, for each word included in the input first sentence, takes that word as the first keyword and the word of the second sentence having the highest similarity to that word as the second keyword, and calculates the similarity between the keywords, the similarity between the clauses including the keywords, and the similarity between the dependencies of the clauses including the keywords, and
the evaluation unit calculates, for each word included in the first sentence, a score based on the similarity between the first keyword and the second keyword, the absolute value of the change value between the similarity between the keywords and the similarity between the clauses, and the absolute value of the change value between the similarity between the clauses and the similarity between the dependencies of the clauses, and evaluates whether the first sentence and the second sentence are similar based on the average value of the scores.
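
As a non-limiting illustration, the per-word scoring and averaging recited above might be sketched as follows. The similarity values are assumed to be precomputed; the function names, the score formula (keyword similarity minus the two absolute change values), and the threshold of 0.5 are assumptions made only for this example and are not taken from the claim.

```python
from statistics import mean


def word_score(sim_keyword, sim_clause, sim_dependency):
    """Assumed per-word score: the keyword similarity, discounted by the
    two absolute change values between successive similarity levels."""
    return (
        sim_keyword
        - abs(sim_clause - sim_keyword)
        - abs(sim_dependency - sim_clause)
    )


def sentence_score(candidates):
    """candidates[i] lists one (sim_keyword, sim_clause, sim_dependency)
    triple per word of the second sentence, all measured against the
    i-th word of the first sentence."""
    scores = []
    for triples in candidates:
        # Second keyword: the word with the highest keyword-level similarity.
        best = max(triples, key=lambda t: t[0])
        scores.append(word_score(*best))
    # The sentence-level score is the average of the per-word scores.
    return mean(scores)


def is_similar(candidates, threshold=0.5):
    """Hypothetical decision rule on the averaged score."""
    return sentence_score(candidates) >= threshold
```

A call such as `is_similar([[(0.9, 0.9, 0.9), (0.2, 0.1, 0.0)], [(0.5, 0.5, 0.5)]])` first selects the best-matching triple for each word of the first sentence and then averages the resulting scores.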

A search device comprising:
a storage unit that stores, for each of a plurality of search target sentences prepared in advance, a keyword vector representing each keyword included in the search target sentence, a phrase vector representing, for each keyword, the clause including the keyword, and a dependency vector representing, for each keyword, a dependency relationship including the dependency destination of the clause including the keyword;
a calculation unit that, for each of the plurality of search target sentences, takes as a first keyword the keyword of an input query sentence, and as a second keyword the keyword of the search target sentence, in the combination having the highest similarity among the combinations of a keyword included in the query sentence and a keyword included in the search target sentence, and calculates at least two of: a similarity between the keywords based on the keyword vector, a similarity between the clauses including the keywords based on the phrase vector, and a similarity between the dependencies of the clauses including the keywords based on the dependency vector;
an evaluation unit that, for each of the plurality of search target sentences, evaluates the query sentence and the search target sentence as more similar the smaller the absolute value of the change value of the at least two similarities calculated by the calculation unit is; and
a search unit that searches for a search target sentence similar to the query sentence based on the evaluation result of the evaluation unit.
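
As a non-limiting illustration, the storage, calculation, evaluation, and search units recited above might be sketched as follows. The dictionary layout of the stored vectors, the cosine similarity, the summed absolute change used for ranking, and the names `change_for_target` and `search` are assumptions made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def change_for_target(query_words, target_words):
    """Each word is a dict with 'keyword', 'phrase', and 'dependency' vectors.
    Returns the summed absolute change across the three similarity levels
    for the best-matching keyword pair."""
    q, t = max(
        ((q, t) for q in query_words for t in target_words),
        key=lambda pair: cosine(pair[0]["keyword"], pair[1]["keyword"]),
    )
    sim_keyword = cosine(q["keyword"], t["keyword"])
    sim_phrase = cosine(q["phrase"], t["phrase"])
    sim_dependency = cosine(q["dependency"], t["dependency"])
    return abs(sim_phrase - sim_keyword) + abs(sim_dependency - sim_phrase)


def search(query_words, index, top_n=1):
    """index maps a sentence id to its stored word vectors (the storage
    unit). Sentences with a smaller change are ranked as more similar."""
    ranked = sorted(index, key=lambda sid: change_for_target(query_words, index[sid]))
    return ranked[:top_n]


# Toy data: "s1" agrees with the query at every level, "s2" only at keyword level.
index = {
    "s1": [{"keyword": [1, 0], "phrase": [1, 0], "dependency": [1, 0]}],
    "s2": [{"keyword": [1, 0], "phrase": [0, 1], "dependency": [0, 1]}],
}
query = [{"keyword": [1, 0], "phrase": [1, 0], "dependency": [1, 0]}]
result = search(query, index)
```

Because the vectors of the search target sentences are stored in advance, only the query-side vectors need to be produced at search time in this sketch.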

A keyword evaluation method in which a computer executes:
a step of calculating, based on a first keyword extracted from a first sentence input via an input unit and a second keyword that is similar to the first keyword in a second sentence, the second sentence being a sentence that is input via the input unit together with the first sentence and is stored in a storage device in association with the first sentence, at least two of: a similarity between the keywords, a similarity between the clauses including the keywords, and a similarity between the dependencies of the clauses including the keywords; and
a step of displaying an evaluation result in which the importance of the first keyword is evaluated as higher the smaller the absolute value of the change value of the at least two similarities is.

A sentence similarity evaluation method in which a computer executes:
a step of taking, among the combinations of a word included in a first sentence input via an input unit and a word included in a second sentence, the second sentence being a sentence that is input via the input unit together with the first sentence and is stored in a storage device in association with the first sentence, the word of the first sentence in the combination having the highest similarity as a first keyword and the word of the second sentence in that combination as a second keyword, and calculating at least two of: a similarity between the keywords, a similarity between the clauses including the keywords, and a similarity between the dependencies of the clauses including the keywords; and
a step of evaluating the first sentence and the second sentence as more similar the smaller the absolute value of the change value of the at least two similarities is, and displaying the evaluation result on a display device.

A sentence search method in which a computer executes:
a step of generating, for each of a plurality of search target sentences prepared in advance, a keyword vector representing each keyword included in the search target sentence, a phrase vector representing, for each keyword, the clause including the keyword, and a dependency vector representing, for each keyword, a dependency relationship including the dependency destination of the clause including the keyword, and storing the keyword vector, the phrase vector, and the dependency vector generated for each of the plurality of search target sentences in a storage device in association with one another;
a step of, for each of the plurality of search target sentences, taking as a first keyword the keyword of a query sentence input via an input unit, and as a second keyword the keyword of the search target sentence, in the combination having the highest similarity among the combinations of a keyword included in the query sentence and a keyword included in the search target sentence, calculating at least two of: a similarity between the keywords based on the keyword vector, a similarity between the clauses including the keywords based on the phrase vector, and a similarity between the dependencies of the clauses including the keywords based on the dependency vector, and storing the calculated at least two similarities in the storage device in association with each of the plurality of search target sentences;
a step of evaluating, for each of the plurality of search target sentences, the query sentence and the search target sentence as more similar the smaller the absolute value of the change value of the calculated at least two similarities is; and
a step of searching for a search target sentence similar to the query sentence based on the evaluation, and displaying the search target sentence similar to the query sentence on a display device.

A program for causing a computer to function as each unit of the keyword evaluation device according to claim 1 or claim 2.

A program for causing a computer to function as each unit of the similarity evaluation device according to any one of claims 3 to 5.

A program for causing a computer to function as each unit of the search device according to claim 6.