JP6955963B2

JP6955963B2 - Search device, similarity calculation method, and program

Info

Publication number: JP6955963B2
Application number: JP2017210819A
Authority: JP
Inventors: 中島　章
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2017-10-31
Filing date: 2017-10-31
Publication date: 2021-10-27
Anticipated expiration: 2037-10-31
Also published as: JP2019082931A

Description

The present invention relates to a search device, a similarity calculation method, and a program.

Patent Document 1 discloses a technique for extracting the difference between two sentences: each sentence is converted into a feature vector and the similarity is calculated, after which syntactic analysis is used to exclude the words whose dependency relationships and the like match.

Japanese Patent No. 5367099

When two sentences are compared by calculating the similarity between their sentence feature vectors, as described in Patent Document 1, the similarity becomes high whenever mutually similar words appear in both sentences. However, similar words do not necessarily play the same role in the two sentences. For example, in the first sentence "A boy has a small dictionary" the word "small" modifies "dictionary", whereas in the second sentence "A small boy has a dictionary" the word "small" modifies "boy". Because the words appearing in the two sentences are nevertheless identical, the similarity of the sentence feature vectors is evaluated as high.
An object of the present invention is to provide a search device, a similarity calculation method, and a program that calculate the similarity while taking into account the roles of the words appearing in the two sentences.
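The problem described above can be reproduced with a very small sketch. The bag-of-words representation and the function names `bag_of_words` and `cosine` below are assumptions chosen for illustration; they are not the feature vectors of Patent Document 1.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    # Count each lower-cased token; word order and word roles are discarded.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

first = "A boy has a small dictionary"
second = "A small boy has a dictionary"
similarity = cosine(bag_of_words(first), bag_of_words(second))
print(similarity)  # close to 1.0 although "small" modifies different nouns
```

Both sentences produce the same word counts, so a comparison of this kind cannot tell them apart.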

According to a first aspect of the present invention, a similarity specifying device (100) includes: a feature amount specifying unit (105) that specifies a feature amount for each word constituting a first sentence and a second sentence; a role specifying unit (106) that specifies the grammatical role of each word constituting the first sentence and the second sentence; an inter-word similarity specifying unit (108) that specifies an inter-word similarity, which is the similarity between the feature amount of a word constituting the first sentence and the feature amount of a word of the second sentence whose role is common to that of the word of the first sentence; and an inter-sentence similarity specifying unit (109) that specifies an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity of each word. The similarity specifying device can thereby specify the similarity while taking into account the roles of the words appearing in the two sentences.

According to a second aspect of the present invention, the inter-sentence similarity specifying unit of the similarity specifying device according to the first aspect may specify the inter-sentence similarity by calculating a weighted sum of the inter-word similarities using weighting coefficients that depend on the roles of the words. The similarity specifying device can thereby specify the similarity between two sentences while taking into account how important each word is within its sentence.

According to a third aspect of the present invention, the inter-word similarity specifying unit of the similarity specifying device according to the first or second aspect may set the inter-word similarity to a predetermined penalty value when the second sentence contains no word having the same role as a word constituting the first sentence. The similarity specifying device can thereby calculate a low inter-sentence similarity when comparing sentences that differ in structure.

According to a fourth aspect of the present invention, the feature amount in the similarity specifying device according to any one of the first to third aspects may be a vector whose number of dimensions is smaller than the number of words. The similarity specifying device can thereby calculate the similarity between words that are written differently.

According to a fifth aspect of the present invention, the role specifying unit of the similarity specifying device according to any one of the first to fourth aspects may specify the role of each word by phrase structure analysis.

According to a sixth aspect of the present invention, a similarity specifying method includes: a step in which a computer specifies a feature amount for each word constituting a first sentence and a second sentence; a step in which the computer specifies the grammatical role of each word constituting the first sentence and the second sentence; a step in which the computer specifies an inter-word similarity, which is the similarity between the feature amount of a word constituting the first sentence and the feature amount of a word of the second sentence having the same role as that word; and a step in which the computer specifies an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity of each word.

According to a seventh aspect of the present invention, a program causes a computer to execute: a step of specifying a feature amount for each word constituting a first sentence and a second sentence; a step of specifying the grammatical role of each word constituting the first sentence and the second sentence; a step of specifying an inter-word similarity, which is the similarity between the feature amount of a word constituting the first sentence and the feature amount of a word of the second sentence having the same role as that word; and a step of specifying an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity of each word.

According to at least one of the above aspects, the similarity specifying device can calculate the similarity while taking into account the roles of the words appearing in the two sentences.

FIG. 1 is a schematic block diagram showing the configuration of the search device according to the first embodiment.
FIG. 2 is a diagram showing an example of a method of specifying grammatical functions by phrase structure analysis.
FIG. 3 is a diagram showing an example of a method of specifying grammatical functions by dependency analysis.
FIG. 4 is a diagram showing an example of a method of specifying corresponding parts of two sentences.
FIG. 5 is a flowchart showing the operation of the search device according to the first embodiment.
FIG. 6 is a schematic block diagram showing the configuration of the search device according to the second embodiment.
FIG. 7 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.

〈First Embodiment〉
《Configuration of the Search Device》
FIG. 1 is a schematic block diagram showing the configuration of the search device according to the first embodiment.
The search device 100 according to the first embodiment receives the input of a sentence and searches a plurality of sentences for those similar to the input sentence. The search device 100 is an example of a similarity specifying device that specifies the similarity between two sentences.
The search device 100 includes a conversion model generation unit 101, a conversion model storage unit 102, a sentence storage unit 103, a sentence acquisition unit 104, a feature amount specifying unit 105, a grammatical function specifying unit 106, a corresponding part specifying unit 107, an inter-word similarity specifying unit 108, an inter-sentence similarity specifying unit 109, and a search result output unit 110.

The conversion model generation unit 101 generates, from each of a plurality of learning sentences, a sentence vector whose number of dimensions equals the vocabulary size, and based on these sentence vectors generates a conversion model for producing feature vectors that represent the features of words. A "word" here is not limited to a single word and may be a compound word or a phrase. The learning sentences may be the search target sentences stored in the sentence storage unit 103, or they may be other sentences. When the search target sentences are used as learning sentences, the conversion model generation unit 101 can create a conversion model specialized for the expressions used in the search target sentences. When sentences other than the search target sentences are included in the learning sentences, the conversion model generation unit 101 can create a conversion model that also handles expressions not used in the search target sentences.

For example, the conversion model generation unit 101 can generate the conversion model using an autoencoder. An autoencoder is a neural network in which the input layer and the output layer have the same number of nodes and the intermediate layer has fewer nodes than the input and output layers. The conversion model generation unit 101 inputs each sentence vector into the autoencoder and trains the autoencoder so that its output becomes equal to its input. The conversion model generation unit 101 then extracts the input layer and the intermediate layer of the autoencoder, thereby generating a conversion model whose feature vector is the output of the intermediate layer. Word2Vec (https://code.google.com/archive/p/word2vec/), for example, can be used to create a conversion model using an autoencoder.
The conversion model generation unit 101 may also create the conversion model based on tf-idf (term frequency-inverse document frequency), latent semantic analysis, or principal component analysis.
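As a non-limiting illustration of the autoencoder approach, the following sketch trains a small linear autoencoder by full-batch gradient descent in plain Python and keeps only the encoder half. It is not Word2Vec; the function names, the learning rate, the number of steps, and the toy binary sentence vectors are all assumptions made for this example.

```python
import random

def train_autoencoder(vectors, hidden_size, steps=2000, learning_rate=0.02, seed=0):
    """Train a linear autoencoder; return (encoder, initial_loss, final_loss)."""
    rng = random.Random(seed)
    n_in = len(vectors[0])
    # encoder: hidden_size x n_in, decoder: n_in x hidden_size
    encoder = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(hidden_size)]
    decoder = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(n_in)]

    def forward(x):
        hidden = [sum(w * xi for w, xi in zip(row, x)) for row in encoder]
        output = [sum(w * h for w, h in zip(row, hidden)) for row in decoder]
        return hidden, output

    def mean_loss():
        # Mean squared reconstruction error over all training vectors.
        total = 0.0
        for x in vectors:
            _, output = forward(x)
            total += sum((o - xi) ** 2 for o, xi in zip(output, x))
        return total / len(vectors)

    initial_loss = mean_loss()
    scale = learning_rate / len(vectors)
    for _ in range(steps):
        grad_enc = [[0.0] * n_in for _ in range(hidden_size)]
        grad_dec = [[0.0] * hidden_size for _ in range(n_in)]
        for x in vectors:
            hidden, output = forward(x)
            error = [o - xi for o, xi in zip(output, x)]
            # Error propagated back to the intermediate layer.
            back = [sum(decoder[i][j] * error[i] for i in range(n_in))
                    for j in range(hidden_size)]
            for i in range(n_in):
                for j in range(hidden_size):
                    grad_dec[i][j] += error[i] * hidden[j]
            for j in range(hidden_size):
                for k in range(n_in):
                    grad_enc[j][k] += back[j] * x[k]
        for i in range(n_in):
            for j in range(hidden_size):
                decoder[i][j] -= scale * grad_dec[i][j]
        for j in range(hidden_size):
            for k in range(n_in):
                encoder[j][k] -= scale * grad_enc[j][k]
    return encoder, initial_loss, mean_loss()

def encode(x, encoder):
    # The conversion model: input layer plus intermediate layer only.
    return [sum(w * xi for w, xi in zip(row, x)) for row in encoder]

# Vocabulary order (assumed): a, boy, has, small, dictionary, girl
sentence_vectors = [
    [1, 1, 1, 1, 1, 0],  # "A boy has a small dictionary"
    [1, 0, 1, 0, 1, 1],  # "A girl has a dictionary"
    [1, 1, 0, 1, 0, 0],  # "A small boy"
]
encoder, initial_loss, final_loss = train_autoencoder(sentence_vectors, hidden_size=2)
feature = encode([0, 1, 0, 0, 0, 0], encoder)  # feature vector for "boy"
```

The returned `encoder` plays the role of the conversion model: it maps a vector of vocabulary-size dimensions to a shorter feature vector.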

The conversion model storage unit 102 stores the conversion model generated by the conversion model generation unit 101.

The sentence storage unit 103 stores a plurality of search target sentences.
The sentence acquisition unit 104 acquires the query sentence input by the user and the search target sentences stored in the sentence storage unit 103.

The feature amount specifying unit 105 generates a feature vector for each of the words constituting a sentence acquired by the sentence acquisition unit 104, using the conversion model stored in the conversion model storage unit 102. For example, the feature amount specifying unit 105 can generate the feature vectors as follows. The feature amount specifying unit 105 divides the sentence acquired by the sentence acquisition unit 104 into a plurality of words; morphological analysis can be used for this division. From each of the resulting words, the feature amount specifying unit 105 generates a word vector whose number of dimensions equals the vocabulary size. The feature amount specifying unit 105 then obtains the feature vector by inputting the word vector into the conversion model.
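The processing of the feature amount specifying unit 105 can be sketched as follows. Whitespace splitting stands in for morphological analysis, and the vocabulary and the `encoder_weights` matrix are hypothetical values standing in for a trained conversion model.

```python
def one_hot(word, vocabulary):
    # Word vector with as many dimensions as the vocabulary has entries.
    return [1.0 if word == entry else 0.0 for entry in vocabulary]

def to_feature_vector(word_vector, encoder_weights):
    # Apply the conversion model (one row of weights per intermediate node).
    return [sum(w * x for w, x in zip(row, word_vector)) for row in encoder_weights]

vocabulary = ["a", "boy", "has", "small", "dictionary"]
encoder_weights = [          # hypothetical 2 x 5 conversion model
    [0.1, 0.9, 0.2, 0.4, 0.7],
    [0.3, 0.1, 0.8, 0.5, 0.2],
]

words = "A boy has a small dictionary".lower().split()  # stand-in for morphological analysis
features = [to_feature_vector(one_hot(w, vocabulary), encoder_weights) for w in words]
```

Each word ends up with a two-dimensional feature vector, fewer dimensions than the five-entry vocabulary.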

The grammatical function specifying unit 106 parses the sentence acquired by the sentence acquisition unit 104 and specifies the grammatical function, that is, the role, of each of the words constituting the sentence. A grammatical function is a classification of the elements constituting a sentence (for example, words, phrases, and clauses) according to the relationship each element has with other elements. Examples of grammatical functions include the distinction between noun phrases (NP), verb phrases (VP), and adjective phrases (phrase structure); the distinction between subject, predicate, object, complement, and modifier (dependency relationship); and the distinction between parts of speech. Examples of parsing include phrase structure analysis and dependency analysis. The grammatical function specifying unit 106 is an example of a role specifying unit. In other embodiments, the grammatical role of each word constituting a sentence may be information other than a grammatical function.

FIG. 2 is a diagram showing an example of a method of specifying grammatical functions by phrase structure analysis.
For example, the grammatical function specifying unit 106 performs phrase structure analysis on the sentence acquired by the sentence acquisition unit 104 and generates a phrase structure tree whose elements are the words constituting the sentence. A phrase structure tree is tree-structured data in which the sentence is the root node, clauses or phrases are internal nodes, and each word is a leaf node. Through phrase structure analysis, each node is given a tag indicating the grammatical function of that node in the phrase structure. For each leaf node of the phrase structure tree, the grammatical function specifying unit 106 specifies the sequence of all tags on the path connecting the root node and that leaf node as the grammatical function of the word at that leaf node.
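The root-to-leaf tag sequence can be computed with a short recursive traversal. In the sketch below the tree is written by hand rather than produced by a parser, and the part-of-speech tags NN, VBZ, and JJ are assumed Penn Treebank style labels; only S, NP, VP, and DT appear in the description.

```python
def leaf_paths(node, prefix=()):
    """Yield (word, tag_path) for every leaf of a (tag, children) tree."""
    tag, children = node
    path = prefix + (tag,)
    if isinstance(children, str):
        # Leaf: "children" is the word itself.
        yield children, "-".join(path)
    else:
        for child in children:
            yield from leaf_paths(child, path)

# Hand-written phrase structure tree for "A boy has a small dictionary".
tree = ("S", [
    ("NP", [("DT", "A"), ("NN", "boy")]),
    ("VP", [
        ("VBZ", "has"),
        ("NP", [("DT", "a"), ("JJ", "small"), ("NN", "dictionary")]),
    ]),
])

functions = list(leaf_paths(tree))
for word, path in functions:
    print(word, path)
```

For the first word this yields the path S-NP-DT, and for "small" it yields S-VP-NP-JJ.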

FIG. 3 is a diagram showing an example of a method of specifying grammatical functions by dependency analysis.
As another example, the grammatical function specifying unit 106 performs dependency analysis on the sentence acquired by the sentence acquisition unit 104 and generates a dependency structure tree whose elements are the words constituting the sentence. A dependency structure tree is tree-structured data in which each word is a node and the node of a dependent word (depender) is a child of the node of the word it depends on (dependee). The branches connecting the nodes of the dependency structure tree may carry tags indicating the dependency relationship. Based on the dependency structure tree, the grammatical function specifying unit 106 specifies, as the grammatical function of each word, either the word it depends on (for example, the grammatical function of that word as specified by phrase structure analysis) or its dependency relationship. Examples of dependency relationships include det, which indicates the attachment of a determiner; nsubj, which indicates the subject of a predicate; dobj, which indicates the object of a predicate; and amod, which indicates a modifier of a noun.
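A dependency structure tree can be represented as a list of labelled head links, as in the following sketch. The parse is written by hand for the example sentence, and the `root` label for the predicate is an assumption; the description names only det, nsubj, dobj, and amod.

```python
from collections import defaultdict

# (index, word, head_index, relation); head_index None marks the root.
parse = [
    (0, "A", 1, "det"),
    (1, "boy", 2, "nsubj"),
    (2, "has", None, "root"),
    (3, "a", 5, "det"),
    (4, "small", 5, "amod"),
    (5, "dictionary", 2, "dobj"),
]

def build_children(parse):
    # Each dependent word becomes a child of the word it depends on.
    children = defaultdict(list)
    for index, _, head, _ in parse:
        if head is not None:
            children[head].append(index)
    return dict(children)

def grammatical_functions(parse):
    # Report each word with its dependency relation and the word it depends on.
    words = {index: word for index, word, _, _ in parse}
    return [
        (word, relation, words[head] if head is not None else None)
        for _, word, head, relation in parse
    ]

children = build_children(parse)
functions = grammatical_functions(parse)
```

Here "small" receives the relation amod with "dictionary" as the word it depends on; in "A small boy has a dictionary" the same word would depend on "boy" instead.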

FIG. 4 is a diagram showing an example of a method of specifying corresponding parts of two sentences.
The corresponding part specifying unit 107 specifies pairs, each consisting of a word of the query sentence and a word of the search target sentence, that share a grammatical function. That is, for each word of the query sentence, the corresponding part specifying unit 107 specifies a word pair by finding in the search target sentence a word whose grammatical function is the same as that of the query word. For example, as shown in FIG. 4, when the query sentence is "A boy has a small dictionary" and the search target sentence is "A small boy has a dictionary", the corresponding part specifying unit 107 specifies the word pairs as follows. The word "A" of the query sentence has the grammatical function of a word whose phrase structure is "S (sentence)-NP-DT (determiner)" and whose dependency relationship is "det". The corresponding part specifying unit 107 finds in the search target sentence a word having the grammatical function of a word whose phrase structure is "S-NP-DT" and whose dependency relationship is "det". As shown in FIG. 4, "A" of the query sentence has the same grammatical function as "A" of the search target sentence. The corresponding part specifying unit 107 performs this for every word and, as shown in FIG. 4, extracts the following as word pairs: "A" of the query sentence and "A" of the search target sentence; "boy" and "boy"; "has" and "has"; "a" and "a"; and "dictionary" and "dictionary". As also shown in FIG. 4, the corresponding part specifying unit 107 determines that the search target sentence contains no word sharing a grammatical function with "small" of the query sentence, and that the query sentence contains no word sharing a grammatical function with "small" of the search target sentence.
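The pairing of FIG. 4 can be sketched as a lookup by grammatical function. The function strings below combine a phrase-structure path and a dependency relation; the labels NN, VBZ, JJ, and root, and the first-match policy used when several candidates exist, are assumptions made for this example.

```python
def pair_by_function(query, target):
    """Pair words of two sentences, given as (word, function) lists, that share a function."""
    remaining = list(target)
    pairs, unmatched_query = [], []
    for word, function in query:
        match = next((t for t in remaining if t[1] == function), None)
        if match is None:
            unmatched_query.append(word)
        else:
            remaining.remove(match)  # each target word is paired at most once
            pairs.append((word, match[0]))
    unmatched_target = [word for word, _ in remaining]
    return pairs, unmatched_query, unmatched_target

query = [
    ("A", "S-NP-DT/det"), ("boy", "S-NP-NN/nsubj"), ("has", "S-VP-VBZ/root"),
    ("a", "S-VP-NP-DT/det"), ("small", "S-VP-NP-JJ/amod"),
    ("dictionary", "S-VP-NP-NN/dobj"),
]
target = [
    ("A", "S-NP-DT/det"), ("small", "S-NP-JJ/amod"), ("boy", "S-NP-NN/nsubj"),
    ("has", "S-VP-VBZ/root"), ("a", "S-VP-NP-DT/det"),
    ("dictionary", "S-VP-NP-NN/dobj"),
]

pairs, unmatched_query, unmatched_target = pair_by_function(query, target)
```

With these inputs, five pairs are found and "small" remains unpaired on both sides.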

The inter-word similarity specifying unit 108 specifies, for each pair specified by the corresponding part specifying unit 107, the inter-word similarity, which is the similarity between the feature vectors of the pair. Examples of the inter-word similarity include the cosine similarity of the feature vectors, the Euclidean distance, and the Levenshtein distance. For a word for which the corresponding part specifying unit 107 specified no pair, the inter-word similarity specifying unit 108 uses a predetermined penalty value as the inter-word similarity. The penalty value is set to, for example, a value of 0 or less.
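A minimal sketch of this step, using cosine similarity for paired words and a penalty for unpaired ones. The penalty of -1.0, the two-dimensional feature vectors, and the function names are assumptions for illustration.

```python
import math

PENALTY = -1.0  # assumed; the description only requires, for example, a value of 0 or less

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def inter_word_similarities(pairs, unmatched_count, features):
    # One similarity per pair, then one penalty per word that has no partner.
    similarities = [cosine_similarity(features[q], features[t]) for q, t in pairs]
    similarities.extend([PENALTY] * unmatched_count)
    return similarities

features = {"boy": [0.2, 1.0], "child": [0.3, 0.9], "dictionary": [0.9, 0.1]}
similarities = inter_word_similarities(
    pairs=[("boy", "child"), ("dictionary", "dictionary")],
    unmatched_count=2,
    features=features,
)
```

If a distance such as the Euclidean distance were used instead, it would have to be converted so that larger values mean greater similarity before being combined with the penalty.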

The inter-sentence similarity specifying unit 109 specifies the inter-sentence similarity, which is the similarity between the query sentence and the search target sentence, based on the inter-word similarity of each word pair. For example, the inter-sentence similarity specifying unit 109 uses the average or the sum of the inter-word similarities as the inter-sentence similarity.

The search result output unit 110 outputs, as the search result, a search target sentence having a high inter-sentence similarity with the query sentence. For example, the search result output unit 110 outputs the search target sentence having the highest inter-sentence similarity with the query sentence. In other embodiments, the search result output unit 110 may, for example, output a plurality of search target sentences whose inter-sentence similarity is equal to or greater than a predetermined threshold, or output a plurality of search target sentences arranged in descending order of inter-sentence similarity. The search result is output by, for example, displaying it on a display, recording it on a storage medium, or transmitting it to an external device.
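The three output variants described above can be sketched as follows, assuming the inter-sentence similarities have already been collected in a dictionary. The scores and sentence labels are made-up values.

```python
def best_match(scores):
    # Single sentence with the highest inter-sentence similarity.
    return max(scores, key=scores.get)

def above_threshold(scores, threshold):
    # All sentences whose similarity is at least the threshold.
    return [sentence for sentence, score in scores.items() if score >= threshold]

def ranked(scores):
    # All sentences in descending order of similarity.
    return sorted(scores, key=scores.get, reverse=True)

scores = {"sentence 1": 0.71, "sentence 2": 1.0, "sentence 3": 0.4}
top = best_match(scores)
hits = above_threshold(scores, 0.7)
ordering = ranked(scores)
```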

《Operation of the Search Device》
Before executing the sentence search process, the conversion model generation unit 101 of the search device 100 generates a conversion model from a plurality of learning sentences in advance and records it in the conversion model storage unit 102.

FIG. 5 is a flowchart showing the operation of the search device according to the first embodiment.
The search device 100 receives the input of a query sentence from the user. The sentence acquisition unit 104 acquires the input query sentence (step S1). Next, the feature amount specifying unit 105 generates a feature vector for each of the words constituting the query sentence, using the conversion model stored in the conversion model storage unit 102 (step S2). Next, the grammatical function specifying unit 106 parses the query sentence and specifies the grammatical function of each of the words constituting the query sentence (step S3).

Next, the search device 100 selects the search target sentences stored in the sentence storage unit 103 one at a time and executes the processing of steps S5 to S11 below for each (step S4).
The sentence acquisition unit 104 acquires the selected search target sentence from the sentence storage unit 103 (step S5). Next, the feature amount specifying unit 105 generates a feature vector for each of the words constituting the search target sentence, using the conversion model stored in the conversion model storage unit 102 (step S6). Next, the grammatical function specifying unit 106 parses the search target sentence and specifies the grammatical function of each of the words constituting the search target sentence (step S7).

Next, the corresponding part specifying unit 107 specifies the pairs of words that share a grammatical function, based on the grammatical functions of the words of the query sentence specified in step S3 and the grammatical functions of the words of the search target sentence specified in step S7 (step S8). The inter-word similarity specifying unit 108 specifies the inter-word similarity for each pair specified by the corresponding part specifying unit 107 (step S9). For each word for which the corresponding part specifying unit 107 specified no pair, the inter-word similarity specifying unit 108 uses the predetermined penalty value as the inter-word similarity (step S10). The inter-sentence similarity specifying unit 109 specifies the inter-sentence similarity between the query sentence and the search target sentence by calculating the average of the inter-word similarities specified in steps S9 and S10 (step S11).

Once the inter-sentence similarity has been calculated for all search target sentences stored in the sentence storage unit 103, the search result output unit 110 outputs, as the search result, the search target sentence having the highest inter-sentence similarity with the query sentence (step S12).
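Steps S4 to S12 can be sketched end to end as follows. To keep the example self-contained, the sentences are supplied already annotated with grammatical functions (the result of steps S3 and S7), and the `FEATURES` table stands in for the conversion model; these values, the penalty of 0.0, and the choice to penalize unpaired words of both sentences are assumptions.

```python
import math

PENALTY = 0.0  # assumed; the description gives "0 or less" as an example

# Hypothetical feature vectors standing in for the conversion model output.
FEATURES = {
    "a": [1.0, 0.1], "boy": [0.2, 1.0], "has": [0.7, 0.7],
    "small": [0.9, 0.4], "dictionary": [0.3, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sentence_similarity(query, target):
    """query and target are lists of (word, grammatical_function) tuples."""
    remaining = list(target)
    similarities = []
    for word, function in query:
        match = next((t for t in remaining if t[1] == function), None)  # S8
        if match is None:
            similarities.append(PENALTY)                                # S10
        else:
            remaining.remove(match)
            similarities.append(
                cosine(FEATURES[word.lower()], FEATURES[match[0].lower()])  # S9
            )
    similarities.extend([PENALTY] * len(remaining))                     # S10
    return sum(similarities) / len(similarities) if similarities else 0.0  # S11

def search(query, corpus):
    scores = [sentence_similarity(query, target) for target in corpus]  # S4 to S11
    best = max(range(len(corpus)), key=scores.__getitem__)              # S12
    return best, scores

QUERY = [("A", "S-NP-DT"), ("boy", "S-NP-NN"), ("has", "S-VP-VBZ"),
         ("a", "S-VP-NP-DT"), ("small", "S-VP-NP-JJ"), ("dictionary", "S-VP-NP-NN")]
CORPUS = [
    # "A small boy has a dictionary"
    [("A", "S-NP-DT"), ("small", "S-NP-JJ"), ("boy", "S-NP-NN"),
     ("has", "S-VP-VBZ"), ("a", "S-VP-NP-DT"), ("dictionary", "S-VP-NP-NN")],
    # "A boy has a small dictionary"
    [("A", "S-NP-DT"), ("boy", "S-NP-NN"), ("has", "S-VP-VBZ"),
     ("a", "S-VP-NP-DT"), ("small", "S-VP-NP-JJ"), ("dictionary", "S-VP-NP-NN")],
]

best_index, scores = search(QUERY, CORPUS)
```

The sentence with "small" attached to "boy" scores lower than the identical sentence because its two unpaired words each contribute the penalty to the average.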

《Action and Effect》
As described above, the search device 100 according to the first embodiment specifies the inter-word similarity by comparing the feature vector of a word constituting the query sentence with the feature vector of a word of the search target sentence that shares a grammatical function with that query word, and specifies the inter-sentence similarity from the inter-word similarities. The search device 100 can thereby search for similar sentences while taking into account the roles of the words appearing in the two sentences.

Consider, as an example, the comparison between the query sentence "A boy has a small dictionary" and the search target sentence "A small boy has a dictionary". In a comparison based on a simple vector space model, the similarity is calculated without distinguishing the word "small" of the query sentence from the word "small" of the search target sentence, so the similarity is evaluated as high. In contrast, as shown in FIG. 4, the search device 100 according to the first embodiment evaluates the word "small" of the query sentence and the word "small" of the search target sentence separately, because their grammatical functions differ. The search device 100 according to the first embodiment can therefore evaluate the inter-sentence similarity as lower than the simple vector space model does. In this way, the search device 100 according to the first embodiment can distinguish words that are written identically but play different roles in their sentences when calculating the similarity.

In addition, the search device 100 according to the first embodiment sets the inter-word similarity to a predetermined penalty value when the other sentence contains no word sharing a grammatical function with a word constituting one sentence. The search device 100 can thereby calculate a low inter-sentence similarity when comparing sentences that differ in structure.

The feature vector according to the first embodiment is a vector whose number of dimensions is smaller than the vocabulary size. In other words, the feature vector is obtained by reducing the dimensions of the word vector, whose number of dimensions equals the vocabulary size. The search device 100 can thereby calculate the similarity between words that are written differently.

〈Second Embodiment〉
The inter-sentence similarity specifying unit 109 of the search device 100 according to the first embodiment calculates the inter-sentence similarity based on the average or the sum of the inter-word similarities of the word pairs.
However, the words constituting a sentence differ in their importance to the meaning of the sentence. For example, the presence or absence of the article "a" may change depending on whether a word is singular or plural, but in many cases the presence or absence of the article carries little meaning in the sentence. The presence or absence of the adverb "not", on the other hand, is often significant. The search device 100 according to the second embodiment therefore calculates the inter-sentence similarity based on a weighted average or a weighted sum of the inter-word similarities. Because a weighted average is the weighted sum divided by the number of elements, specifying the inter-sentence similarity by calculating a weighted sum of the inter-word similarities includes the case in which the search device 100 uses the weighted average of the inter-word similarities as the inter-sentence similarity.

<<Configuration of search device>>
FIG. 6 is a schematic block diagram showing the configuration of the search device according to the second embodiment.
The search device 100 according to the second embodiment includes a coefficient specifying unit 111 in addition to the configuration of the first embodiment.
The coefficient specifying unit 111 specifies, for each word pair specified by the corresponding location specifying unit 107, a weighting coefficient according to the grammatical function of the words. The weighting coefficient may be determined, for example, by the part of speech of the word, by the distinction between noun phrase and verb phrase, or by the part of speech of the word on which the word depends.

The inter-sentence similarity specifying unit 109 calculates the inter-sentence similarity by multiplying the inter-word similarity of each word pair specified by the inter-word similarity specifying unit 108 by the weighting coefficient that the coefficient specifying unit 111 specified for that word pair, and then taking the average.
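The weighted calculation can be sketched as below. The part-of-speech weight table and the negative penalty of -1.0 are assumptions made for illustration; the source specifies neither. As in the description, the weighted sum is divided by the number of word pairs rather than by the sum of the weights.

```python
# Hypothetical weighting coefficients keyed by part of speech.
ROLE_WEIGHTS = {"article": 0.1, "noun": 1.0, "verb": 1.0, "adverb": 2.0}

def weighted_inter_sentence_similarity(pairs, weights=ROLE_WEIGHTS,
                                       default_weight=1.0):
    """pairs: list of (part_of_speech, inter_word_similarity) per word pair.

    Multiplies each inter-word similarity by its weighting coefficient and
    divides the weighted sum by the number of pairs.
    """
    if not pairs:
        return 0.0
    weighted_sum = sum(weights.get(pos, default_weight) * sim
                       for pos, sim in pairs)
    return weighted_sum / len(pairs)

PENALTY = -1.0  # assumed penalty for a word with no counterpart

# The other sentence lacks the article "a": small effect on the score.
missing_article = [("article", PENALTY), ("noun", 1.0), ("verb", 1.0)]
# The other sentence lacks the adverb "not": large effect on the score.
missing_not = [("adverb", PENALTY), ("noun", 1.0), ("verb", 1.0)]

score_article = weighted_inter_sentence_similarity(missing_article)
score_not = weighted_inter_sentence_similarity(missing_not)
print(score_article, score_not)
```

A negative penalty is used here because a penalty of 0 multiplied by any weight remains 0, which would hide the effect of the coefficients. Under these assumptions a missing “not” lowers the score far more than a missing article.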

In this way, according to the second embodiment, the search device 100 can specify the similarity between two sentences while taking into account the importance of each word within its sentence.

<Other embodiments>
Although one embodiment has been described in detail above with reference to the drawings, the specific configuration is not limited to the one described, and various design changes and the like are possible.
For example, the search device 100 according to the above-described embodiments specifies pairs of words whose grammatical functions match, but this is not a limitation. For example, the search device 100 according to another embodiment may ignore differences in part of speech or inflected form, may specify pairs of words whose grammatical functions partially match at a certain rate, or may specify pairs whose grammatical function similarity is equal to or greater than a predetermined threshold. A pair of words whose grammatical functions match, a pair whose grammatical functions partially match, and a pair whose grammatical function similarity is equal to or greater than a predetermined threshold are all examples of pairs of words with a common grammatical function.
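One way to picture threshold-based pairing is the sketch below. Representing a grammatical function as a dictionary of attributes (`pos`, `phrase`, `head_pos`), scoring it by the fraction of matching attributes, and the threshold of 0.6 are all hypothetical; the embodiment does not define how the similarity of grammatical functions is measured.

```python
def role_similarity(role_a, role_b):
    """Fraction of attributes on which two grammatical functions agree."""
    keys = set(role_a) | set(role_b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if role_a.get(k) == role_b.get(k))
    return matches / len(keys)

def find_partner(role, candidates, threshold=0.6):
    """Index of the best candidate whose role similarity is at least
    the threshold, or None if no candidate qualifies."""
    best_index, best_score = None, 0.0
    for i, cand in enumerate(candidates):
        score = role_similarity(role, cand)
        if score >= threshold and score > best_score:
            best_index, best_score = i, score
    return best_index

# Hypothetical grammatical functions of the words in the other sentence.
candidates = [
    {"pos": "noun", "phrase": "NP", "head_pos": "verb"},
    {"pos": "adjective", "phrase": "NP", "head_pos": "noun"},
]
# An adverb inside a noun phrase that depends on a noun: 2 of 3 attributes
# agree with the second candidate.
role = {"pos": "adverb", "phrase": "NP", "head_pos": "noun"}

print(find_partner(role, candidates))                 # 1
print(find_partner(role, candidates, threshold=0.9))  # None
```

Setting the threshold to 1.0 reduces this to exact matching, which corresponds to the behavior of the embodiments described earlier.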

Further, the search device 100 according to the above-described embodiments generates feature vectors and performs syntactic analysis for each search target sentence, but this is not a limitation. For example, in the search device 100 according to another embodiment, the sentence storage unit 103 may store in advance, in association with each search target sentence, the feature vector and grammatical function of each word in that sentence. That is, in another embodiment, feature vector generation and syntactic analysis are performed for the query sentence, while the feature vectors and grammatical functions of the search target sentences may be read from the sentence storage unit 103 at search time.
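The precomputation variant can be sketched as follows. `SentenceStore`, `search`, `toy_analyze`, and `toy_score` are hypothetical names; the toy analyzer uses word position as a stand-in for the grammatical function and the lowercased word as a stand-in for the feature vector, purely to keep the example self-contained.

```python
class SentenceStore:
    """Stores each search target sentence with its precomputed analysis."""

    def __init__(self, analyze):
        self._analyze = analyze
        self._records = {}

    def add(self, sentence):
        # Analysis happens once, at registration time.
        self._records[sentence] = self._analyze(sentence)

    def items(self):
        return self._records.items()

def search(query, store, analyze, score):
    """Analyze only the query; read stored analyses for the targets."""
    query_words = analyze(query)
    ranked = [(score(query_words, words), sentence)
              for sentence, words in store.items()]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked

calls = []  # records every sentence passed to the analyzer

def toy_analyze(sentence):
    """Stand-in analyzer: role = word position, feature = lowercased word."""
    calls.append(sentence)
    return [(i, w.lower()) for i, w in enumerate(sentence.split())]

def toy_score(first, second):
    """Fraction of first-sentence words with an identical same-role word."""
    second_by_role = dict(second)
    hits = sum(1 for role, w in first if second_by_role.get(role) == w)
    return hits / len(first) if first else 0.0

store = SentenceStore(toy_analyze)
store.add("A small boy has a dictionary")
store.add("A boy has a small dictionary")

results = search("A boy has a small dictionary", store, toy_analyze, toy_score)
print(results[0][1])  # A boy has a small dictionary
print(len(calls))     # 3: two at registration, one for the query
```

The trade-off is extra storage in exchange for analyzing only the query at search time.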

Further, the search device 100 according to the above-described embodiments specifies the grammatical function of a word using both the phrase structure analysis result and the dependency analysis result, but this is not a limitation. For example, the search device 100 according to another embodiment may specify the grammatical function using only the phrase structure analysis result, or using only the dependency analysis result.

Further, the search device 100 according to the above-described embodiments calculates the inter-word similarity for each word constituting a sentence, but other embodiments are not limited to this. For example, the search device 100 according to another embodiment may calculate the inter-word similarity for each phrase constituting a sentence.

Further, in the above-described embodiments, the similarity specifying device is applied to the search device 100, which searches a plurality of sentences for a sentence similar to the query sentence, but other embodiments are not limited to this. For example, the similarity specifying device according to another embodiment may calculate the similarity between two input sentences.

<Computer configuration>
FIG. 7 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.
The computer 90 includes a CPU 91, a main storage device 92, an auxiliary storage device 93, and an interface 94.
The search device 100 described above is implemented on the computer 90. The operation of each processing unit described above is stored in the auxiliary storage device 93 in the form of a program. The CPU 91 reads the program from the auxiliary storage device 93, loads it into the main storage device 92, and executes the above processing according to the program.

Examples of the auxiliary storage device 93 include an HDD (Hard Disk Drive), an SSD (Solid State Drive), a magnetic disk, a magneto-optical disk, a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), and a semiconductor memory. The auxiliary storage device 93 may be an internal medium directly connected to the bus of the computer 90, or an external medium connected to the computer 90 via the interface 94 or a communication line. When the program is distributed to the computer 90 via a communication line, the computer 90 that receives the distribution may load the program into the main storage device 92 and execute the above processing. In at least one embodiment, the auxiliary storage device 93 is a non-transitory tangible storage medium.

Further, the program may realize only some of the functions described above. In addition, the program may be a so-called difference file (difference program), which realizes the functions described above in combination with another program already stored in the auxiliary storage device 93.

100 Search device
101 Conversion model generation unit
102 Conversion model storage unit
103 Sentence storage unit
104 Sentence acquisition unit
105 Feature quantity specifying unit
106 Grammatical function specifying unit
107 Corresponding location specifying unit
108 Inter-word similarity specifying unit
109 Inter-sentence similarity specifying unit
110 Search result output unit
111 Coefficient specifying unit

Claims

A similarity specifying device comprising:
a feature quantity specifying unit that specifies a feature quantity for each word constituting a first sentence and a second sentence;
a role specifying unit that specifies the grammatical role of each word constituting the first sentence and the second sentence;
an inter-word similarity specifying unit that specifies an inter-word similarity, which is the similarity between the feature quantity of a word constituting the first sentence and the feature quantity of a word that constitutes the second sentence and has a role in common with the word constituting the first sentence; and
an inter-sentence similarity specifying unit that specifies an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity for each word,
wherein, when the second sentence has no word with a role in common with a word constituting the first sentence, the inter-word similarity specifying unit sets the inter-word similarity for that word of the first sentence to a predetermined penalty value.

The similarity specifying device according to claim 1, wherein the inter-sentence similarity specifying unit specifies the inter-sentence similarity by calculating a weighted sum of the inter-word similarities using weighting coefficients according to the roles of the words.

The similarity specifying device according to claim 1 or 2, wherein the feature quantity is a vector with fewer dimensions than the number of words in the vocabulary.

The similarity specifying device according to any one of claims 1 to 3, wherein the role specifying unit specifies a role for each word based on the result of parsing processing.

A similarity specifying method comprising:
a step in which a computer specifies a feature quantity for each word constituting a first sentence and a second sentence;
a step in which the computer specifies the grammatical role of each word constituting the first sentence and the second sentence;
a step in which the computer specifies an inter-word similarity, which is the similarity between the feature quantity of a word constituting the first sentence and the feature quantity of a word that constitutes the second sentence and has a role in common with the word constituting the first sentence;
a step in which, when the second sentence has no word with a role in common with a word constituting the first sentence, the computer sets the inter-word similarity for that word of the first sentence to a predetermined penalty value; and
a step in which the computer specifies an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity for each word.

A program for causing a computer to execute:
a step of specifying a feature quantity for each word constituting a first sentence and a second sentence;
a step of specifying the grammatical role of each word constituting the first sentence and the second sentence;
a step of specifying an inter-word similarity, which is the similarity between the feature quantity of a word constituting the first sentence and the feature quantity of a word that constitutes the second sentence and has a role in common with the word constituting the first sentence;
a step of setting, when the second sentence has no word with a role in common with a word constituting the first sentence, the inter-word similarity for that word of the first sentence to a predetermined penalty value; and
a step of specifying an inter-sentence similarity, which is the similarity between the first sentence and the second sentence, based on the inter-word similarity for each word.