JP5441760B2

JP5441760B2 - Inter-document distance calculator and sentence searcher

Info

Publication number: JP5441760B2
Application number: JP2010040578A
Authority: JP
Inventors: 崇志三上; 敬平野
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-02-25
Filing date: 2010-02-25
Publication date: 2014-03-12
Anticipated expiration: 2030-02-25
Also published as: JP2011175568A

Description

The present invention relates to calculating the similarity between documents. In particular, it relates to an inter-document distance calculator that converts the dependency structure of a sentence into a tree structure and uses the edit distance between tree structures for similarity determination, and to a text searcher that uses the inter-document distance calculator.

In conventional inter-document similarity calculators and document retrieval apparatuses that use search keys, the similarity between documents, or between a document and a search key, is calculated from the co-occurrence of words contained in the text. To examine word co-occurrence, the text is generally subjected to morphological analysis and divided into words or phrases (bunsetsu). The importance of each word is then obtained from statistical information such as frequency of occurrence, and documents that share important words are judged to be similar. However, in text that has several subjects, unrelated words may co-occur, and a correct similarity may not be obtained. Conventional techniques, for example Patent Document 1, address this by constructing a graph structure based on the relationships between words.

Patent Document 1: Japanese Patent No. 3577972

Because conventional methods basically depend on word co-occurrence to calculate similarity, they cannot accurately calculate the similarity of sentences that contain synonyms or different wordings. Synonyms can be handled to some extent with a synonym dictionary or the like, but such a dictionary has to be built.
In addition, even when sentences have no words in common, their meaning may be similar if their sentence structures are similar. For example, consider the following three sentences.
Sentence 1: "Typhoon No. 11 is expected to move through the waters near the Izu Islands and come closest to the Kanto region from evening to night." (「台風１１号は伊豆諸島近海を進み、夕方から夜にかけて関東地方に最も近づくとみられる。」)
Sentence 2: "Typhoon No. 12 is expected to move north off Tokyo Bay and come closest to Japan from the 25th to the 26th." (「台風１２号は東京湾沖を北上し、２５日から２６日にかけて日本に最接近とみられる。」)
Sentence 3: "Introducing fish of the waters near Japan that can be seen in Tokyo Bay." (「東京湾でみられる日本近海の魚を紹介します。」)

Sentences 1 and 2 share the words 「台風」 ("typhoon") and 「みられる」 ("is expected" or "can be seen"), while Sentences 2 and 3 share 「東京湾」 ("Tokyo Bay"), 「日本」 ("Japan"), and 「みられる」. In terms of meaning, however, Sentences 1 and 2 are more similar to each other than Sentences 2 and 3 are. Conventional methods that use the words themselves cannot group similar sentences according to this kind of similarity in sentence structure.
The present invention has been made to solve the above problem. By converting the dependency structure of a sentence into a tree structure and using the edit distance between tree structures for similarity determination, it enables similarity determination that takes into account both the commonality of the words that appear and the structure of the sentence.

The inter-document distance calculator according to the present invention comprises:
document input means for inputting documents between which a distance is to be calculated;
syntax analysis means for performing morphological analysis and dependency analysis on the character string of a document input by the document input means;
tree structure creating means for creating a tree structure with syntactic information from the parsing result of the syntax analysis means;
parallel node adding means for searching the nodes constituting the tree structure with syntactic information for nodes in a parallel relationship, and adding to the tree structure a parallel node that has those nodes as child nodes;
parallel node ordering means for ordering the nodes under the parallel node added by the parallel node adding means; and
distance calculation means for editing the tree structure with syntactic information, to which the parallel node with its ordered child nodes has been added, into the tree structure with syntactic information of another document, and calculating the edit distance according to a preset definition.

The text searcher according to the present invention comprises:
a tree structure set in which a plurality of tree structures with syntactic information are created in advance and stored, each created from the results of morphological analysis and dependency analysis of a document, with a parallel node added that has as child nodes those nodes of the tree that are in a parallel relationship, and with the nodes under the added parallel node ordered;
search sentence input means for inputting a search sentence;
syntax analysis means for performing morphological analysis and dependency analysis on the character string input by the search sentence input means;
tree structure creating means for creating a tree structure with syntactic information from the parsing result of the syntax analysis means;
parallel node adding means for searching the nodes constituting the tree structure with syntactic information for nodes in a parallel relationship, and adding to the tree structure a parallel node that has those nodes as child nodes;
parallel node ordering means for ordering the nodes under the parallel node added by the parallel node adding means;
distance calculation means for obtaining the edit distance between the tree structure with syntactic information, to which the parallel node with its ordered child nodes has been added, and each tree structure with syntactic information stored in the tree structure set; and
search result output means for sorting the set of edit distances obtained by the distance calculation means and outputting a predetermined number of results in ascending order.

The inter-document distance calculator according to the present invention creates a tree structure with syntactic information from the character string of a document that has undergone morphological analysis and dependency analysis, adds to the tree a parallel node that has as child nodes those nodes of the tree that are in a parallel relationship, and orders the nodes under the parallel node. It then calculates the edit distance between such trees according to a preset definition and uses that edit distance for similarity determination. This enables similarity determination that takes into account both the commonality of the words that appear and the structure of the sentence.

The text searcher according to the present invention has a tree structure set in which a plurality of tree structures with syntactic information are created and stored in advance. Each is created from the results of morphological analysis and dependency analysis of a sentence or document, has a parallel node added whose child nodes are the nodes in a parallel relationship, and has the nodes under that parallel node ordered.
The searcher creates a tree structure with syntactic information from the character string of a search sentence that has undergone morphological analysis and dependency analysis, adds a parallel node in the same way, and uses the distance calculation means to obtain the edit distance between this tree and each tree stored in the tree structure set.
By providing search result output means that sorts the set of edit distances obtained by the distance calculation means and outputs a predetermined number of results in ascending order, documents similar to the search sentence can be retrieved.

FIG. 1 is a basic configuration diagram according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the processing procedure of Embodiment 1 of the present invention.
FIG. 3 shows an example of a dependency analysis result produced by the syntax analysis means.
FIG. 4 shows an example of a tree structure with syntactic information produced by the tree structure creating means.
FIG. 5 shows an example of a tree structure with syntactic information that has a parallel or appositive structure.
FIG. 6 is a tree structure diagram with syntactic information showing an example of a parallel node.
FIG. 7 is a flowchart showing the procedure for adding a parallel node to a tree structure with syntactic information in Embodiment 1 of the present invention.
FIG. 8 illustrates step 210 of the edit distance calculation for tree structures with syntactic information in Embodiment 1.
FIG. 9 illustrates step 220 of the edit distance calculation for tree structures with syntactic information in Embodiment 1.
FIG. 10 illustrates step 230 of the edit distance calculation for tree structures with syntactic information in Embodiment 1.
FIG. 11 illustrates step 240 of the edit distance calculation for tree structures with syntactic information in Embodiment 1.
FIG. 12 illustrates step 250 of the edit distance calculation for tree structures with syntactic information in Embodiment 1.
FIG. 13 illustrates an example of the effect of parallel nodes in Embodiment 1.
FIG. 14 illustrates an example of the effect of parallel nodes in Embodiment 1.
FIG. 15 illustrates an example of node movement in Embodiment 1.
FIG. 16 is a configuration diagram of Embodiment 2 of the present invention.
FIG. 17 is a configuration diagram of Embodiment 3 of the present invention.
FIG. 18 is a configuration diagram of Embodiment 4 of the present invention.

Embodiment 1.
FIG. 1 is a block diagram showing the basic configuration according to Embodiment 1 of the present invention. In FIG. 1, the document input means 101 inputs the sentences or documents to be analyzed. The syntax analysis means 102 performs morphological analysis and dependency analysis on the character string input by the document input means 101. The tree structure creating means 103 creates a tree structure with syntactic information from the analysis result of the syntax analysis means 102. The parallel node adding means 104 searches the nodes constituting the tree structure with syntactic information for nodes in a parallel relationship, and adds to the tree a parallel node that has those nodes as child nodes. The parallel node ordering means 105 orders the nodes under the parallel node added by the parallel node adding means 104. The distance calculation means 106 calculates the distance between tree structures with syntactic information.

Next, the operation is described. FIG. 2 is a flowchart of this embodiment. First, in step 10, the sentences or set of documents for which distances between tree structures with syntactic information are to be obtained are acquired through the document input means 101.
Next, in step 20, the syntax analysis means 102 performs morphological analysis and dependency analysis on the acquired text to produce a parsing result. Existing techniques are used for the morphological analysis and dependency analysis. FIG. 3 shows an example of the result of dependency analysis of the sentence 「15時に変圧器の漏電のため障害が発生した」 ("A failure occurred at 15:00 because of a transformer leakage"). Each rectangle represents one phrase (bunsetsu), and each arrow points to the phrase on which that phrase depends.

Next, in step 30, the tree structure creating means 103 converts the parsing result into tree form. FIG. 4 shows an example of a tree structure to which syntactic information has been attached. Each rectangle is one node of the tree and corresponds to one phrase. The "/" inside a phrase is added to show morpheme boundaries. The part of speech of each morpheme is shown in [ ]. Coarse categories such as noun and verb are used here, but finer categories such as "sa-hen noun" (verbal noun) and "independent verb" are also possible. Below each node, the morpheme that serves as the head and the morpheme that serves as the function word are shown. The head is the morpheme at the center of the phrase.
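The node layout described for FIG. 4 can be sketched as a small data structure. This is a minimal illustration, not the patent's implementation: the class names `Morpheme` and `Node`, the field names, and the part-of-speech labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Morpheme:
    surface: str  # written form, e.g. "漏電"
    pos: str      # coarse part of speech (labels here are illustrative)


@dataclass
class Node:
    """One phrase (bunsetsu) of the dependency tree (hypothetical layout)."""
    morphemes: List[Morpheme]
    head_index: int                       # position of the head morpheme
    function_index: Optional[int] = None  # position of the function word, if any
    children: List["Node"] = field(default_factory=list)

    @property
    def head(self) -> Morpheme:
        return self.morphemes[self.head_index]

    @property
    def heading_to_head(self) -> str:
        """Surface string from the start of the phrase through the head."""
        return "".join(m.surface for m in self.morphemes[: self.head_index + 1])


# The phrase 「15時に」 split as 15 / 時 / に, with 「時」 as the head.
node = Node(
    morphemes=[Morpheme("15", "noun"), Morpheme("時", "noun"), Morpheme("に", "particle")],
    head_index=1,
    function_index=2,
)
print(node.head.surface, node.heading_to_head)
```

Keeping the "heading up to the head" as a derived property is a convenience for the replacement cost discussed later, which compares exactly that string.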

Next, in step 40, the parallel node adding means 104 adds a parallel node to those tree structures with syntactic information created in step 30 that contain a parallel or appositive dependency structure. FIG. 5 shows an example of a tree structure with syntactic information that has an appositive dependency structure. Information such as morpheme boundaries and parts of speech is omitted for simplicity. The phrases 「一昨日と」 ("the day before yesterday and") and 「昨日、」 ("yesterday,") are in an appositive relationship, but in the dependency structure 「一昨日と」 depends on 「昨日、」. FIG. 6 shows an example of a tree structure in which a parallel node has been added so that the two phrases are treated equally.
FIG. 7 shows the procedure for adding such a parallel node. A leaf node is a node at the end of the tree; in the tree of FIG. 5, 「一昨日と」, 「変圧器の」 ("of the transformer"), and 「2回の」 ("two") are leaf nodes. The root node is the node at the top; in the tree of FIG. 5, 「発生した」 ("occurred") is the root node.

Next, in step 50 of FIG. 2, the parallel node ordering means 105 orders the nodes in a parallel relationship under each parallel node added in step 40, according to a predetermined criterion. Any criterion that determines the order uniquely may be used; here, the gojūon (Japanese syllabary) order of the head morpheme of each node is used. If nodes earlier in gojūon order are placed to the left, the tree structure in FIG. 6 can be left as it is.
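The ordering step can be sketched as a recursive sort of the children of every parallel node. The `Node` class, the `PARALLEL` marker, and the `head_reading` field are hypothetical names introduced for this example. The sketch also assumes a kana reading of each head morpheme is available, and sorts hiragana by code point, which only approximates gojūon order (voiced and small kana are interleaved).

```python
from dataclasses import dataclass, field
from typing import List

PARALLEL = "<parallel>"  # hypothetical label marking an added parallel node


@dataclass
class Node:
    label: str              # surface form of the phrase
    head_reading: str = ""  # kana reading of the head morpheme (assumed available)
    children: List["Node"] = field(default_factory=list)


def order_parallel_children(node: Node) -> None:
    """Recursively sort the children of every parallel node by head reading."""
    for child in node.children:
        order_parallel_children(child)
    if node.label == PARALLEL:
        # Code-point order of hiragana approximates gojuon order.
        node.children.sort(key=lambda c: c.head_reading)


# Conjuncts given in the order "yesterday and the day before yesterday".
par = Node(PARALLEL, children=[Node("昨日と", "きのう"), Node("一昨日、", "おととい")])
root = Node("発生した", "はっせい", children=[par])
order_parallel_children(root)
ordered_labels = [c.label for c in par.children]
print(ordered_labels)
```

Any other total order would serve equally well; what matters is that two sentences listing the same conjuncts in different orders end up with identical child sequences.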

Next, in step 60, the distance calculation means 106 calculates the edit distance between tree structures with syntactic information. The edit distance is the cost required to transform one tree into the other through node addition, deletion, and replacement operations. In this embodiment, the node insertion cost, deletion cost, and replacement cost needed to calculate the edit distance are defined using syntactic information. Syntactic information here means the number of characters in a word, part of speech, conjugation, case, and so on. The respective costs are defined as follows (Definition 1).

For example, consider obtaining the edit distance between the tree structures with syntactic information that correspond to the following two sentences A and B.
A: "A failure occurred at 15:00 because of a transformer leakage." (「15時に変圧器の漏電のため障害が発生した」)
B: "Two failures occurred at 20:00 due to a transformer leakage." (「20時に変圧器の漏電で2回の障害が発生した」)

FIGS. 8 to 12 are schematic diagrams showing the procedure for transforming the tree structure with syntactic information corresponding to A into the one corresponding to B. Step 210 in FIG. 8 shows the trees corresponding to A and B. To transform the tree of A into the tree of B, first, in step 220 of FIG. 9, 「2回の」 ("two") is inserted under 「障害が」 ("failure"), the rightmost node of the tree of A. By Definition 1, the cost of this operation is 2.0.
Next, in step 230 of FIG. 10, 「ため」 ("because of") is deleted. By Definition 1, the cost of this operation is 2.0.
Next, in step 240 of FIG. 11, 「漏電の」 is replaced with 「漏電で」. The cost is 0, because the part of speech of the head 「漏電」 ("leakage") is the same in both and the heading (written string) up to the head, 「漏電」, is the same.
Next, in step 250 of FIG. 12, 「15時に」 ("at 15:00") is replaced with 「20時に」 ("at 20:00"). The cost is 0.5, because the part of speech of the head 「時」 is the same but the headings up to the head, 「15時」 and 「20時」, differ.
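The arithmetic of steps 220 to 250 can be checked with a short sketch. The insertion, deletion, and same-part-of-speech replacement costs below are the values used in the worked example. The cost for heads with different parts of speech is not given in this text, so `REP_DIFF_POS` is a placeholder assumption, and the part-of-speech labels are illustrative.

```python
INS_COST = 2.0          # step 220
DEL_COST = 2.0          # step 230
REP_SAME_HEADING = 0.0  # same head POS, same heading up to the head (step 240)
REP_DIFF_HEADING = 0.5  # same head POS, different heading (step 250)
REP_DIFF_POS = 2.0      # ASSUMPTION: value not stated in this text


def rep_cost(pos_v: str, heading_v: str, pos_w: str, heading_w: str) -> float:
    """Replacement cost from head POS and heading-up-to-head strings."""
    if pos_v != pos_w:
        return REP_DIFF_POS
    return REP_SAME_HEADING if heading_v == heading_w else REP_DIFF_HEADING


steps = [
    INS_COST,                                  # insert 「2回の」
    DEL_COST,                                  # delete 「ため」
    rep_cost("noun", "漏電", "noun", "漏電"),  # 「漏電の」 -> 「漏電で」
    rep_cost("noun", "15時", "noun", "20時"),  # 「15時に」 -> 「20時に」
]
total = sum(steps)
print(total)  # 4.5
```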

The sum of the costs of the above steps, 4.5, is the edit distance between the tree structures with syntactic information A and B. Accordingly, the smaller the edit distance between two trees, the more similar the sentences. There are many ways to transform one tree into the other, and hence many possible total costs; the edit distance is defined as the minimum of these. Such an edit distance can be computed by dynamic programming or a similar method. For example, it is defined as follows.

[Definition 2]
Let F_i be an ordered set of trees (a forest); d(F₁, F₂) the distance between F₁ and F₂; φ the empty set; v the rightmost root among the nodes of F₁; w the rightmost root among the nodes of F₂; del(v) the cost of deleting v; ins(w) the cost of inserting w; rep(v, w) the cost of replacing v with w; F_i(v) the children of v among the nodes or trees of F_i; T_i(v) the tree rooted at v; F_i − v the nodes or trees of F_i with v deleted; and F_i − T_i(v) the nodes or trees of F_i with all nodes of T_i(v) deleted. The relationships among them are given by equations (1) to (4) below.
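Equations (1) to (4) are not reproduced in this text. For orientation, the recursion commonly used for the edit distance between ordered forests can be written with the symbols of Definition 2 as shown below. This is a reconstruction under that assumption, and the assignment of equation numbers is likewise assumed.

```latex
\begin{align}
d(\phi, \phi) &= 0 \tag{1}\\
d(F_1, \phi) &= d(F_1 - v,\ \phi) + \mathrm{del}(v) \tag{2}\\
d(\phi, F_2) &= d(\phi,\ F_2 - w) + \mathrm{ins}(w) \tag{3}\\
d(F_1, F_2) &= \min
\begin{cases}
d(F_1 - v,\ F_2) + \mathrm{del}(v)\\
d(F_1,\ F_2 - w) + \mathrm{ins}(w)\\
d\bigl(F_1(v),\ F_2(w)\bigr) + d\bigl(F_1 - T_1(v),\ F_2 - T_2(w)\bigr) + \mathrm{rep}(v, w)
\end{cases} \tag{4}
\end{align}
```

The three cases of the last equation correspond to deleting v, inserting w, and matching v with w. In the matching case the children of v are compared with the children of w, and the remaining forests to the left are compared with each other.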

Given as F₁ and F₂ the two tree structures with syntactic information whose edit distance is wanted, the edit distance is obtained as d(F₁, F₂) by applying equations (1) to (4) recursively. It is important, however, that del(v), ins(w), and rep(v, w) follow Definition 1. In particular, the replacement cost rep(v, w) must vary according to a comparison of the parts of speech of the nodes' morphemes and a comparison of their heading strings.
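A recursive forest edit distance of the kind described can be sketched as follows. The sketch rests on several assumptions: it uses the standard rightmost-root recursion for ordered forests, each node is reduced to its heading-up-to-head string, all heads are treated as having the same part of speech (so replacement costs 0 or 0.5), and the dependency shapes of sentences A and B are guessed from the description of steps 220 to 250. The helper `t` and all function names are hypothetical.

```python
from functools import lru_cache
from typing import Tuple

Tree = Tuple[str, tuple]   # (label, children)
Forest = Tuple[Tree, ...]  # ordered sequence of trees


def t(label: str, *children: Tree) -> Tree:
    """Build a hashable tree node."""
    return (label, tuple(children))


def ins(w: str) -> float:
    return 2.0


def delete(v: str) -> float:
    return 2.0


def rep(v: str, w: str) -> float:
    # Simplification: POS assumed equal, only headings are compared.
    return 0.0 if v == w else 0.5


@lru_cache(maxsize=None)
def d(f1: Forest, f2: Forest) -> float:
    """Edit distance between two ordered forests (memoized recursion)."""
    if not f1 and not f2:
        return 0.0
    if not f2:
        v, vkids = f1[-1]
        return d(f1[:-1] + vkids, ()) + delete(v)
    if not f1:
        w, wkids = f2[-1]
        return d((), f2[:-1] + wkids) + ins(w)
    (v, vkids), (w, wkids) = f1[-1], f2[-1]
    return min(
        d(f1[:-1] + vkids, f2) + delete(v),                 # delete v
        d(f1, f2[:-1] + wkids) + ins(w),                    # insert w
        d(vkids, wkids) + d(f1[:-1], f2[:-1]) + rep(v, w),  # match v with w
    )


# Assumed dependency shapes for sentences A and B.
A = (t("発生", t("15時"), t("ため", t("漏電", t("変圧器"))), t("障害")),)
B = (t("発生", t("20時"), t("漏電", t("変圧器")), t("障害", t("2回"))),)
distance_ab = d(A, B)
print(distance_ab)  # 4.5
```

Deleting the rightmost root leaves its children in place of it, which is why `f1[:-1] + vkids` represents the forest with v removed. Memoizing on immutable tuples plays the role of the dynamic programming table.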

In the embodiment described above, the parallel node has the following effect. Consider obtaining the edit distance between the tree structures with syntactic information that correspond to the following two sentences C and D.
C: "The day before yesterday and yesterday, two failures occurred due to a transformer leakage." (「一昨日と昨日、変圧器の漏電で2回の障害が発生した」)
D: "Yesterday and the day before yesterday, two failures occurred due to a transformer leakage." (「昨日と一昨日、変圧器の漏電で2回の障害が発生した」)
The tree structures with syntactic information are as shown in FIG. 13. If the edit distance is obtained according to Definitions 1 and 2 without adding parallel nodes, 「昨日、」 in C must be replaced with 「一昨日、」 and 「一昨日と」 must be replaced with 「昨日と」. Each replacement costs 0.5, so the edit distance is 1.0.

When parallel nodes are introduced, on the other hand, C and D become C' and D' in FIG. 14. If the edit distance between C' and D' is obtained according to Definitions 1 and 2, 「昨日、」 in C' must be replaced with 「昨日と」 and 「一昨日と」 must be replaced with 「一昨日、」, but each of these costs 0, so the edit distance is also 0. Since the two sentences do not actually differ in meaning, the distance has been obtained more accurately.
In this way, the similarity between sentences can be determined while taking into account both the commonality of the words that appear and the structure of the sentence.
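The effect on sentences C and D can be illustrated with a deliberately simplified sketch. Only the two conjuncts are compared, position by position, using their head strings; this is not a full tree edit distance. The readings in `READING` and the function names are assumptions made for the example.

```python
def rep(head_v: str, head_w: str) -> float:
    # Same POS assumed; cost depends only on whether the headings match.
    return 0.0 if head_v == head_w else 0.5


def aligned_cost(xs, ys) -> float:
    """Sum of replacement costs when conjuncts are matched position by position."""
    return sum(rep(x, y) for x, y in zip(xs, ys))


READING = {"一昨日": "おととい", "昨日": "きのう"}  # assumed kana readings

c_heads = ["一昨日", "昨日"]  # sentence C, in sentence order
d_heads = ["昨日", "一昨日"]  # sentence D, in sentence order

# Without parallel nodes the conjuncts are compared in sentence order.
without_parallel = aligned_cost(c_heads, d_heads)

# With parallel nodes, children are first sorted by the reading of the head.
with_parallel = aligned_cost(
    sorted(c_heads, key=READING.get),
    sorted(d_heads, key=READING.get),
)
print(without_parallel, with_parallel)  # 1.0 0.0
```

The two totals match the distances 1.0 and 0 given in the text for FIG. 13 and FIG. 14.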

In this embodiment the addition and deletion costs are fixed, but the cost may be made smaller when an addition and a deletion together correspond to moving the same node. FIG. 15 shows an example in which a node move occurs while the edit distance is being obtained. When E in FIG. 15 is transformed into F, 「2回の」 can be deleted and then added under 「障害が」. This is simply a node move, so by defining the costs so that it is cheaper than newly adding 「2回の」, a moved node can be made to cost less than a newly added one. This has the effect that sentences using the same words are judged to be more similar.
In this embodiment a single sentence was used as the example from which a tree structure with syntactic information is created, but multiple sentences may also be used. In that case, as in Definition 2, the input is handled not as a single tree but as a set of trees, that is, a forest.

The addition, deletion, and replacement costs in this embodiment were defined as in Definition 1, but other numerical values may be used. In addition, the replacement cost was branched on whether the headings up to the head are equal, but it may instead be branched on whether the headings of the head alone are equal, or other syntactic information may be used.

Embodiment 2.
FIG. 16 is a configuration diagram showing Embodiment 2 of the present invention. The statistical information analysis means 107 counts how often the morphemes and phrases produced by the syntax analysis means 102 occur in the input documents as a whole.
In Embodiment 1, the distance calculation means 106 obtained the edit distance according to Definition 1. In this embodiment, each cost is obtained using frequency-of-occurrence information for the node that is subject to an edit operation such as insertion, deletion, or replacement.
For example, TF-IDF computed on the heading up to the head is applied as a weight to the edit distance. The TF-IDF weight TFIDF(v) of node v is defined by the following formulas.

Here, n_v is the frequency of occurrence of node v, D is the document set, and d is a document contained in D. Because the weight of each node is common to all documents, tf(v) is defined as in equation (6).

Using the TF-IDF weight above, the insertion cost ins(v) and the deletion cost del(v) are defined as follows.

TFIDF_average is the mean of TFIDF(v), and TFIDF(v) is divided by its maximum value to normalize it to the range 0 to 1. With the above formula, the cost is 2.0 when TFIDF(v) equals the mean.
The replacement cost rep(v, w) for nodes v and w is defined as follows.

In this way, it is possible to reduce the influence on the distance calculation of phrases that appear frequently and have low importance as sentence constituents (for example 「, 」, 「次に」 ("next"), and 「そして」 ("and then")).
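The frequency-based weighting of this embodiment can be sketched as below. The patent's own formulas are not reproduced in this text, so everything here is an assumption: the collection-wide term frequency, the logarithmic inverse document frequency, and the cost formula `base + (weight - average) / maximum`. That cost formula was chosen only because it is consistent with two statements in the description: weights are normalized by the maximum, and the cost equals 2.0 at the average weight.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> Dict[str, float]:
    """Collection-wide TF-IDF weight per node label (assumed formulation)."""
    counts = Counter(v for doc in docs for v in doc)   # n_v over all documents
    total = sum(counts.values())
    df = Counter(v for doc in docs for v in set(doc))  # documents containing v
    return {v: (counts[v] / total) * math.log(len(docs) / df[v]) for v in counts}


def edit_cost(v: str, weights: Dict[str, float], base: float = 2.0) -> float:
    """Insertion/deletion cost that equals `base` at the average weight (assumed)."""
    avg = sum(weights.values()) / len(weights)
    top = max(weights.values())
    if top == 0:
        return base
    return base + (weights[v] - avg) / top


# Toy documents, each given as a list of headings up to the head.
docs = [
    ["そして", "変圧器", "漏電"],
    ["そして", "障害", "発生"],
    ["そして", "変圧器", "障害"],
]
weights = tfidf_weights(docs)
cost_common = edit_cost("そして", weights)
cost_rare = edit_cost("漏電", weights)
print(cost_common, cost_rare)
```

In this toy collection 「そして」 occurs in every document, so its weight is 0 and its edit cost falls below 2.0, while the rarer 「漏電」 costs more than 2.0.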

Embodiment 3.
In Embodiment 1, a tree structure was created for a sentence, or a forest structure for a set of sentences, but the forest structure may also be created by analyzing the document structure. FIG. 17 is a configuration diagram showing Embodiment 3 of the present invention. In FIG. 17, the document structure analysis means 108 analyzes document structure such as bulleted lists and the chapter hierarchy.

In this embodiment, the document is first analyzed, and bulleted items and hierarchical headings (chapters, sections, and subsections) are decomposed as parallel relationships. The document structure analysis means 108 uses existing techniques; the decomposition may also be done manually. In Embodiment 1, when there were multiple sentences, each sentence became one tree and the whole became a forest. In this embodiment, sentences in a parallel relationship are joined under a parallel node to form a single tree structure with syntactic information. This allows text structures that do not depend on order, such as bulleted lists, to be treated equally, so the distance between documents can be calculated accurately.

Embodiment 4.
The invention can also be configured to search the stored tree structures of documents for those similar to an input sentence. FIG. 18 is a configuration diagram showing Embodiment 4 of the present invention. In FIG. 18, the search sentence input means 109 inputs a search sentence. The tree structure set 110 stores a set of documents that have been converted in advance into tree structures with syntactic information. The search result output means 111 outputs the text whose tree structure with syntactic information has a small edit distance from the search sentence.

In this embodiment, multiple documents are converted into tree structures in advance by the same method as in Embodiment 1, 2, or 3 and stored in the tree structure set 110. The tree structure set 110 may be a database or the like. Next, when text is input from the search sentence input means 109, it is converted into a tree structure with syntactic information by the same method as in Embodiment 1, 2, or 3. The edit distance between the resulting tree structure and each tree structure with syntactic information stored in the tree structure set 110 is then computed. The resulting set of edit distances is sorted, and the search result output means 111 outputs a predetermined number of results in ascending order of distance. In this way, documents similar to the search sentence can be retrieved.
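The retrieval procedure above can be sketched as follows. The sketch assumes trees are represented as `(label, children)` tuples and uses a simplified top-down tree edit distance with unit costs as a stand-in. It does not reproduce the edit-distance definition of the embodiments (for example, it has no reduced cost for node moves and no frequency-based weighting). The names `tree_distance`, `search`, and the document IDs are hypothetical.

```python
import heapq


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in tree[1])


def tree_distance(a, b):
    """Simplified top-down edit distance (stand-in, unit costs).

    The roots are compared by label; the child sequences are aligned by
    dynamic programming, where inserting or deleting a child costs the size
    of its subtree and substituting costs the recursive distance.
    """
    xs, ys = a[1], b[1]
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                       # delete
                d[i][j - 1] + size(ys[j - 1]),                       # insert
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # substitute
            )
    return (0 if a[0] == b[0] else 1) + d[-1][-1]


def search(query, stored, k):
    """Return the k (distance, doc_id) pairs with the smallest distance."""
    scored = [(tree_distance(query, tree), doc_id) for doc_id, tree in stored.items()]
    return heapq.nsmallest(k, scored)


stored = {
    "d1": ("eat", (("cat", ()), ("fish", ()))),
    "d2": ("eat", (("dog", ()), ("fish", ()))),
    "d3": ("run", (("dog", ()),)),
}
query = ("eat", (("cat", ()), ("fish", ())))
print(search(query, stored, 2))  # [(0, 'd1'), (1, 'd2')]
```

Here `stored` plays the role of the tree structure set 110 and `k` is the predetermined number of results. Any other distance function with the same signature could be substituted for `tree_distance`.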

The inter-document distance calculator and sentence searcher according to the present invention can be used in devices that calculate the similarity between documents and group multiple documents into similar texts according to their content, and in devices that retrieve, from multiple documents, those similar to a search sentence.

101: document input means; 102: syntax analysis means; 103: means for creating a tree structure with syntactic information; 104: parallel node adding means; 105: parallel node ordering means; 106: means for calculating the distance between tree structures with syntactic information; 107: statistical information analysis means; 108: document structure analysis means; 109: search sentence input means; 110: tree structure set; 111: search result output means.

Claims

An inter-document distance calculator comprising:
document input means for inputting documents between which a distance is to be calculated;
syntax analysis means for performing morphological analysis and dependency analysis on a character string of a document input by the document input means;
means for creating a tree structure with syntactic information from the result of the syntax analysis by the syntax analysis means;
parallel node adding means for searching the tree structure with syntactic information for nodes in a parallel relation and adding, to the tree structure with syntactic information, a parallel node having those nodes as child nodes;
parallel node ordering means for ordering the nodes under the parallel node added by the parallel node adding means; and
distance calculation means for calculating, according to a predefined definition, the edit distance incurred when the tree structure with syntactic information, to which the parallel node with its subordinate nodes ordered has been added, is edited into the tree structure with syntactic information of another document.

The inter-document distance calculator according to claim 1, wherein
the predefined definition of the edit distance sets a smaller edit distance for moving a node within the same tree structure than for newly adding a node, and
the distance calculation means, when a node is moved within the same tree structure, calculates in accordance with that definition an edit distance smaller than that for a new addition.

The inter-document distance calculator according to claim 1, further comprising statistical information acquisition means for acquiring, from the result of the syntax analysis by the syntax analysis means, the appearance frequency of a node to be edited, wherein
the predefined definition of the edit distance is set according to the appearance frequency of the node to be edited, and
the distance calculation means, when calculating the edit distance incurred in editing the tree structure with syntactic information, to which the parallel node with its subordinate nodes ordered has been added, into the tree structure with syntactic information of another document, uses the definition set according to the appearance frequency of the node to be edited.

The inter-document distance calculator, further comprising document structure analysis means for analyzing the document structure of the document input by the document input means, wherein
the means for creating a tree structure with syntactic information connects, by a parallel node, sentences found to be in a parallel relation by the analysis of the document structure analysis means, thereby forming a tree structure with syntactic information.

A sentence searcher comprising:
a tree structure set in which a plurality of tree structures with syntactic information are stored in advance, each created from the results of morphological analysis and dependency analysis of a document, each having an added parallel node whose child nodes are nodes in a parallel relation among the nodes constituting that tree structure, and each having the nodes under the added parallel node ordered;
search sentence input means for inputting a search sentence;
syntax analysis means for performing morphological analysis and dependency analysis on a character string of the text input by the search sentence input means;
means for creating a tree structure with syntactic information from the result of the syntax analysis by the syntax analysis means;
parallel node adding means for searching the tree structure with syntactic information for nodes in a parallel relation and adding, to the tree structure with syntactic information, a parallel node having those nodes as child nodes;
parallel node ordering means for ordering the nodes under the parallel node added by the parallel node adding means;
distance calculation means for obtaining the edit distance between the tree structure with syntactic information, to which the parallel node with its subordinate nodes ordered has been added, and each tree structure with syntactic information stored in the tree structure set; and
search result output means for sorting the set of edit distances obtained by the distance calculation means and outputting a predetermined number of results in ascending order.