JP3682535B2

JP3682535B2 - Document difference detection apparatus and program

Info

Publication number: JP3682535B2
Application number: JP2002290946A
Authority: JP
Inventors: Masaki Murata (村田真樹)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2002-10-03
Filing date: 2002-10-03
Publication date: 2005-08-10
Anticipated expiration: 2022-10-03
Also published as: JP2004126986A

Description

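[0001]
[Technical Field of the Invention]
The present invention relates to a document difference detection apparatus and program that detect differences between documents (or passages of text) so that the differences can be understood easily.
[0002]
[Prior Art]
Conventionally, there has been a technique that uses the diff command to detect differences among multiple input document data sets, outputting each common part once and outputting the mismatched parts side by side.
[0003]
Here, diff refers to the UNIX (registered trademark) file comparison tool diff. Given two files, the diff command outputs their differences line by line while preserving order information.
[0004]
The diff command has a convenient option, -D. When diff is run with this option, it outputs the common parts as well as the differing parts; in other words, it merges the files. To make the differing parts easy to see, markers are displayed for the beginning of a differing part, the end of a differing part, and the boundary between the two pieces of data that make up the difference. diff used for merging files in this way is called Mdiff (the M stands for merge) (see, for example, Non-Patent Document 1 and Japanese Patent Application No. 2001-311329).
[0005]
Using this technique, an experiment was carried out to detect the differences among multiple claims of a single patent. This is a new attempt. Two claims of a patent were reformatted so that each line contains one word, and their Mdiff was taken (in the following description, the lenticular brackets 【 】 of claim headings and the like are replaced with 〔 and 〕).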
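The word-level merge described above can be approximated with a short sketch. It does not call diff -D; it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, and the word segmentation of the two claims from Example 1 is done by hand, so both the function name `mdiff` and the token lists are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Marker strings copied from the Mdiff output shown in this document.
BEGIN = ";=====begin====="
SEP = "；────────"
END = ";=====end====="

def mdiff(a, b):
    """Merge two token lists, marking the spans where they differ."""
    lines = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Common part: output once.
            lines.append("".join(a[i1:i2]))
        else:
            # "replace", "delete" and "insert" all become one marked block.
            lines += [BEGIN, "".join(a[i1:i2]), SEP, "".join(b[j1:j2]), END]
    return lines

# Hand-segmented tokens for claims 17 and 18 (assumed segmentation).
tail = "を 有する こと を 特徴 と する 請求項 １６ 記載 の プリンタ システム の 制御 方法 。".split()
claim17 = "前記 プリンタ システム は 上位 装置".split() + tail
claim18 = "前記 プリンタ システム は プリンタ".split() + tail

print("\n".join(mdiff(claim17, claim18)))
```

Because the alignment preserves token order, a pair of claims that share little ordered material produces many small marked blocks, which is the readability problem discussed next.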
[0006]
Example 1
〔請求項１７〕前記プリンタシステムは上位装置を有することを特徴とする請求項１６記載のプリンタシステムの制御方法。
〔請求項１８〕前記プリンタシステムはプリンタを有することを特徴とする請求項１６記載のプリンタシステムの制御方法。
[0007]
(Mdiff result for Example 1 above)
前記プリンタシステムは
;=====begin=====
上位装置
；────────
プリンタ
;=====end=====
を有することを特徴とする請求項１６記載のプリンタシステムの制御方法
。
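[0008]
From the Mdiff result for claims 17 and 18 in Example 1 above, the difference between claim 17 and claim 18 can be understood very easily. Here, ;=====begin===== marks the beginning of a differing part, ;=====end===== marks its end, and ；──────── marks the boundary between the two pieces of data that make up the difference. In this case the difference is 「上位装置」 (host device) versus 「プリンタ」 (printer). When the differences are more complicated, however, the Mdiff result becomes hard to read.
[0009]
Example 2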
〔請求項１〕
刃部材の先端の刃部を凹凸に形成し波状刃とするとともに螺旋状に湾曲させ、前記刃部材に取っ手を取り付けたことを特徴とする草取り鎌。
〔請求項２〕
取っ手の上部及び下部に滑り止め部を設けたことを特徴とする草取り鎌。
[0010]
(Mdiff result for Example 2 above)
;=====begin=====
刃部材
；────────
取っ手
;=====end=====
の
;=====begin=====
先端の刃
；────────
上部及び下部に滑り止め
;=====end=====
部を
;=====begin=====
凹凸に形成し波状刃とするとともに螺旋状に湾曲させ、前記刃部材に取っ手を取り付け
；────────
設け
;=====end=====
たことを特徴とする草取り鎌。
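[0011]
The Mdiff result for claims 1 and 2 in Example 2 above is hard to read because the differences are complicated. That is, because Mdiff is a mechanism that preserves order information, the differences become difficult to grasp when they are complex, and it was found that Mdiff as it stands has a problem.
[0012]
[Non-Patent Document 1]
Masaki Murata and one other, "diff and language processing" (diffと言語処理), "Language Understanding and Communication" (言語理解とコミュニケーション), The Institute of Electronics, Information and Communication Engineers, July 17, 2001 (NLC2001-26), IEICE Technical Report, pp. 29-36
[0013]
[Problem to Be Solved by the Invention]
With the conventional approach using Mdiff, the Mdiff result becomes hard to read when the differences are complicated.
[0014]
An object of the present invention is to solve the above problem and to provide an easy-to-understand display even when the differences are complicated.
[0015]
[Means for Solving the Problem]
FIG. 1 illustrates the principle of the present invention. In FIG. 1, 2 is an extraction means, 3a is a storage means, and 21 is an extraction/detection region setting means.
[0016]
To solve the conventional problem described above, the present invention has the following means.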
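[0017]
(1): The apparatus comprises an input means for inputting information; an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via the input means; a storage means 3a for storing information; and an extraction means 2. The extraction means 2 repeats the following for each detection region: it extracts every item corresponding to the extraction unit from the regions of the input document data other than the current detection region and stores them in the storage means 3a, and then outputs the text of the current detection region with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted. This makes it easy to extract and display the features and differences of a document, which are its new information.
[0018]
(2): The apparatus comprises an input means for inputting information; an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via the input means; a storage means 3a for storing information; and an extraction means 2. The extraction means 2 repeats the following for each detection region: it outputs the text of the current detection region of the input document data with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted, and then stores the highlighted items in the storage means 3a. This makes it easy to extract and display newly appearing items corresponding to the extraction unit (for example, words).
[0019]
(3): In the document difference detection apparatus of (1) or (2), data of extraction units that are not to be highlighted is stored in the storage means 3a in advance. Expressions that are not very important can thus be excluded from highlighting beforehand, making the output easier to read.
[0020]
(4): In the document difference detection apparatus of (1) to (3), the extraction unit is the word. Newly appearing words can thus be extracted and displayed.
[0021]
(5): In the document difference detection apparatus of (1) to (4), the unit of the detection region is the bulleted-list item. The differences among list items can thus be understood easily.
[0022]
(6): In the document difference detection apparatus of (1) to (4), the unit of the detection region is the patent claim. The features of and differences among claims can thus be understood easily.
[0023]
(7): A program, or a computer-readable recording medium on which the program is recorded, for causing a computer to function as: an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via an input means; and an extraction means 2 that repeats, for each detection region, extracting every item corresponding to the extraction unit from the regions of the input document data other than the current detection region and storing them in a storage means 3a, and then outputting the text of the current detection region with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted. Installing this program on a computer makes it easy to provide a document difference detection apparatus that can easily extract and display the features and differences of a document.
[0024]
(8): A program, or a computer-readable recording medium on which the program is recorded, for causing a computer to function as: an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via an input means; and an extraction means 2 that repeats, for each detection region, outputting the text of the current detection region of the input document data with every item corresponding to the extraction unit that is not stored in a storage means 3a highlighted, and then storing the highlighted items in the storage means 3a. Installing this program on a computer makes it easy to provide a document difference detection apparatus that can extract and display newly appearing items corresponding to the extraction unit.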
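[0025]
[Embodiments of the Invention]
(1): Description of the document difference detection apparatus
FIG. 2 is an explanatory diagram of the document difference detection apparatus. In FIG. 2, the apparatus is provided with an input means 1, an extraction means 2, an extract storage device 3, and an output means 4. The input means 1 is a keyboard, mouse, reading device, or the like used to input information. The extraction means 2 extracts the differences in the input documents. The extract storage device 3 is an extract storage means that stores extracted items such as words, kanji, and noun phrases. The output means 4 is a display device, printer, or the like used to output information.
[0026]
①: Description of the morphological analysis system
To split Japanese text into words, the extraction means 2 needs a morphological analysis system. ChaSen is described here (the morphological analysis system ChaSen (茶筌) is developed at the Nara Institute of Science and Technology and is available at http://chasen.aist-nara.ac.jp/index.html.jp ).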
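[0027]
ChaSen splits a Japanese sentence into words and also estimates the part of speech of each word. For example, the input 「学校へ行く」 ("go to school") gives the following result.
[0028]
学校  ガッコウ  学校  名詞−一般
へ  ヘ  へ  助詞−格助詞−一般
行く  イク  行く  動詞−自立  五段・カ行促音便  基本形
ＥＯＳ
The text is split so that each line holds one word, and each word is annotated with its reading and part of speech. The extraction means 2 of the present invention uses only the word-splitting part of this functionality (the morphological analysis means).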
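Since only the word-splitting part is used, the analyzer output can be reduced to its first column. The sketch below does not invoke ChaSen; it assumes the analyzer emits one word per line with whitespace-separated fields and an EOS line after each sentence, as in the sample above. The function name `surface_forms` is hypothetical.

```python
def surface_forms(analyzer_output: str) -> list[str]:
    """Return the surface form (first field) of every analyzed word."""
    words = []
    for line in analyzer_output.splitlines():
        line = line.strip()
        # Skip blank lines and the end-of-sentence marker.
        if not line or line in ("EOS", "ＥＯＳ"):
            continue
        words.append(line.split()[0])
    return words

# Sample shaped like the output shown above (assumed field separators).
sample = (
    "学校\tガッコウ\t学校\t名詞-一般\n"
    "へ\tヘ\tへ\t助詞-格助詞-一般\n"
    "行く\tイク\t行く\t動詞-自立\t五段・カ行促音便\t基本形\n"
    "EOS\n"
)
print(surface_forms(sample))
```

The reading and part-of-speech columns are simply discarded here, which mirrors the statement that only the segmentation is needed.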
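[0029]
②: Description of English stemmers
English words are already separated by spaces, so for the extraction means 2 to extract words it only needs to perform stemming, which reduces each word to its base form. A well-known stemming algorithm is Porter's (see Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137).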
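To give a feel for what stemming does, the sketch below implements only Step 1a of Porter's algorithm, the plural-suffix rules (SSES to SS, IES to I, SS unchanged, S removed). It is a deliberately partial illustration, not the full algorithm; the function name `porter_step_1a` is an assumption.

```python
def porter_step_1a(word: str) -> str:
    """Apply only Step 1a (plural suffixes) of the Porter stemmer."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "claim"]:
    print(w, "->", porter_step_1a(w))
```

A complete stemmer applies several further steps after this one, so "claims" and "claimed" would both end up as "claim" and be treated as the same extraction unit.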
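[0030]
③: Description of extraction units and detection regions
Characters, paragraphs, sentences, bulleted-list items, and the like can be recognized mechanically from the form of the document. Characters can be recognized from their 1-byte or 2-byte codes. Paragraphs can be recognized from indentation and line breaks. Sentences can be recognized from the presence of a Japanese full stop (。) or a period. List items can be recognized from indentation, line breaks, the symbol at the start of each item, and so on. Words are recognized by the morphological analysis system or stemmer described above. This recognition can be carried out, for example, by providing a recognition means for each unit inside the extraction means.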
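The mechanical recognition described above can be sketched with a few regular expressions. The rules below are simplifying assumptions chosen for illustration: a sentence ends at 。, ． or an ASCII period, a paragraph is delimited by a blank line, and a claim starts at a 〔請求項N〕 heading. All three function names are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split at a Japanese full stop or a period, keeping the terminator."""
    return [s.strip() for s in re.findall(r"[^。．.]+[。．.]?", text) if s.strip()]

def split_paragraphs(text: str) -> list[str]:
    """Split at blank lines (an assumed paragraph convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_claims(text: str) -> list[str]:
    """Split at 〔請求項N〕 headings, keeping each heading with its claim."""
    return [c.strip() for c in re.findall(r"〔請求項[0-9０-９]+〕[^〔]*", text)]

print(split_sentences("今日は晴れ。明日は雨。"))
print(split_claims("〔請求項１〕草取り鎌。\n〔請求項２〕取っ手。"))
```

A period-based rule misfires on abbreviations and decimal numbers, so a practical recognizer would need extra checks; the sketch only shows that each unit can be found from surface form alone.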
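[0031]
(2): Description of difference detection
The present invention detects differences by one of two methods. Neither method uses the diff command. The two methods are described below with reference to flowcharts.
[0032]
①: Method 1
FIG. 3 is a flowchart of the document difference detection process of Method 1. The description below follows steps S1 to S3-2 of FIG. 3.
[0033]
S1: The extraction unit and the unit of the detection region are determined in advance via the input means 1 or the like. The extraction unit is the unit of what is output as a difference; candidates include "word", "kanji", and "noun phrase". The unit of the detection region is the unit of the region compared in order to detect differences; candidates include "character", "word", "sentence", "bulleted-list item", "paragraph", and "patent claim".
[0034]
S2: The extraction means 2 stores all of the input data in a memory means (inside the extraction means 2).
[0035]
S3: The extraction means 2 scans the input data from the left and, starting from the leftmost detection region, repeats steps S3-1 and S3-2 below for each detection region determined in step S1.
[0036]
S3-1: The extraction means 2 extracts every item corresponding to the extraction unit (for example, every word) from all regions other than the current detection region and stores them in the extract storage device 3.
[0037]
S3-2: The extraction means 2 outputs the text of the current detection region to the output means 4, highlighting every item corresponding to the extraction unit (for example, every word) that is not stored in the extract storage device 3.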
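Steps S3-1 and S3-2 can be sketched as follows, with a Python set standing in for the extract storage device 3. The detection regions are two claims about a weeding sickle, segmented into words by hand; this segmentation, the merging of adjacent highlighted words into one 《 》 run, and the names `highlight` and `method1` are assumptions made for illustration.

```python
def highlight(tokens, known):
    """Join tokens, wrapping each run of tokens absent from `known` in 《 》."""
    out, in_run = [], False
    for tok in tokens:
        is_new = tok not in known
        if is_new and not in_run:
            out.append("《")
        elif not is_new and in_run:
            out.append("》")
        out.append(tok)
        in_run = is_new
    if in_run:
        out.append("》")
    return "".join(out)

def method1(regions):
    """For each region, highlight the units found in no other region."""
    results = []
    for i, region in enumerate(regions):
        # S3-1: collect every unit from all regions except the current one.
        others = {tok for j, r in enumerate(regions) if j != i for tok in r}
        # S3-2: output the current region with the unseen units highlighted.
        results.append(highlight(region, others))
    return results

# Hand-segmented claims (assumed segmentation).
claim1 = ("刃 部材 の 先端 の 刃 部 を 凹凸 に 形成 し 波状 刃 と する とともに 螺旋 状 に "
          "湾曲 さ せ 、 前記 刃 部材 に 取っ手 を 取り付け た こと を 特徴 と する 草取り 鎌 。").split()
claim2 = "取っ手 の 上部 及び 下部 に 滑り止め 部 を 設け た こと を 特徴 と する 草取り 鎌 。".split()

for line in method1([claim1, claim2]):
    print(line)
```

Because each region is compared against a set rather than an ordered sequence, word order no longer matters, which is the key difference from the Mdiff approach.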
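[0038]
②: Example of Method 1
Method 1 is illustrated here using the claims of a patent specification as the detection regions and the word as the extraction unit. All words are extracted from every claim other than the claim currently being analyzed, and the words in the current claim that do not appear in any other claim are identified. The result is shown in Example 3 below.
[0039]
Example 3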
〔請求項１〕《刃部材》の《先端》の《刃》部を《凹凸》に《形成し波状刃》とする《とともに螺旋状》に《湾曲させ、前記刃部材》に取っ手を《取り付け》たことを特徴とする草取り鎌。
〔請求項２〕取っ手の《上部及び下部》に《滑り止め》部を《設け》たことを特徴とする草取り鎌。
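[0040]
In Example 3 above, the words that did not appear in the other claim are enclosed in 《 and 》 (highlighting). This result is far easier to read than the Mdiff result of Example 2. From Example 3 it is very easy to see that the feature of claim 2 is 「上部及び下部の滑り止め部」 (anti-slip parts at the top and bottom). Once the feature of claim 2 is understood to be 「滑り止め部」 (the anti-slip part), the embodiments and working examples corresponding to claim 2 can also be extracted easily by pulling out the paragraphs of the embodiments and working examples that contain this term.
[0041]
In this way the method is very useful for extracting features and differences. It is also useful for extracting the embodiments and working examples that correspond to a given claim, that is, for aligning claims with embodiments and working examples.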
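The claim-to-embodiment alignment mentioned above amounts to a substring search once the feature term is known. A minimal sketch follows; the function name `paragraphs_mentioning` is hypothetical and the two paragraphs are abridged, illustrative stand-ins for embodiment text.

```python
def paragraphs_mentioning(term, paragraphs):
    """Return the (number, text) pairs whose text contains `term`."""
    return [(number, text) for number, text in paragraphs if term in text]

# Abridged, illustrative embodiment paragraphs (not the full original text).
embodiment = [
    ("0014", "本例の草取り鎌１ａは、刃部材２の延長部２ａが短い。"),
    ("0015", "握り部３ｂの上に上滑り止め部３ａを設ける。"),
]
print(paragraphs_mentioning("滑り止め部", embodiment))
```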
[0042]
Method 1 was next applied to another example with three claims, giving the result shown in Example 4 below.
[0043]
Example 4
〔請求項１〕《刃部材》の《先端》の《刃》部を《凹凸》に《形成し波状刃》とする《とともに螺旋状》に《湾曲させ、前記刃部材》に取っ手を《取り付け》たことを特徴とする草取り鎌。
〔請求項２〕取っ手の上部に滑り止め部を設けたことを特徴とする草取り鎌。
〔請求項３〕取っ手の上部《及び下部》に滑り止め部を設けたことを特徴とする草取り鎌。
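[0044]
In the result of Example 4 above, 「滑り止め部」 (the anti-slip part), which is the feature of claims 2 and 3, could not be extracted: the term appears in both claims, so each claim finds it in the other and leaves it unhighlighted. To solve this problem, a second, new method (Method 2) was devised.
[0045]
②: Method 2
FIG. 4 is a flowchart of the document difference detection process of Method 2. The description below follows steps S11 to S12-2 of FIG. 4.
[0046]
S11: The extraction unit and the unit of the detection region are determined in advance via the input means 1 or the like. The extraction unit is the unit of what is output as a difference; candidates include "word", "kanji", and "noun phrase". The unit of the detection region is the unit of the region compared in order to detect differences; candidates include "character", "word", "sentence", "bulleted-list item", "paragraph", and "patent claim".
[0047]
S12: Input data is supplied from the input means 1 one detection region (as determined in step S11) at a time, and the extraction means 2 repeats steps S12-1 and S12-2 below.
[0048]
S12-1: The extraction means 2 outputs the text of the current detection region to the output means 4, highlighting every item corresponding to the extraction unit (for example, every word) that is not stored in the extract storage device 3. The extract storage device 3 is initially empty.
[0049]
S12-2: The expressions highlighted in step S12-1 are stored in the extract storage device 3.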
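Steps S12-1 and S12-2 can be sketched in the same style, again with a set standing in for the extract storage device 3. The three claims are segmented by hand; this segmentation, the merging of adjacent highlighted words, and the names `highlight` and `method2` are assumptions made for illustration.

```python
def highlight(tokens, known):
    """Join tokens, wrapping each run of tokens absent from `known` in 《 》."""
    out, in_run = [], False
    for tok in tokens:
        is_new = tok not in known
        if is_new and not in_run:
            out.append("《")
        elif not is_new and in_run:
            out.append("》")
        out.append(tok)
        in_run = is_new
    if in_run:
        out.append("》")
    return "".join(out)

def method2(regions):
    """Highlight, region by region, the units not seen in any earlier region."""
    store = set()  # the extract store starts empty
    results = []
    for region in regions:
        # S12-1: highlight units that are not yet in the store.
        results.append(highlight(region, store))
        # S12-2: store the highlighted units (the others are already stored).
        store.update(region)
    return results

# Hand-segmented claims (assumed segmentation).
claim1 = ("刃 部材 の 先端 の 刃 部 を 凹凸 に 形成 し 波状 刃 と する とともに 螺旋 状 に "
          "湾曲 さ せ 、 前記 刃 部材 に 取っ手 を 取り付け た こと を 特徴 と する 草取り 鎌 。").split()
claim2 = "取っ手 の 上部 に 滑り止め 部 を 設け た こと を 特徴 と する 草取り 鎌 。".split()
claim3 = "取っ手 の 上部 及び 下部 に 滑り止め 部 を 設け た こと を 特徴 と する 草取り 鎌 。".split()

for line in method2([claim1, claim2, claim3]):
    print(line)
```

The first region is highlighted in full because the store is empty, and each later region shows only what it adds, so the result depends on the order in which the regions are processed.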
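[0050]
②: Example of Method 2
・Method 2 is illustrated using the claims of a patent specification, with the word as the extraction unit. This second method takes all words from every claim that precedes the claim currently being analyzed, and identifies the words in the current claim that do not appear in any preceding claim. The result is shown in Example 5 below.
[0051]
Example 5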
〔請求項１〕《刃部材の先端の刃部を凹凸に形成し波状刃とするとともに螺旋状に湾曲させ、前記刃部材に取っ手を取り付けたことを特徴とする草取り鎌。》
〔請求項２〕取っ手の《上部》に《滑り止め》部を《設け》たことを特徴とする草取り鎌。
〔請求項３〕取っ手の上部《及び下部》に滑り止め部を設けたことを特徴とする草取り鎌。
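[0052]
In this case 「滑り止め部」 (the anti-slip part), the feature of claims 2 and 3, was extracted. With this method, newly appearing words can be extracted as differences.
[0053]
・An example of applying Method 2 to ordinary text follows. Here both the extraction unit and the unit of the detection region are the word.
[0054]
Example 6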
《本研究の目的は，日本語》の《受け身文》，《使役》文《を能動》文《に変換する際》に《変更され》《るべき格助詞》を《機械学習》を《用いて自動》変換する《ことである．》日本語の受け身文，使役文の《例》を《図1 と》図《2 》に《あげる》．図1 の文の日本語の《接尾辞「》れ《た」》は《受動態》を《示す助動詞》で《あり》，《この》文は受け身文である．図2 の文の日本語の接尾辞「《せ》た」は使役を示す助動詞であり，この文は使役文である．《これら》の文に《対》《応》する能動文を図《3 》に示す．図1 の文《が》能動文に変換さ《れるとき》は，《(i) 》格助詞「に」は格助詞「が」に《(ii)》格助詞「が」は格助詞「を」に変換される．図2 の文が能動文に変換されるときは，(i) 格助詞「が」の《部分》「《彼》が」の《文節》が《消去》され，(ii)格助詞「に」が格助詞「が」に変換され，《(iii) 》格助詞「を」は変換され《ず》に《そのまま残る》．本研究では，これらの格助詞の変換《( 》例《：》格《助》《詞》「に」の格助詞「が」《へ》の変換《) 》と，《不要》部分の消去( 例：「彼が」の消去) を，研究の《対象》とする．( 《以降》，《本稿》では《便宜上》「彼が」《など》の消去の部分《も》格助詞の変換と《呼ぶ》．)
受け身文，使役文の能動文への変換は，文《生成》，《言い換え》，文の《平易化／言語》《運用支援》，《自然》言語文《から》の《知識獲得や情報抽出》，《質問応答システム》と《多く》の研究《分野》で《役に立つもの》である．《例えば》，質問応答システムでは，質問文が《能》《動》文で《答え》が《受動》文で《書か》れて《いる場合》，質問文と答えを《含む》文で，文の《構造》が《異なるため》に，質問の答えを《取り出す》のが《困難な》場合がある．この《よう》な《問》《題》も受け身文，使役文の能動文への変換が《できる》ように《なる》と《解決》する《のであ》る．このように受け身文，使役文の能動文への変換は，自然言語《処理》で《重要》なものである．
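[0055]
The display in Example 6 shows, among other things, that the second paragraph introduces new topics such as 《生成》 (generation), 《言い換え》 (paraphrasing), and 《平易化／言語》 (simplification / language). The second paragraph also uses many evaluative expressions such as 《役に立つもの》 (useful), 《困難な》 (difficult), 《できる》 (can), and 《重要》 (important), so it is easy to see that it describes the validity and usefulness of the approach.
[0056]
・An example of applying Method 2 to the detailed description of an invention follows. Here both the extraction unit and the unit of the detection region are the word.
[0057]
Example 7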
《次に、本発明について図面を参照して説明する。図１は》本発明《である草取り鎌の正面》図、図《２》は本発明である草取り鎌の《背面》図、図《３》は、本発明である草取り鎌の《右側面》である。
《〔０００７〕》
本草取り鎌１は、図３《に示すよう》に、《刃部材》２の刃《部》２《ｂ》は《当該先端》の《一面が波状》の波状刃《５》に《形成され》て《いるとともに》背面が《平坦》に形成されている刃部材２《と》、《取っ手》３《から構成》されている。
〔《０００８》〕
刃部材２は、図１、図２《及び》図３に示すように、《延長》部２《ａ》が《あり》取っ手３の《約》２《倍程》の《長》さがある。波状刃５の刃部２ｂは《一方向》に《湾曲》している。
〔《０００９》〕
図《４》は本発明の草取り鎌の刃部の正面《拡大》図である。図に示すように、《雑草》を《刈り取る》刃部２ｂは、《凸》部５ａと《凹》部５ｂが《交互》に《存在》し波状と《なっ》ている。
〔《００１０》〕
図５は本発明である草取り鎌の刃部の拡大図である。刃部２ｂを構成する凸部５ａの先端は《やや左方向》に《傾い》ている。《これ》は、雑草を《より引っ掛け》て《刈り取り易く》する《ため》である。
〔《００１１》〕
図《６》は本発明である草取り鎌の刃部の湾曲《状態》を《示した一部》拡大図である。図に示すように、刃部２ｂの延長部２ａより刃部２ｂの先端２《ｃ》は《垂直線》６からより湾曲している。
〔《００１２》〕
図《７》は、図《中》の《Ａ−》Ａ線に《沿っ》た《断面》図である。刃部２ｂの《上面》７は《傾斜》し、凸部５ａの先端５ｃは《尖っ》ている。《そして》、刃部２ｂ《自体》が湾曲するとともに《螺旋》している。
〔《００１３》〕
図《８》は、本発明である草取り鎌の《他》の《実施例》の正面図、図《９》は本発明である草取り鎌の他の実施例の背面図、図《１０》は本発明である草取り鎌の他の実施例の右側面図、図《１１》は、本発明である草取り鎌の他の実施例の一部拡大図である。
〔《００１４》〕
本例の草取り鎌１ａは、刃部材２の延長部２ａが《短い》とともに刃部２ｂの《部分》がやや《大きく》形成してある。
〔《００１５》〕
《また》、取っ手３が《長く》、《握り》部３ｂの《上》に、握り部３ｂの《径》よりやや《大きい》径の《上滑り止め》部３ａを《設ける》とともに、《下》に《も同様》に握り部３ｂより《大》径の下《滑り》止め部３《ｃｂ》を《設け》てある。
〔《００１６》〕
図１０に示すように、本例の草取り鎌１ａの刃部２ｂも図１から図７《まで》に示した草取り鎌１と同様に螺旋《状》に湾曲している。
〔《００１７》〕
《この》ように、先端部が螺旋状に湾曲さ《せること》により、《芝生等》に《生え》ている雑草を《根こそぎ取り除く》ことが《容易》と《なる》。
〔《００１８》〕
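[0058]
In Example 7, paragraph 0012 shows that 「螺旋」 (spiral) first appears there. Paragraph 0015 shows that 「滑り止め部」 (the anti-slip part) is important. Paragraph 0017 shows that the interesting expression 「根こそぎ取り除く」 (remove by the roots) first appears there.
[0059]
・An example of applying Method 2 to English text follows. Here both the extraction unit and the unit of the detection region are the word. No stemming was performed; words were recognized by splitting at spaces.
[0060]
Example 8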
《In the PATENT task of NTCIR-3, we participated in 》 the《optional task,》《where 》 the《participants can perform any kind 》 of 《research related to 》《patents. We think that》 in 《a 》 PATENT 《attempt,》 the optional task《is very 》《interesting, because》 we 《have already heard》 that 《some》 participants in《previous contests wanted》 to 《make their studies as freely》 as 《they》《wanted. Various new ideas or》 new《topics will come up 》 in 《an》 optional 《task. These attempts would be novel and valuable. 》 In the 《other 》《contests, too,》 we 《hope》 that 《such》 attempts will be 《made. 》
In《this contest, 》 we 《made》 the《following three 》 studies《for 》 the optional task of 《PATENT. 》
We 《extracted rewriting rules using data》 of patents.
We 《aligned 》 the《claim 》 of a 《patent》 and《its embodiment. 》
We extracted 《differences among plural claims 》 in a 《patent. 》《The first two 》 topics 《were given by organizers》 of PATENT as 《examples》 of the optional task. We 《consider these》 studies to be very 《interesting.》 The《last topic》 is 《our idea. 》 We 《sometimes write 》 a《patent, 》and 《had 》 the《experience》 of 《wanting 》 to 《know》 the《difference》 of 《claims. So, 》 we 《did 》 this 《study.》 We have 《been studying natural language processing 》 using the《Unix》《command Diff. 》 We 《previously proposed ways》 to 《use Diff》 in natural language《processing. 》 The Diff command is very 《suitable》 for《doing 》 the《above 》 three《studies.》 We have already extracted rewriting rules by using Diff in some research 《topics. For example,》 we 《used》 a《pair》 of 《definition sentences having 》 the《same word entry 》 in two 《different 》《dictionaries》 and extracted the differences《between them. 》 Theseextracted differences can be used as《synonym phrases 》 because the definition sentences in the same entry have the same 《meaning.》 In 《another situation,》 we used aligned《spoken-language 》 and《written-language texts》 and extracted the differences betweenthem. These extracted differences can be used as rewriting rules《transforming》 spoken-language sentences《into》 written-language sentences or transforming written-language sentences into spoken-language 《sentences.》 Diff can 《also》 be used for《alignment.》 Diff 《has 》 a《function》 of 《merging 》 data 《like》 a《DP-matching algorithm. So 》 we can 《align 》 two relatedtexts by using Diff. In this《study,》 we used this function for the《alignment 》 of a patent claim and its《embodiment (working 》《example). Finally,》 we used Diff for 《extracting》 the differences of patent claims.《Extracting》 differences is an《original》 function of Diff.Extracting differences between claims 《enables us》 to 《understand》 the claims of a patent 《more deeply.》
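[0061]
In Example 8, the main words of each item in the bulleted list near the middle are highlighted, namely 《extracted rewriting rules using data》, 《aligned 》 the《claim 》of a《patent》 and《its embodiment. 》, and 《differences among plural claims 》, so the point of each item can be grasped easily.
[0062]
In the last paragraph it is clear that the discussion of Diff begins. Key words and phrases such as 《definition sentences》, 《synonym phrases 》, 《spoken-language 》, 《written-language texts》, and 《DP-matching algorithm 》 also catch the eye immediately, which is convenient for understanding the content.
[0063]
(3): Description of the document difference detection apparatus with a user dictionary
Each user keeps a user dictionary in advance, and items in that dictionary are not highlighted. Unimportant expressions are thereby excluded from highlighting beforehand, making the output easier to read.
[0064]
FIG. 5 is an explanatory diagram of the document difference detection apparatus with a user dictionary. In FIG. 5, the apparatus is provided with an input means 1, an extraction means 2, an extract storage device 3, an output means 4, and a user dictionary 5. The input means 1 is a keyboard, mouse, reading device, or the like used to input information. The extraction means 2 extracts the differences in the input documents. The extract storage device 3 is an extract storage means that stores extracted items such as words, kanji, and noun phrases. The output means 4 is a display device, printer, or the like used to output information. The user dictionary 5 is a dictionary that each user registers in advance.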
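[0065]
①: Method 1 with a user dictionary
FIG. 6 is a flowchart of the document difference detection process of Method 1 with a user dictionary. The description below follows steps S21 to S23-2 of FIG. 6.
[0066]
S21: The extraction unit and the unit of the detection region are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit of what is output as a difference; candidates include "word", "kanji", and "noun phrase". The unit of the detection region is the unit of the region compared in order to detect differences; candidates include "character", "word", "sentence", "bulleted-list item", and "paragraph".
[0067]
S22: The extraction means 2 stores all of the input data in a memory means (inside the extraction means 2).
[0068]
S23: The extraction means 2 scans the input data from the left and, starting from the leftmost detection region, repeats steps S23-1 and S23-2 below for each detection region determined in step S21.
[0069]
S23-1: The extraction means 2 extracts every item corresponding to the extraction unit (for example, every word) from all regions other than the current detection region and stores them in the extract storage device 3.
[0070]
S23-2: The extraction means 2 outputs the text of the current detection region to the output means 4, highlighting every item corresponding to the extraction unit (for example, every word) that is stored in neither the extract storage device 3 nor the user dictionary 5.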
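[0071]
②: Method 2 with a user dictionary
FIG. 7 is a flowchart of the document difference detection process of Method 2 with a user dictionary. The description below follows steps S31 to S32-2 of FIG. 7.
[0072]
S31: The extraction unit and the unit of the detection region are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit of what is output as a difference; candidates include "word", "kanji", and "noun phrase". The unit of the detection region is the unit of the region compared in order to detect differences; candidates include "character", "word", "sentence", "bulleted-list item", and "paragraph".
[0073]
S32: Input data is supplied from the input means 1 one detection region (as determined in step S31) at a time, and the extraction means 2 repeats steps S32-1 and S32-2 below.
[0074]
S32-1: The extraction means 2 outputs the text of the current detection region to the output means 4, highlighting every item corresponding to the extraction unit (for example, every word) that is stored in neither the extract storage device 3 nor the user dictionary. The extract storage device 3 is initially empty.
[0075]
S32-2: The expressions highlighted in step S32-1 are stored in the extract storage device 3.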
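[0076]
③: Method 2 with a user dictionary (alternative implementation)
FIG. 8 is a flowchart of the document difference detection process of Method 2 with a user dictionary (alternative implementation). The description below follows steps S41 to S43-2 of FIG. 8.
[0077]
S41: The extraction unit and the unit of the detection region are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit of what is output as a difference; candidates include "word", "kanji", and "noun phrase". The unit of the detection region is the unit of the region compared in order to detect differences; candidates include "character", "word", "sentence", "bulleted-list item", and "paragraph".
[0078]
S42: The extraction means 2 stores the entire contents of the user dictionary 5 in the extract storage device 3.
[0079]
S43: Input data is supplied from the input means 1 one detection region (as determined in step S41) at a time, and the extraction means 2 repeats steps S43-1 and S43-2 below.
[0080]
S43-1: The extraction means 2 outputs the text of the current detection region to the output means 4, highlighting every item corresponding to the extraction unit (for example, every word) that is not stored in the extract storage device 3.
[0081]
S43-2: The expressions highlighted in step S43-1 are stored in the extract storage device 3.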
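The alternative implementation (S42 to S43-2) differs from plain Method 2 only in that the store is preloaded with the user dictionary. A minimal sketch follows; the claims are segmented by hand, the small dictionary of function words is made up for this illustration, and the names `highlight` and `method2_with_user_dictionary` are hypothetical.

```python
def highlight(tokens, known):
    """Join tokens, wrapping each run of tokens absent from `known` in 《 》."""
    out, in_run = [], False
    for tok in tokens:
        is_new = tok not in known
        if is_new and not in_run:
            out.append("《")
        elif not is_new and in_run:
            out.append("》")
        out.append(tok)
        in_run = is_new
    if in_run:
        out.append("》")
    return "".join(out)

def method2_with_user_dictionary(regions, user_dictionary):
    store = set(user_dictionary)      # S42: preload the dictionary entries
    results = []
    for region in regions:            # S43: one detection region at a time
        results.append(highlight(region, store))  # S43-1
        store.update(region)          # S43-2
    return results

# Hand-segmented claims and an illustrative dictionary (both assumptions).
claim2 = "取っ手 の 上部 に 滑り止め 部 を 設け た こと を 特徴 と する 草取り 鎌 。".split()
claim3 = "取っ手 の 上部 及び 下部 に 滑り止め 部 を 設け た こと を 特徴 と する 草取り 鎌 。".split()
function_words = {"の", "に", "を", "た", "こと", "と", "する", "。"}

for line in method2_with_user_dictionary([claim2, claim3], function_words):
    print(line)
```

Preloading gives the same highlighting as checking the store and the dictionary separately in every step, because a unit is left plain exactly when it is in either collection.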
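[0082]
・Consider a text (Example 9) that is displayed as follows when no user dictionary is used. Here Method 2 is used, and both the extraction unit and the unit of the detection region are the word.
[0083]
Example 9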
《本研究の目的は，日本語》の《受け身文》，《使役》文《を能動》文《に変換する際》に《変更され》《るべき格助詞》を《機械学習》を《用いて自動》変換する《ことである．》日本語の受け身文，使役文の《例》を《図1 と》図《2 》に《あげる》．図1 の文の日本語の《接尾辞「》れ《た」》は《受動態》
を《示す助動詞》で《あり》，《この》文は受け身文である．図2 の文の日本語の接尾辞「《せ》た」は使役を示す助動詞であり，この文は使役文である．《これら》の文に《対》《応》する能動文を図《3 》に示す．図1 の文《が》能動文に変換さ《れるとき》は，《(i) 》格助詞「に」は格助詞「が」に《(ii)》格助詞「が」は格助詞「を」に変換される．図2 の文が能動文に変換されるときは，(i) 格助詞「が」の《部分》「《彼》が」の《文節》が《消去》され，(ii)格助詞「に」が格助詞「が」に変換され，《(iii) 》格助詞「を」は変換され《ず》に《そのまま残る》．本研究では，これらの格助詞の変換《( 》例《：》格《助》《詞》「に」の格助詞「が」《へ》の変換《) 》と，《不要》部分の消去( 例：「彼が」の消去) を，研究の《対象》とする．( 《以降》，《本稿》では《便宜上》「彼が」《など》の消去の部分《も》格助詞の変換と《呼ぶ》．)
受け身文，使役文の能動文への変換は，文《生成》，《言い換え》，文の《平易化／言語》《運用支援》，《自然》言語文《から》の《知識獲得や情報抽出》，《質問応答システム》と《多く》の研究《分野》で《役に立つもの》である．《例えば》，質問応答システムでは，質問文が《能》《動》文で《答え》が《受動》文で《書か》れて《いる場合》，質問文と答えを《含む》文で，文の《構造》が《異なるため》に，質問の答えを《取り出す》のが《困難な》場合がある．この《よう》な《問》《題》も受け身文，使役文の能動文への変換が《できる》ように《なる》と《解決》する《のであ》る．このように受け身文，使役文の能動文への変換は，自然言語《処理》で《重要》なものである．
[0084]
・Words that occurred frequently in the inventor's other papers are registered in the user dictionary.
(Example of user dictionary entries)
の, を, ，, ．, で, は, と, に, が, て, こと, し, する, た, よう, 部分, な, データ, 差分, ある, この, 村田, いる, 」, 「, 研究, できる,diff,),対応, も, システム, 処理, 言語,(, また, ファイル, 用い, もの
Words such as those listed above are registered. In this example the entries are separated by ", ".
[0085]
With this dictionary, Example 9 gives the following result.
《本》研究の《目的》は，《日本語》の《受け身文》，《使役》文を《能動》文に《変換》する《際》に《変更され》《るべき格助詞》を《機械学習》を用いて《自動》変換することである．日本語の受け身文，使役文の《例》を《図1 》と図《2 》に《あげる》．図1 の文の日本語の《接尾辞》「れた」は《受動態》
を《示す助動詞》で《あり》，《この》文は受け身文である．図2 の文の日本語の接尾辞「《せ》た」は使役を示す助動詞であり，この文は使役文である．《これら》の文に《対》《応》する能動文を図《3 》に示す．図1 の文が能動文に変換さ《れるとき》は，《(i) 》格助詞「に」は格助詞「が」に《(ii)》格助詞「が」は格助詞「を」に変換される．図2 の文が能動文に変換されるときは，(i) 格助詞「が」の部分「《彼》が」の《文節》が《消去》され，(ii)格助詞「に」が格助詞「が」に変換され，《(iii) 》格助詞「を」は変換され《ず》に《そのまま残る》．本研究では，これらの格助詞の変換( 例《：》格《助》《詞》「に」の格助詞「が」《へ》の変換) と，《不要》部分の消去( 例：「彼が」の消去) を，研究の《対象》とする．( 《以降》，《本稿》では《便宜上》「彼が」《など》の消去の部分も格助詞の変換と《呼ぶ》．)
受け身文，使役文の能動文への変換は，文《生成》，《言い換え》，文の《平易化／》言語《運用支援》，《自然》言語文《から》の《知識獲得や情報抽出》，《質問応答》システムと《多く》の研究《分野》で《役に立つ》ものである．《例えば》，質問応答システムでは，質問文が《能》《動》文で《答え》が《受動》文で《書か》れている《場合》，質問文と答えを《含む》文で，文の《構造》が《異なるため》に，質問の答えを《取り出す》のが《困難》な場合がある．このような《問》《題》も受け身文，使役文の能動文への変換ができるように《なる》と《解決》する《のであ》る．このように受け身文，使役文の能動文への変換は，自然言語処理で《重要》なものである．
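[0086]
The change in the result above is not large, but unimportant expressions such as the first 「研究」 (research) and 「ことである．」 are no longer picked up, making the result somewhat easier to read. Registering more unimportant words in the user dictionary 5 makes it easier still.
[0087]
In the embodiments above, highlighting is done by enclosing text in double angle brackets, but other forms of highlighting can also be used, such as underlining, color coding, a changed background, a changed typeface, or blinking.
[0088]
Such a method can also be used to study the problem of new and old information in anaphora resolution. Only Method 2 is applicable to this problem; Method 1 is not. Method 2 highlights newly appearing expressions, and in linguistics an entity newly introduced in a text is called "new information". Method 2 therefore extracts what linguistics calls new information, and its results can be used to study new information. With linguistic expressions, however, the same entity may be referred to by different expressions. In that case an expression is new even though it conveys old information, so Method 2 may highlight it. In other words, the method does not distinguish all "new information" from "old information" correctly. Even so, Method 2 is useful for studying new and old information.
[0089]
Furthermore, by using the kanji as the extraction unit, the appearance of new kanji can be understood easily, for example in school education. Kanji can be compared by their character codes, so no morphological analysis means is needed, unlike for words.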
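With the kanji as the extraction unit, Method 2 needs nothing beyond a character-code test. The sketch below lists the kanji that first appear in each region instead of marking them inline. It assumes Unicode text and treats the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "kanji", which is a simplification; the function names are hypothetical.

```python
def is_kanji(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"

def new_kanji_per_region(regions):
    """For each region, list the kanji not seen in any earlier region."""
    seen, result = set(), []
    for text in regions:
        fresh = []
        for ch in text:
            if is_kanji(ch) and ch not in seen and ch not in fresh:
                fresh.append(ch)
        result.append(fresh)
        seen.update(fresh)
    return result

print(new_kanji_per_region(["山と川", "山と森"]))
```

Applied to the lessons of a textbook in order, such a list would show which characters each lesson introduces.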
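[0090]
(4): Description of program installation
The input means 1, extraction means 2, extract storage device 3, output means 4, user dictionary 5, extraction/detection region setting means 21, and so on can be implemented as programs that are stored in main memory and executed by the main control unit (CPU). These programs run on an ordinary computer, which consists of hardware such as a main control unit, main memory, a file device, a display device, and an input device such as a keyboard. The program of the present invention is installed on this computer. For installation, the programs are stored on a portable recording (storage) medium such as a floppy disk or magneto-optical disk and installed into the computer's file device via a drive device for accessing the recording medium, or via a network such as a LAN. The program steps needed for processing are then read from the file device into main memory and executed by the main control unit.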
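[0091]
[Effects of the Invention]
As described above, the present invention has the following effects.
[0092]
(1): For each detection region, the extraction means extracts every item corresponding to the extraction unit from the regions of the input document data other than the current detection region, stores them in the storage means, and outputs the text of the current detection region with every item corresponding to the extraction unit that is not stored in the storage means highlighted. The features and differences of a document, which are its new information, can therefore be extracted and displayed easily.
[0093]
(2): For each detection region, the extraction means outputs the text of the current detection region of the input document data with every item corresponding to the extraction unit that is not stored in the storage means highlighted, and then stores the highlighted items in the storage means. Newly appearing items corresponding to the extraction unit (for example, words) can therefore be extracted and displayed easily.
[0094]
(3): Because data of extraction units that are not to be highlighted is stored in the storage means in advance, expressions that are not very important can be excluded from highlighting beforehand, making the output easier to read.
[0095]
(4): Because the extraction unit is the word, newly appearing words can be extracted and displayed.
[0096]
(5): Because the unit of the detection region is the bulleted-list item, the differences among list items can be understood easily.
[0097]
(6): Because the unit of the detection region is the patent claim, the features of and differences among claims can be understood easily.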
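[Brief Description of the Drawings]
[FIG. 1] Diagram illustrating the principle of the present invention.
[FIG. 2] Explanatory diagram of the document difference detection apparatus in the embodiment.
[FIG. 3] Flowchart of the document difference detection process of Method 1 in the embodiment.
[FIG. 4] Flowchart of the document difference detection process of Method 2 in the embodiment.
[FIG. 5] Explanatory diagram of the document difference detection apparatus with a user dictionary in the embodiment.
[FIG. 6] Flowchart of the document difference detection process of Method 1 with a user dictionary in the embodiment.
[FIG. 7] Flowchart of the document difference detection process of Method 2 with a user dictionary in the embodiment.
[FIG. 8] Flowchart of the document difference detection process of Method 2 with a user dictionary (alternative implementation) in the embodiment.
[Explanation of Reference Signs]
2 Extraction means
3a Storage means
21 Extraction/detection region setting means
[0001]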
BACKGROUND OF THE INVENTION
The present invention relates to a document difference detection apparatus and a program for detecting a difference between documents (or sentences) so that the difference between documents can be easily understood.
[0002]
[Prior art]
Conventionally, there has been a technique that uses the diff command to detect differences among multiple input document data sets, outputting each common part once and outputting the mismatched parts side by side.
[0003]
Here, diff is a file comparison tool diff of UNIX (registered trademark). This diff command outputs the difference between two given files in units of lines while maintaining the order information.
[0004]
The diff command has a convenient option, -D. When diff is run with this option, it outputs the common parts as well as the differing parts; in other words, it merges the files. To make the differing parts easy to see, markers are displayed for the beginning of a differing part, the end of a differing part, and the boundary between the two pieces of data that make up the difference. diff used for merging files in this way is called Mdiff (the M stands for merge) (see, for example, Non-Patent Document 1 and Japanese Patent Application No. 2001-311329).
[0005]
Using this technique, an experiment was carried out to detect the differences among multiple claims of a single patent. This is a new attempt. Two claims of a patent were reformatted so that each line contains one word, and their Mdiff was taken (in the following description, the lenticular brackets 【 】 of claim headings and the like are replaced with 〔 and 〕).
[0006]
Example 1,
17. The printer system control method according to claim 16, wherein the printer system includes a host device.
18. The printer system control method according to claim 16, wherein the printer system includes a printer.
[0007]
(Mdiff result of Example 1 above)
前記プリンタシステムは (the printer system)
;=====begin=====
上位装置 (host device)
；────────
プリンタ (printer)
;=====end=====
を有することを特徴とする請求項１６記載のプリンタシステムの制御方法 (remainder shared by both claims)
。
[0008]
From the Mdiff result for claims 17 and 18 in Example 1 above, the difference between claim 17 and claim 18 can be understood very easily. Here, ;=====begin===== marks the beginning of a differing part, ;=====end===== marks its end, and ；──────── marks the boundary between the two pieces of data that make up the difference. In this case the difference is "host device" versus "printer". When the differences are more complicated, however, the Mdiff result becomes hard to read.
[0009]
Example 2,
[Claim 1]
A weeding sickle characterized in that the blade portion at the tip of the blade member is formed into an irregular shape to form a wavy blade and is curved spirally, and a handle is attached to the blade member.
[Claim 2]
A weeding sickle characterized by having anti-slip parts on the upper and lower parts of the handle.
[0010]
(Mdiff result of Example 2 above)
;=====begin=====
刃部材 (blade member)
；────────
取っ手 (handle)
;=====end=====
の
;=====begin=====
先端の刃 (blade at the tip)
；────────
上部及び下部に滑り止め (anti-slip at the top and bottom)
;=====end=====
部を
;=====begin=====
凹凸に形成し波状刃とするとともに螺旋状に湾曲させ、前記刃部材に取っ手を取り付け (formed with ridges into a wavy blade and curved spirally, with a handle attached to the blade member)
；────────
設け (provided)
;=====end=====
たことを特徴とする草取り鎌。 (ending shared by both claims)
[0011]
The Mdiff result for claims 1 and 2 in Example 2 above is hard to read because the differences are complicated. That is, because Mdiff is a mechanism that preserves order information, the differences become difficult to grasp when they are complex, and it was found that Mdiff as it stands has a problem.
[0012]
[Non-Patent Document 1]
Masaki Murata and one other, "diff and language processing", "Language Understanding and Communication", The Institute of Electronics, Information and Communication Engineers, July 17, 2001 (NLC2001-26), IEICE Technical Report, pp. 29-36
[0013]
[Problems to be solved by the invention]
In the case of using the conventional Mdiff, when the difference is complicated, the result of the Mdiff is difficult to see.
[0014]
An object of the present invention is to solve the above-described problems and to provide an easy-to-understand display even when the difference is complicated.
[0015]
[Means for Solving the Problems]
FIG. 1 is a diagram illustrating the principle of the present invention. In FIG. 1, 2 is an extraction means, 3a is a storage means, and 21 is an extraction / detection area setting means.
[0016]
The present invention has the following means in order to solve the conventional problems.
[0017]
(1): The apparatus comprises an input means for inputting information; an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via the input means; a storage means 3a for storing information; and an extraction means 2. The extraction means 2 repeats the following for each detection region: it extracts every item corresponding to the extraction unit from the regions of the input document data other than the current detection region and stores them in the storage means 3a, and then outputs the text of the current detection region with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted. This makes it easy to extract and display the features and differences of a document, which are its new information.
[0018]
(2): The apparatus comprises an input means for inputting information; an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via the input means; a storage means 3a for storing information; and an extraction means 2. The extraction means 2 repeats the following for each detection region: it outputs the text of the current detection region of the input document data with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted, and then stores the highlighted items in the storage means 3a. This makes it easy to extract and display newly appearing items corresponding to the extraction unit (for example, words).
[0019]
(3): In the document difference detection apparatus of (1) or (2), data of extraction units that are not to be highlighted is stored in the storage means 3a in advance. Expressions that are not very important can thus be excluded from highlighting beforehand, making the output easier to read.
[0020]
(4): In the document difference detection apparatus of (1) to (3), the extraction unit is the word. Newly appearing words can thus be extracted and displayed.
[0021]
(5): In the document difference detection apparatus of (1) to (4), the unit of the detection region is the bulleted-list item. The differences among list items can thus be understood easily.
[0022]
(6): In the document difference detection apparatus of (1) to (4), the unit of the detection region is the patent claim. The features of and differences among claims can thus be understood easily.
[0023]
(7): A program, or a computer-readable recording medium on which the program is recorded, for causing a computer to function as: an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via an input means; and an extraction means 2 that repeats, for each detection region, extracting every item corresponding to the extraction unit from the regions of the input document data other than the current detection region and storing them in a storage means 3a, and then outputting the text of the current detection region with every item corresponding to the extraction unit that is not stored in the storage means 3a highlighted. Installing this program on a computer makes it easy to provide a document difference detection apparatus that can easily extract and display the features and differences of a document.
[0024]
(8): A program, or a computer-readable recording medium on which the program is recorded, for causing a computer to function as: an extraction/detection region setting means 21 in which an extraction unit, which is the unit of what is output as a difference in document data, and a detection region, which is the unit of the region compared in order to detect differences in document data, are set via an input means; and an extraction means 2 that repeats, for each detection region, outputting the text of the current detection region of the input document data with every item corresponding to the extraction unit that is not stored in a storage means 3a highlighted, and then storing the highlighted items in the storage means 3a. Installing this program on a computer makes it easy to provide a document difference detection apparatus that can extract and display newly appearing items corresponding to the extraction unit.
[0025]
DETAILED DESCRIPTION OF THE INVENTION
(1): Description of document difference detection apparatus
FIG. 2 is an explanatory diagram of the document difference detection apparatus. In FIG. 2, the apparatus is provided with an input means 1, an extraction means 2, an extract storage device 3, and an output means 4. The input means 1 is a keyboard, mouse, reading device, or the like used to input information. The extraction means 2 extracts the differences in the input documents. The extract storage device 3 is an extract storage means that stores extracted items such as words, kanji, and noun phrases. The output means 4 is a display device, printer, or the like used to output information.
[0026]
①: Description of the morphological analysis system
To split Japanese text into words, the extraction means 2 needs a morphological analysis system. ChaSen is described here (the morphological analysis system ChaSen (茶筌) is developed at the Nara Institute of Science and Technology and is available at http://chasen.aist-nara.ac.jp/index.html.jp ).
[0027]
ChaSen splits a Japanese sentence into words and also estimates the part of speech of each word. For example, the input 「学校へ行く」 ("go to school") gives the following result.
[0028]
学校  ガッコウ  学校  名詞−一般
へ  ヘ  へ  助詞−格助詞−一般
行く  イク  行く  動詞−自立  五段・カ行促音便  基本形
EOS
The text is split so that each line holds one word, and each word is annotated with its reading and part of speech. The extraction means 2 of the present invention uses only the word-splitting part of this functionality (the morphological analysis means).
[0029]
②: Description of English stemmers
English words are already separated by spaces, so for the extraction means 2 to extract words it only needs to perform stemming, which reduces each word to its base form. A well-known stemming algorithm is Porter's (see Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137).
[0030]
③: Description of extraction units and detection regions
Characters, paragraphs, sentences, bulleted-list items, and the like can be recognized mechanically from the form of the document. Characters can be recognized from their 1-byte or 2-byte codes. Paragraphs can be recognized from indentation and line breaks. Sentences can be recognized from the presence of a Japanese full stop (。) or a period. List items can be recognized from indentation, line breaks, the symbol at the start of each item, and so on. Words are recognized by the morphological analysis system or stemmer described above. This recognition can be carried out, for example, by providing a recognition means for each unit inside the extraction means.
[0031]
(2): Explanation of difference detection
The present invention provides two methods for detecting differences. Neither method uses the diff command. These two methods are described below with reference to flowcharts.
[0032]
(1): Method 1
FIG. 3 is a flowchart of the document difference detection process according to Method 1. The process is described below following steps S1 to S3-2 of FIG. 3.
[0033]
S1: The extraction unit and the detection area unit are determined in advance via the input means 1 or the like. The extraction unit is the unit that is output as a difference; it may be “word”, “kanji”, “noun phrase”, or the like. The detection area unit is the unit of the areas that are compared in order to detect differences; possible units include “character”, “word”, “sentence”, “bullet item”, “paragraph”, and “patent claim”.
[0034]
S2: The extraction means 2 stores all input data in the storage means (in the extraction means 2).
[0035]
S3: The extraction means 2 examines the input data from the left (that is, from the beginning) and repeats the following steps S3-1 and S3-2 for each detection area determined in step S1, starting with the leftmost detection area.
[0036]
S3-1: The extraction means 2 extracts all items (for example, words) corresponding to the extraction unit from all areas other than the current detection area and stores them in the extract storage device 3.
[0037]
S3-2: The extraction means 2 outputs the text of the current detection area to the output means 4, highlighting every item (for example, a word) corresponding to the extraction unit in the current detection area that is not stored in the extract storage device 3.
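Steps S1 to S3-2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the specification: the function name `method1` is hypothetical, whitespace splitting stands in for the morphological analyzer, the stored set is rebuilt for each detection area, and the sample areas are simplified English phrases invented for the illustration.

```python
def method1(areas, left="<<", right=">>"):
    """Highlight, in each area, the words that appear in no other area."""
    # S2: hold all input data; whitespace splitting is a stand-in tokenizer.
    tokenized = [area.split() for area in areas]
    results = []
    for i, tokens in enumerate(tokenized):            # S3: each detection area
        # S3-1: collect every word from all areas other than the current one.
        stored = set()
        for j, other in enumerate(tokenized):
            if j != i:
                stored.update(other)
        # S3-2: output the current area, highlighting words not in storage.
        marked = [f"{left}{t}{right}" if t not in stored else t for t in tokens]
        results.append(" ".join(marked))
    return results

if __name__ == "__main__":
    for line in method1(["the handle has a blade", "the handle has a grip"]):
        print(line)
```

Because every area is compared against all the others, a word shared by any two areas is never highlighted in either of them.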
[0038]
(2): Explanation of Method 1 by example
An example of method 1 will be described with the extraction unit as a word, taking the claims (detection region) of the patent specification as an example. All words are extracted from all claims other than the currently analyzed claim, and words that do not appear in other claims are identified in the currently analyzed claim. The results are shown in Example 3 below.
[0039]
Example 3
[Claim 1] A weeding sickle characterized in that the <<blade>> portion at the <<tip>> of the <<blade member>> is made <<uneven>> and <<formed into a wavy blade>> and is <<curved>> in a <<spiral>>, and a handle is attached to the blade member.
[Claim 2] A weeding sickle characterized in that an <<anti-slip>> portion is <<provided>> on the <<upper and lower parts>> of the handle.
[0040]
In Example 3 above, words that do not appear in the other claims are highlighted by enclosing them in double angle brackets “<<” and “>>”. This result is much easier to read than the Mdiff result in Example 2. From Example 3, it can be understood very easily that the feature of [Claim 2] is the “upper and lower anti-slip portions”. Once it is understood that the feature of claim 2 is the “anti-slip portion”, the paragraphs of the embodiments and examples that contain the term “anti-slip portion” can easily be extracted, and so the embodiments and examples corresponding to claim 2 can also be extracted.
[0041]
Thus, this method is very useful for extracting features and differences. Further, it is useful for extracting embodiments and examples corresponding to certain claims, that is, for associating claims with the embodiments and examples.
[0042]
Next, Method 1 was tried on another example with three claims. The result shown in Example 4 below was obtained.
[0043]
Example 4
[Claim 1] A weeding sickle characterized in that the <<blade>> portion at the <<tip>> of the <<blade member>> is made <<uneven>> and <<formed into a wavy blade>> and is <<curved>> in a <<spiral>>, and a handle is attached to the blade member.
[Claim 2] A weeding sickle characterized in that an anti-slip portion is provided on the upper part of the handle.
[Claim 3] A weeding sickle characterized in that an anti-slip portion is provided on the upper and lower parts of the handle.
[0044]
In the result of Example 4 above, the “anti-slip portion” that is the feature of claims 2 and 3 could not be extracted, because the term appears in both claims and is therefore always found in some other detection area. To solve this problem, a second new method (Method 2) was devised.
[0045]
(2): Method 2
FIG. 4 is a flowchart of the document difference detection process of Method 2. The process is described below following steps S11 to S12-2 of FIG. 4.
[0046]
S11: The extraction unit and the detection area unit are determined in advance via the input means 1 or the like. The extraction unit is the unit that is output as a difference; it may be “word”, “kanji”, “noun phrase”, or the like. The detection area unit is the unit of the areas that are compared in order to detect differences; possible units include “character”, “word”, “sentence”, “bullet item”, “paragraph”, and “patent claim”.
[0047]
S12: Input data is input from the input means 1 for each detection region determined in the process S11, and the extracting means 2 repeats the following processes S12-1 and S12-2.
[0048]
S12-1: The extraction means 2 outputs the text of the current detection area to the output means 4, highlighting every item (for example, a word) corresponding to the extraction unit in the current detection area that is not stored in the extract storage device 3. The extract storage device 3 is initially empty.
[0049]
S12-2: The expression highlighted in the process S12-1 is stored in the extract storage device 3.
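Steps S11 to S12-2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `method2` is hypothetical, whitespace splitting stands in for the morphological analyzer, and the sample areas are simplified English phrases invented for the illustration.

```python
def method2(areas, left="<<", right=">>"):
    """Highlight, in each area, the words not seen in any preceding area."""
    stored = set()                     # extract storage device 3, initially empty
    results = []
    for area in areas:                 # S12: process the areas in input order
        tokens = area.split()          # stand-in tokenizer
        # S12-1: words not yet in storage are highlighted on output.
        new_words = {t for t in tokens if t not in stored}
        marked = [f"{left}{t}{right}" if t in new_words else t for t in tokens]
        results.append(" ".join(marked))
        # S12-2: the highlighted words are added to storage.
        stored.update(new_words)
    return results

if __name__ == "__main__":
    claims = ["a blade and a handle",
              "a grip on the upper handle",
              "a grip on the upper and lower handle"]
    for line in method2(claims):
        print(line)
```

Since storage is updated only after an area has been output, a new word that occurs twice within the same area is highlighted both times. Everything in the first area is highlighted because storage starts empty.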
[0050]
(2): Explanation by example of method 2
An example of Method 2 is described with the word as the extraction unit, taking the claims of a patent specification as an example. Method 2 extracts all words from all claims that precede the claim currently being analyzed and identifies, in the current claim, the words that do not appear in any preceding claim. The results are shown in Example 5 below.
[0051]
Example 5
[Claim 1] << A weeding sickle characterized in that the blade part at the tip of the blade member is formed into an irregular shape to form a wave-like blade and is curved in a spiral shape, and a handle is attached to the blade member. >>
[Claim 2] A weeding sickle characterized in that an <<anti-slip>> portion is <<provided>> on the <<upper part>> of the handle.
[Claim 3] A weeding sickle characterized in that an anti-slip portion is provided on the upper and lower parts of the handle.
[0052]
In this case, the “anti-slip portion” that is the feature of claims 2 and 3 could be extracted. With this method, newly appearing words can be extracted as differences.
[0053]
- An example of applying Method 2 to ordinary text is described. Here, both the extraction unit and the detection area unit are words.
[0054]
Example 6
《The purpose of this study is to use ‘changed’ and ‘machine particle’ as ‘changed’ in ‘passive sentence’ in ‘Japanese’ and ‘converting’ to ‘active’ sentence. Automatically. >> “Examples” of Japanese passive sentences and usage sentences are given in “Figure 1” and “Figure 2”. The Japanese << suffix ">>"<<"in the sentence in Figure 1 is" passive "is a" showing auxiliary verb "," is ", and" this "is a passive sentence. The Japanese suffix “<< se” ta ”in the sentence in Fig. 2 is an auxiliary verb indicating a working part, and this sentence is a working sentence. Figure << 3 >> shows an active sentence that corresponds to << the >> sentence. When the sentence << in Figure 1 is transformed into an active sentence, << (i) >> the case particle "ni" is the case particle "ga", << (ii) >> the case particle "ga" is the case particle " Is converted to. When the sentence in Fig. 2 is converted to an active sentence, (i) the << part >> of the case particle "ga", << the clause >> of << he >> is << erased >>, and (ii) the case particle "ni" Is converted to the case particle "ga", and << (iii) >> In this study, conversion of these case particles <<(>> example <<: >> case << auxiliary >><< verb >> conversion of the case particle "ga""to"<<)>> and elimination of the << unnecessary >> part ( Example: Erasing “He is” is the “subject” of the study. (From 《After》 and 《This paper》《Convenience》《He is》《Erase》《Erase》《Also》《Call》 Conversion of case particles.
The conversion of passive sentences and active sentences into active sentences is performed by using the sentence << Generation >>, << paraphrase >>, sentence << simplification / language >><< operation support >>, << natural >> language sentence << from >><< knowledge acquisition and information extraction 》, 《Question Answering System》 and 《Many》 Research 《Fields》 are 《Useful》. << For example >>, in the question answering system, the question sentence is a << noh >><< motion >> sentence and the << answer >> is << written >> with a << passive >> sentence. There are cases where it is “difficult” to “take out” an answer to a question because the “structure” of the sentence is “different”. These “like” “questions” and “titles” also become “resolve” and “solve” so that “passive” and active sentences can be converted into active sentences. In this way, the conversion of passive sentences and usage sentences into active sentences is “important” in the natural language “processing”.
[0055]
From the display of Example 6, it can be seen that topics such as “generation”, “paraphrase”, and “simplification” newly appear in the second paragraph. Also, because evaluative expressions such as “useful”, “difficult”, “can”, and “important” are frequent in the second paragraph, it is easy to understand that this paragraph describes the validity and usefulness of the method.
[0056]
An example of applying Method 2 to the detailed description of an invention is described. Here, both the extraction unit and the detection area unit are words.
[0057]
Example 7
Next, the present invention will be described with reference to the drawings. FIG. 1 is a front view of the weeding sickle according to the present invention, FIG. 2 is a rear view of the weeding sickle according to the present invention, and FIG. 3 is a right side of the weeding sickle according to the present invention. It is.
<< [0007] >>
The weeding sickle 1 is formed as shown in FIG. 3 <<, and the blade << part >> 2 << b >> of the << blade member >> 2 is << formed to the wavy blade << 5 >> of the << front end >><< one side is wavy >>. The blade member 2 << and >> whose back is formed to be "flat" and "handle" 3 << are constructed.
[<< 0008]
As shown in FIGS. 1, 2 << and >> FIG. 3, the blade member 2 has an << extension >> portion 2 << a >><< the length >> of the handle 3 << about >> 2 << double >>. . The blade portion 2b of the wavy blade 5 is << curved >> in one direction.
[<< 0009 >>]
Fig. << 4 >> is a front << enlarge >> figure of the blade portion of the weeding sickle of the present invention. As shown in the figure, the blade portion 2b that << reapses << weed >> has a << convex >> portion 5a and a << concave >> portion 5b << existing >> alternately and is wavy.
[<< 0010 >>]
FIG. 5 is an enlarged view of the blade portion of the weeding sickle according to the present invention. The tip of the convex portion 5a that constitutes the blade portion 2b is << inclined >> slightly << to the left. “This” means “to hook” the weeds to “make it easy to mow”.
[<< 0011 >>]
Fig. << 6 >> is an enlarged view of << the part >> showing the curve << state >> of the blade portion of the weeding sickle according to the present invention. As shown in the figure, the tip 2 << c >> of the blade 2b is more curved from the << vertical line >> 6 than the extension 2a of the blade 2b.
[<< 0012 >>]
The figure << 7 >> is a << cross section >> figure taken along the <<A->> A line of the figure << middle >>. The “upper surface” 7 of the blade portion 2 b is “inclined”, and the tip 5 c of the convex portion 5 a is “pointed”. << And >>, the blade part 2b << itself >> is curved and << spirals >>.
[<< 0013 >>]
Figure << 8 >> is a front view of << Other >><< Example >> of the weeding sickle according to the present invention, Figure << 9 >> is a rear view of another Example of the weeding sickle according to the present invention, and Figure << 10 >> is A right side view of another embodiment of the weeding sickle according to the present invention, FIG. 11 is a partially enlarged view of another embodiment of the weeding sickle according to the present invention.
[<< 0014 >>]
In the weeding sickle 1a of this example, the extension part 2a of the blade member 2 is << short >> and the << part >> of the blade part 2b is slightly << large >> formed.
[<< 0015 >>]
<< Also >> The handle 3 is << long >> and << the top >> of the << grip >> part 3b is << prepared >> with an << anti-slip >> part 3a with a diameter slightly larger than the << diameter >> of the grip part 3b. >> is also provided with << slip >> stopper 3 << cb >> below << large >> diameter from grip part 3b.
[<< 0016 >>]
As shown in FIG. 10, the blade portion 2b of the weeding sickle 1a of this example is also curved in a spiral shape like the weeding sickle 1 shown in FIGS.
[<< 0017 >>]
Like this, it is easier and easier to “uproot” the weeds that are “grown” on the “lawn etc.” by making the tip curve spirally.
[<< 0018 >>]
[0058]
In Example 7, it can be seen that “spiral” first appears in paragraph 0012. In paragraph 0015, it can be understood that the “anti-slip portion” is important. Also, in paragraph 0017, the interesting expression “uproot” appears for the first time.
[0059]
- An example of applying Method 2 to English text is described. Here, both the extraction unit and the detection area unit are words. Stemming was not performed, and words were recognized as strings separated by spaces.
[0060]
Example 8
<<In the PATENT task of NTCIR-3, we participated in>> the <<optional task,>> <<where>> the <<participants can perform any kind>> of <<research related to>> <<patents. We think that>> in <<a>> PATENT <<attempt,>> the optional task <<is very>> <<interesting, because>> we <<have already heard>> that <<some>> participants in <<previous contests wanted>> to <<make their studies as freely>> as <<they>> <<wanted. Various new ideas or>> new <<topics will come up>> in <<an>> optional <<task. These attempts would be novel and valuable.>> In the <<other>> <<contests, too,>> we <<hope>> that <<such>> attempts will be <<made.>>
In <<this contest,>> we <<made>> the <<following three>> studies <<for>> the optional task of <<PATENT.>>
We <<extracted rewriting rules using data>> of patents.
We <<aligned>> the <<claim>> of a <<patent>> and <<its embodiment.>>
We extracted <<differences among plural claims>> in a <<patent.>>
<<The first two>> topics <<were given by organizers>> of PATENT as <<examples>> of the optional task. We <<consider these>> studies to be very <<interesting.>> The <<last topic>> is <<our idea.>> We <<sometimes write>> a <<patent,>> and <<had>> the <<experience>> of <<wanting>> to <<know>> the <<difference>> of <<claims. So,>> we <<did>> this <<study.>> We have <<been studying natural language processing>> using the <<Unix>> <<command Diff.>> We <<previously proposed ways>> to <<use Diff>> in natural language <<processing.>> The Diff command is very <<suitable>> for <<doing>> the <<above>> three <<studies.>> We have already extracted rewriting rules by using Diff in some research <<topics. For example,>> we <<used>> a <<pair>> of <<definition sentences having>> the <<same word entry>> in two <<different>> <<dictionaries>> and extracted the differences <<between them.>> These extracted differences can be used as <<synonym phrases>> because the definition sentences in the same entry have the same <<meaning.>> In <<another situation,>> we used aligned <<spoken-language>> and <<written-language texts>> These extracted differences can be used as rewriting rules <<transforming>> spoken-language sentences <<into>> written-language sentences or transforming written-language sentences into spoken-language <<sentences.>> Diff can <<also>> be used for <<alignment.>> Diff <<has>> a <<function>> of <<merging>> data <<like>> a <<DP-matching algorithm. So>> we can <<align>> two related texts by using Diff. In this <<study,>> we used this function for the <<alignment>> of a patent claim and its <<embodiment (working>> <<example). Finally,>> we used Diff for <<extracting>> the differences of patent claims. <<Extracting>> differences is an <<original>> function of Diff. Extracting differences between claims <<enables us>> to <<understand>> the claims of a patent <<more deeply.>>
[0061]
In Example 8, the main words of each item are highlighted in the itemized section in the middle. That is, from the highlighted parts (<<extracted rewriting rules using data>>; <<aligned>> the <<claim>> of a <<patent>> and <<its embodiment.>>; and <<differences among plural claims>>), the main point of each item can be easily understood.
[0062]
In the last paragraph, it can be seen that the discussion of Diff begins. In addition, key phrases such as <<definition sentences>>, <<synonym phrases>>, <<spoken-language>>, <<written-language texts>>, and <<DP-matching algorithm>> stand out immediately, which is convenient for understanding the content.
[0063]
(3): Description of a document difference detection apparatus provided with a user dictionary
Each user prepares a user dictionary in advance, and items registered in the dictionary are not highlighted. As a result, unimportant expressions are excluded from highlighting beforehand, which makes the output easier to read.
[0064]
FIG. 5 is an explanatory diagram of a document difference detection apparatus provided with a user dictionary. In FIG. 5, the document difference detection apparatus is provided with an input means 1, an extraction means 2, an extract storage device 3, an output means 4, and a user dictionary 5. The input means 1 is a device for inputting information, such as a keyboard, a mouse, or a reading device. The extraction means 2 extracts the differences between the input documents. The extract storage device 3 is an extract storage means that stores extracted items such as words, kanji, and noun phrases. The output means 4 is a device for outputting information, such as a display device or a printer. The user dictionary 5 is a dictionary registered in advance by each user.
[0065]
(1): Explanation of Method 1 for providing a user dictionary
FIG. 6 is a flowchart of the document difference detection process of Method 1 for providing a user dictionary. The process is described below following steps S21 to S23-2 of FIG. 6.
[0066]
S21: The extraction unit and the detection area unit are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit that is output as a difference; it may be “word”, “kanji”, “noun phrase”, or the like. The detection area unit is the unit of the areas that are compared in order to detect differences; possible units include “character”, “word”, “sentence”, “bullet item”, and “paragraph”.
[0067]
S22: The extraction means 2 stores all input data in the storage means (in the extraction means 2).
[0068]
S23: The extraction means 2 examines the input data from the left (that is, from the beginning) and repeats the following steps S23-1 and S23-2 for each detection area determined in step S21, starting with the leftmost detection area.
[0069]
S23-1: The extraction means 2 extracts all items (for example, words) corresponding to the extraction unit from all areas other than the current detection area and stores them in the extract storage device 3.
[0070]
S23-2: The extraction means 2 outputs the text of the current detection area to the output means 4, highlighting every item (for example, a word) corresponding to the extraction unit in the current detection area that is stored neither in the extract storage device 3 nor in the user dictionary 5.
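Steps S21 to S23-2 differ from plain Method 1 only in the extra dictionary check at output time. A minimal Python sketch follows, assuming whitespace splitting as a stand-in tokenizer; the function name `method1_with_dictionary` and the sample data are hypothetical.

```python
def method1_with_dictionary(areas, user_dictionary, left="<<", right=">>"):
    """Method 1, but words registered in the user dictionary are never highlighted."""
    dictionary = set(user_dictionary)                 # S21: registered in advance
    tokenized = [area.split() for area in areas]      # S22: hold all input data
    results = []
    for i, tokens in enumerate(tokenized):            # S23: each detection area
        # S23-1: words found in every area other than the current one.
        stored = {t for j, other in enumerate(tokenized) if j != i for t in other}
        # S23-2: highlight only words in neither storage nor the dictionary.
        marked = [f"{left}{t}{right}"
                  if t not in stored and t not in dictionary else t
                  for t in tokens]
        results.append(" ".join(marked))
    return results

if __name__ == "__main__":
    areas = ["the blade is wavy", "the grip is long"]
    for line in method1_with_dictionary(areas, {"is", "long"}):
        print(line)
```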
[0071]
(2): Explanation of Method 2 for providing a user dictionary
FIG. 7 is a flowchart of the document difference detection process of Method 2 for providing a user dictionary. The process is described below following steps S31 to S32-2 of FIG. 7.
[0072]
S31: The extraction unit and the detection area unit are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit that is output as a difference; it may be “word”, “kanji”, “noun phrase”, or the like. The detection area unit is the unit of the areas that are compared in order to detect differences; possible units include “character”, “word”, “sentence”, “bullet item”, and “paragraph”.
[0073]
S32: Input data is input from the input means 1 for each detection area determined in the process S31, and the extraction means 2 repeats the following processes S32-1 and S32-2.
[0074]
S32-1: The extraction means 2 outputs the text of the current detection area to the output means 4, highlighting every item (for example, a word) corresponding to the extraction unit in the current detection area that is stored neither in the extract storage device 3 nor in the user dictionary 5. The extract storage device 3 is initially empty.
[0075]
S32-2: The expression highlighted in the process S32-1 is stored in the extract storage device 3.
[0076]
(3): Explanation of Method 2 (another realization method) for providing a user dictionary
FIG. 8 is a flowchart of the document difference detection process of Method 2 (another realization method) for providing a user dictionary. The process is described below following steps S41 to S43-2 of FIG. 8.
[0077]
S41: The extraction unit and the detection area unit are determined in advance via the input means 1 or the like, and the user dictionary 5 is registered. The extraction unit is the unit that is output as a difference; it may be “word”, “kanji”, “noun phrase”, or the like. The detection area unit is the unit of the areas that are compared in order to detect differences; possible units include “character”, “word”, “sentence”, “bullet item”, and “paragraph”.
[0078]
S42: The extraction means 2 stores all the contents of the user dictionary 5 in the extract storage device 3.
[0079]
S43: Input data is input from the input means 1 for each detection area determined in step S41, and the extraction means 2 repeats the following steps S43-1 and S43-2.
[0080]
S43-1: The extraction means 2 outputs the text of the current detection area to the output means 4, highlighting every item (for example, a word) corresponding to the extraction unit in the current detection area that is not stored in the extract storage device 3.
[0081]
S43-2: The expression highlighted in the process S43-1 is stored in the extract storage device 3.
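The two realizations of Method 2 with a user dictionary (FIG. 7 and FIG. 8) can be compared with the following Python sketch. It is only an illustration under stated assumptions: both function names are hypothetical, whitespace splitting stands in for the morphological analyzer, and the sample sentences and dictionary are invented for the example.

```python
def method2_separate_dictionary(areas, user_dictionary):
    """FIG. 7 style: storage starts empty; the dictionary is checked separately."""
    stored, dictionary, results = set(), set(user_dictionary), []
    for area in areas:                                         # S32
        tokens = area.split()
        # S32-1: new words are those in neither storage nor the dictionary.
        new_words = {t for t in tokens if t not in stored and t not in dictionary}
        results.append(" ".join(f"<<{t}>>" if t in new_words else t for t in tokens))
        stored.update(new_words)                               # S32-2
    return results

def method2_preloaded_dictionary(areas, user_dictionary):
    """FIG. 8 style: the dictionary is copied into storage first (S42)."""
    stored, results = set(user_dictionary), []                 # S42
    for area in areas:                                         # S43
        tokens = area.split()
        new_words = {t for t in tokens if t not in stored}     # S43-1
        results.append(" ".join(f"<<{t}>>" if t in new_words else t for t in tokens))
        stored.update(new_words)                               # S43-2
    return results

if __name__ == "__main__":
    areas = ["this study converts passive sentences",
             "passive sentences are useful in this field"]
    dictionary = {"this", "are", "in"}
    print(method2_separate_dictionary(areas, dictionary))
    print(method2_preloaded_dictionary(areas, dictionary))
```

In this sketch both variants produce the same output; preloading the dictionary simply removes the second membership test from the inner loop.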
[0082]
- When no user dictionary is used, a text example (Example 9) is as follows. Here, Method 2 is used, and both the extraction unit and the detection area unit are words.
[0083]
Example 9
《The purpose of this study is to use ‘changed’ and ‘machine particle’ as ‘changed’ in ‘passive sentence’ in ‘Japanese’ and ‘converting’ to ‘active’ sentence. Automatically. >> “Examples” of Japanese passive sentences and usage sentences are given in “Figure 1” and “Figure 2”. The Japanese 《suffix》《same》 in the sentence of Fig. 1 is 《Passive》.
《Present auxiliary verb》 is 《Yes》, and 《This》 sentence is passive. The Japanese suffix “<< se” ta ”in the sentence in Fig. 2 is an auxiliary verb indicating a working part, and this sentence is a working sentence. Figure << 3 >> shows an active sentence that corresponds to << the >> sentence. When the sentence << in Figure 1 is transformed into an active sentence, << (i) >> the case particle "ni" is the case particle "ga", << (ii) >> the case particle "ga" is the case particle " Is converted to. When the sentence in Fig. 2 is converted to an active sentence, (i) the << part >> of the case particle "ga", << the clause >> of << he >> is << erased >>, and (ii) the case particle "ni" Is converted to the case particle "ga", and << (iii) >> In this study, conversion of these case particles <<(>> example <<: >> case << auxiliary >><< verb >> conversion of the case particle "ga""to"<<)>> and elimination of the << unnecessary >> part ( Example: Erasing “He is” is the “subject” of the study. (From 《After》 and 《This paper》《Convenience》《He is》《Erase》《Erase》《Also》《Call》 Conversion of case particles.
The conversion of passive sentences and active sentences into active sentences is performed by using the sentence << Generation >>, << paraphrase >>, sentence << simplification / language >><< operation support >>, << natural >> language sentence << from >><< knowledge acquisition and information extraction 》, 《Question Answering System》 and 《Many》 Research 《Fields》 are 《Useful》. << For example >>, in the question answering system, the question sentence is a << noh >><< motion >> sentence and the << answer >> is << written >> with a << passive >> sentence. There are cases where it is “difficult” to “take out” an answer to a question because the “structure” of the sentence is “different”. These “like” “questions” and “titles” also become “resolve” and “solve” so that “passive” and active sentences can be converted into active sentences. In this way, the conversion of passive sentences and usage sentences into active sentences is “important” in the natural language “processing”.
[0084]
- As the user dictionary, words that appear frequently in the inventor's other papers were registered.
(User dictionary registration example)
,,,,. , In, is, is, is, is, is, is, is, is, like, part, is, data, difference, is, this is Murata, is, "research can, diff,) , Correspondence, well, system, processing, language, (, file, use, thing
and so on. Note that word delimiters in the user dictionary registration example are represented by “,”.
[0085]
In this case, the result of Example 9 is as follows.
The “Purpose” of the “Book” study is “Changed” and “Critical particles” to be “changed” when “converting” a “passive sentence” and “usefulness” sentence into an “active” sentence in “Japanese”. It is an “automatic” conversion using machine learning. An example of a passive sentence and a usage sentence in Japanese is given in Figure 1 and Figure 2. The Japanese “suffix” and “re” in the sentence in Figure 1 are “passive”.
《Present auxiliary verb》 is 《Yes》, and 《This》 sentence is passive. The Japanese suffix “<< se” ta ”in the sentence in Fig. 2 is an auxiliary verb indicating a working part, and this sentence is a working sentence. Figure << 3 >> shows an active sentence that corresponds to << the >> sentence. When the sentence in Fig. 1 is transformed into an active sentence, << (i) >> the case particle "ni" becomes the case particle "ga"<< (ii) >> the case particle "ga" becomes the case particle "" Converted. When the sentence in Fig. 2 is converted to an active sentence, (i) the part of the case particle "ga""<< he >>"<< clause >> is << erased >> and (ii) the case particle "ni" is The particle “ga” is converted into << (iii) >> and the case particle “ha” is converted into << less >><< remains as it is >>. In this study, conversion of these case particles (example <<: >> case << help >><< noun >> conversion of the case particle "ga"<< to ") and removal of the << unnecessary >> part (example:" he Is the “subject” of the study. (In 《After》 and 《This paper》, 《Convenience》 is also called `` Conversion of case particles ''.
The conversion of passive sentences and active sentences into active sentences is performed by using the sentence << Generation >>, << paraphrase >>, sentence << simplification / >> language << operation support >>, << natural >> language sentence << from >><< knowledge acquisition and information extraction 》, 《Question Answering》 system and 《Many》 research 《Fields》. << For example >> In a question answering system, a question sentence is a << noh >><< motion >> sentence, and an << answer >> is a << passive >> sentence << written >>, a question sentence and an answer << including >> In some cases, it is difficult to “take out” the answer to a question because the “structure” of the sentence is “different”. Such “questions” and “titles” are also “solved” and “resolved” so that passive sentences and usage sentences can be converted into active sentences. In this way, the conversion of passive sentences and usage sentences into active sentences is «important» in natural language processing.
[0086]
Although the result does not change greatly, less important expressions such as the first “research” and “is” are no longer highlighted, so the output is somewhat easier to read. By registering more unimportant words in the user dictionary 5, the output can be made still easier to read.
[0087]
In the above-described embodiment, description has been given with double angle brackets as the highlight display, but other highlight displays such as underline, color coding, background change, font change, and blinking can also be performed.
[0088]
This approach can also be used to study the problem of new and old information in anaphora analysis. Only Method 2 can be used for this problem; Method 1 cannot. Method 2 highlights newly appearing expressions, and in linguistics something that newly appears in a text is called “new information”. Method 2 therefore extracts new information in the linguistic sense, and its results can be used to study new information. However, in natural language the same thing may be expressed by different expressions, in which case even old information may be highlighted by Method 2 because its wording is new. That is, “new information” and “old information” are not always distinguished correctly. Nevertheless, Method 2 is useful for studying “new information” and “old information”.
[0089]
Furthermore, by setting the extraction unit to kanji, the appearance of new kanji can be easily grasped, for example in school education. For kanji, unlike words, no morphological analysis means is required, because the characters can be compared by their kanji codes.
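With kanji as the extraction unit, Method 2 reduces to a comparison of character codes. The Python sketch below assumes that a kanji is any character in the Unicode CJK Unified Ideographs block (U+4E00 to U+9FFF); this range, the function names, and the sample sentences are assumptions made for the illustration, and the specification itself speaks only of “kanji codes”.

```python
def is_kanji(ch):
    # Assumption: treat the CJK Unified Ideographs block as "kanji".
    return "\u4e00" <= ch <= "\u9fff"

def highlight_new_kanji(areas, left="<<", right=">>"):
    """Method 2 with kanji as the extraction unit; no tokenizer is needed."""
    seen = set()                       # storage, initially empty
    results = []
    for area in areas:
        # Kanji in this area that did not occur in any preceding area.
        new_kanji = {ch for ch in area if is_kanji(ch) and ch not in seen}
        results.append("".join(f"{left}{ch}{right}" if ch in new_kanji else ch
                               for ch in area))
        seen.update(new_kanji)
    return results

if __name__ == "__main__":
    for line in highlight_new_kanji(["学校へ行く", "学生が行く"]):
        print(line)
```

Hiragana and other non-kanji characters fall outside the assumed range and are passed through unmarked.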
[0090]
(4): Explanation of program installation
The input means 1, the extraction means 2, the extract storage device 3, the output means 4, the user dictionary 5, the extraction/detection area setting means 21, and the like can be implemented as a program that is stored in main memory and executed by a main control unit (CPU). This program is generally processed by a computer. The computer is composed of hardware such as a main control unit, a main memory, a file device, a display device, and an input device serving as input means, such as a keyboard. The program of the present invention is installed on this computer. For installation, the program is stored on a portable recording (storage) medium such as a floppy disk or a magneto-optical disk and is installed in a file device of the computer through a drive device, provided in the computer, for accessing the recording medium, or it is installed in the file device via a network such as a LAN. The program steps necessary for processing are then read from the file device into the main memory and executed by the main control unit.
[0091]
[Effects of the invention]
As described above, the present invention has the following effects.
[0092]
(1): The extraction unit extracts all the extracted unit corresponding to the extraction unit from the region other than the current detection region of the input document data and stores it in the storage unit. Easily extract and display document features and differences, which are new information, in order to repeat for each detection area the output corresponding to the extraction unit that is not stored and highlight and output the document in the current detection area Can do.
[0093]
(2): The extraction means highlights the items in the current detection area of the input document data that correspond to extraction units not stored in the storage means, outputs the document of the current detection area, and stores the highlighted items in the storage means, repeating this for each detection area. Display units (for example, words) corresponding to newly appearing extraction units can therefore be extracted and displayed easily.
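The processing described in effect (2), which corresponds to Method 2, can be sketched in the same style. Again this is only a sketch under assumptions: `highlight_new_units` is a hypothetical name, whitespace splitting stands in for morphological analysis, and square brackets stand in for highlighting.

```python
def highlight_new_units(areas, tokenize=str.split):
    """Method 2 sketch: highlight units not seen in any earlier area.

    `areas` is a list of strings, one per detection area, in document order.
    """
    stored = set()   # units highlighted in earlier detection areas
    results = []
    for area in areas:
        current = tokenize(area)
        new_units = {t for t in current if t not in stored}
        # Highlight the units of the current area that are not yet stored.
        results.append(
            " ".join(f"[{t}]" if t in new_units else t for t in current)
        )
        # Store the highlighted units after the area has been output.
        stored.update(new_units)
    return results
```

Because the storage means is updated only after an area has been output, a unit that occurs twice within the area where it first appears is marked both times. Unlike Method 1, the result depends on the order of the detection areas.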
[0094]
(3): Since data of extraction units that are not to be highlighted is stored in the storage means in advance, less important expressions are prevented from being highlighted, and the output becomes easier to read.
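The effect of storing such data in advance can be sketched by seeding the storage means before processing begins. The sketch below applies this to the Method 2 style of processing; the name `highlight_new_units_with_dictionary` and the parameter `user_dictionary` are hypothetical, whitespace splitting is an assumed stand-in for morphological analysis, and square brackets stand in for highlighting.

```python
def highlight_new_units_with_dictionary(areas, user_dictionary, tokenize=str.split):
    """Method 2 sketch with units stored in advance so they are never highlighted.

    `user_dictionary` is an iterable of units (for example, function words)
    that should not be highlighted even on their first appearance.
    """
    stored = set(user_dictionary)   # pre-stored units, treated as already seen
    results = []
    for area in areas:
        current = tokenize(area)
        new_units = {t for t in current if t not in stored}
        results.append(
            " ".join(f"[{t}]" if t in new_units else t for t in current)
        )
        stored.update(new_units)
    return results
```

With `user_dictionary` set to `{"the", "a", "has"}`, the areas `"the device has a blade"` and `"the device has a handle"` yield marks on `device` and `blade` in the first area and on `handle` in the second, while the function words stay unmarked.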
[0095]
(4): Since the extraction unit is a word unit, a newly appearing word can be extracted and displayed.
[0096]
(5): Since the unit of the detection area is an item of an itemized (bulleted) list, the differences between the items can be easily understood.
[0097]
(6): Since the unit of the detection area is a claim, the features of the claims and the differences between them can be easily understood.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is an explanatory diagram of a document difference detection apparatus according to an embodiment.
FIG. 3 is a flowchart of the document difference detection processing of Method 1 in the embodiment.
FIG. 4 is a flowchart of the document difference detection processing of Method 2 in the embodiment.
FIG. 5 is an explanatory diagram of a document difference detection apparatus provided with a user dictionary in the embodiment.
FIG. 6 is a flowchart of the document difference detection processing of Method 1 when a user dictionary is provided in the embodiment.
FIG. 7 is a flowchart of the document difference detection processing of Method 2 when a user dictionary is provided in the embodiment.
FIG. 8 is a flowchart of the document difference detection processing of Method 2 (another realization method) when a user dictionary is provided in the embodiment.
[Explanation of symbols]
2 Extraction means
3a Storage means
21 Extraction / detection area setting means

Claims

1. A document difference detection apparatus comprising:
input means for inputting information;
extraction / detection area setting means in which an extraction unit, which is the unit to be output as a difference between document data, and a detection area, which is the unit of the area to be compared in order to detect a difference between document data, are set by the input means;
storage means for storing information; and
extraction means,
wherein the extraction means repeats, for each detection area, extracting all items corresponding to the extraction unit from the areas of the input document data other than the current detection area and storing them in the storage means, then highlighting the items in the current detection area that correspond to extraction units not stored in the storage means and outputting the document of the current detection area.

2. A document difference detection apparatus comprising:
input means for inputting information;
extraction / detection area setting means in which an extraction unit, which is the unit to be output as a difference between document data, and a detection area, which is the unit of the area to be compared in order to detect a difference between document data, are set by the input means;
storage means for storing information; and
extraction means,
wherein the extraction means repeats, for each detection area, highlighting the items in the current detection area of the input document data that correspond to extraction units not stored in the storage means, outputting the document of the current detection area, and storing the highlighted items in the storage means.

3. The document difference detection apparatus according to claim 1, wherein data of extraction units that are not to be highlighted is stored in advance in the storage means.

4. The document difference detection apparatus according to claim 1, wherein the extraction unit is a word unit.

5. The document difference detection apparatus according to claim 1, wherein the unit of the detection area is an item of an itemized (bulleted) list.

6. The document difference detection apparatus according to claim 1, wherein the unit of the detection area is a claim.

7. A program for causing a computer to function as:
extraction / detection area setting means in which an extraction unit, which is the unit to be output as a difference between document data, and a detection area, which is the unit of the area to be compared in order to detect a difference between document data, are set by input means; and
extraction means that repeats, for each detection area, extracting all items corresponding to the extraction unit from the areas of the input document data other than the current detection area and storing them in storage means, then highlighting the items in the current detection area that correspond to extraction units not stored in the storage means and outputting the document of the current detection area.

8. A program for causing a computer to function as:
extraction / detection area setting means in which an extraction unit, which is the unit to be output as a difference between document data, and a detection area, which is the unit of the area to be compared in order to detect a difference between document data, are set by input means; and
extraction means that repeats, for each detection area, highlighting the items in the current detection area of the input document data that correspond to extraction units not stored in storage means, outputting the document of the current detection area, and storing the highlighted items in the storage means.