JP4046221B2

JP4046221B2 - Document processing device

Info

Publication number: JP4046221B2
Application number: JP2002258596A
Authority: JP
Inventors: 哲郎長束
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-09-04
Filing date: 2002-09-04
Publication date: 2008-02-13
Anticipated expiration: 2022-09-04
Also published as: JP2004094855A

Description

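[0001]
[Technical Field of the Invention]
The present invention relates to a document processing device, and more specifically to a document processing device that helps a user explore, by searching and narrowing down, the concept expressions contained in a document or document set and discover the concept expressions the user needs.
[0002]
[Prior Art]
In recent years information has increasingly been digitized, and documents that used to be kept on paper are now stored electronically. As a result, large volumes of electronic documents circulate, and how to manage and easily reuse the collected and accumulated documents has become an important problem. This calls for analysis techniques that derive findings from large volumes of electronic documents. Such techniques group a large document collection by content while grasping the content of each group. Two approaches have been proposed: a user-driven approach, in which the user enters definition expressions for the groups and grouping is performed with search techniques, and an automatic approach, which uses clustering based on the co-occurrence of words and phrases within documents.
[0003]
As a conventional grouping technique, the "data analysis system" of Patent Document 1, for example, has been proposed. This system extracts concepts (keywords within clauses) from text and uses a purpose-specific category dictionary (thesaurus) to convert expressions in the documents into labeled data. It also analyzes the dependency relations between concepts, treats combinations of concepts as concepts in their own right, and extracts characteristic concepts from concept frequencies and cross tabulations.
[0004]
The document processing device of Patent Document 2 has also been proposed. It comprises: a morphological analysis unit that performs morphological analysis on an input sentence; a specific-expression candidate acquisition unit that takes subsequences of the morpheme sequence as weighted specific-expression candidates; a specific-expression dictionary that stores a number of specific expressions in advance; a specific-expression dictionary search unit that outputs, as the dictionary search result, a real number representing how well a morpheme sequence matches the expressions in the dictionary; a discriminant analysis execution unit that computes a discriminant score for each candidate, using as variables the weight assigned to the candidate and the candidate's dictionary search result, and excludes candidates whose score falls below a fixed value; and an output unit that outputs, as specific expressions, the morpheme strings of the candidates that were not excluded. Its aim is to decide accurately, by computing the discriminant score, whether each candidate should be kept.
[0005]
In other words, this prior art performs word analysis and dependency analysis on text, groups sentences by the similarity of their sentence structure, and extracts strongly correlated items from the correlation between the occurrence counts of keywords extracted from the text and the grouped sentences.
[0006]
[Patent Document 1]
JP 2001-75966 A
[Patent Document 2]
JP 2000-172691 A
[0007]
[Problems to Be Solved by the Invention]
However, such prior art needed improvement in letting the user freely explore the concepts contained in a document or document set and thereby discover the concepts that matter for analyzing the document data.
[0008]
Specifically, the prior art of Patent Document 1 extracts concepts from text, assigns category labels based on a category dictionary, analyzes the dependency relations between concepts, treats combinations of concepts as concepts, and extracts characteristic concepts from concept frequencies and cross tabulations. A category dictionary must therefore be created in advance, and creating and maintaining it is a burden. Moreover, because characteristic concepts are extracted from frequency information, concepts that do not occur often cannot be extracted.
[0009]
Likewise, the prior art of Patent Document 2 performs word analysis and dependency analysis on text, groups sentences by the similarity of their sentence structure, and extracts strongly correlated items from the correlation between the occurrence counts of keywords extracted from the text and the grouped sentences. It therefore also cannot extract items that do not occur often.
[0010]
In the analysis of document data, however, and particularly of data such as questionnaires, even a concept that is not characteristic, for example one that occurs infrequently, may be one the user needs for the analysis. Whether such a concept is needed depends on the user's intent and purpose, so it is difficult to extract automatically. Supporting the discovery of such concepts therefore requires helping the user freely explore and discover the concepts contained in a document or document set.
[0011]
Accordingly, an object of the present invention is to provide a document processing device that helps a user explore, by searching and narrowing down, the concept expressions contained in a document or document set and discover the concept expressions the user needs.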
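[0012]
[Means for Solving the Problems]
The invention of claim 1 is a document processing device comprising: language analysis means that performs morphological analysis and dependency analysis on input document data; document data structure generation means that, based on the analysis result of the language analysis means, generates a document data structure holding the linguistic information of the document data and, when a specific independent word or attached word appears in a clause, assigns to the words or clauses in the document data structure a semantic tag indicating negation, request, question, or possibility according to the word that appeared; document data structure storage means that stores the tagged document data structure generated by the document data structure generation means; concept expression specification means that accepts a concept expression consisting of a combination of any number of user-specified words and semantic tags of words or clauses in the document data structure; and concept expression extraction means that extracts, from the document data structure stored in the document data structure storage means, concept expressions related to the concept expression accepted by the concept expression specification means.
[0013]
The invention of claim 2 is the document processing device of claim 1, further comprising concept expression display means that displays, based on the document data structure, the concept expressions available for selection, wherein the concept expression specification means accepts the specification of any concept expression from those displayed on the concept expression display means, and the concept expression display means displays the extraction result of the concept expression extraction means.
[0014]
The invention of claim 3 is the document processing device of claim 1 or 2, further comprising a synonym dictionary, wherein the document data structure generation means generates the document data structure by adding, based on the synonym dictionary, representative-notation information to words that are synonymous but written differently.
[0015]
The invention of claim 4 is the document processing device of claim 2 or 3, further comprising means that holds a history of the concept expressions accepted by the concept expression specification means, wherein, based on the history, the extraction result for a concept expression specified in the past is redisplayed on the concept expression display means.
[0016]
The invention of claim 5 is the document processing device of any one of claims 2 to 4, wherein the concept expression display means displays the document data, or a part of the document data, that contains the concept expression accepted by the concept expression specification means.
[0017]
The invention of claim 6 is the document processing device of any one of claims 1 to 5, wherein the concept expression extraction means checks whether each document data structure stored in the document data structure storage means matches the concept expression accepted by the concept expression specification means (hereinafter, the specified concept expression), extracts from a matching document data structure the words that are in a dependency relation with the words of the specified concept expression, extracts the semantic tags that are not in the specified concept expression, and creates concept expressions containing the extracted words and semantic tags.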
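[0018]
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Because the embodiments described below are preferred embodiments, they carry various technically preferable limitations; the scope of the invention is nevertheless not limited to these forms unless the following description states that it specifically limits the invention.
[0019]
FIGS. 1 to 14 show one embodiment of the document processing device, document processing method, and recording medium of the present invention. FIG. 1 is a block diagram of a document processing device 1 to which this embodiment is applied.
[0020]
In FIG. 1, the document processing device 1 comprises a document input unit 2, a language analysis unit 3, a document data structure generation unit 4, a document data structure storage unit 5, a concept expression extraction unit 6, a concept expression specification unit 7, a concept expression display unit 8, and so on. It is built by, for example, having a computer read and install a recording medium such as a CD-ROM (Compact Disc Read Only Memory) on which the document processing program and the necessary data are recorded.
[0021]
The document input unit (document input means) 2 inputs the documents to be processed. It can accept multiple documents, in which case it assigns an identifier to each document and stores and manages the documents in a storage unit or the like. The following description assumes that a document set is input from the document input unit 2.
[0022]
The language analysis unit (language analysis means) 3 executes processing steps such as morphological analysis, which analyzes the morphemes of the document set input from the document input unit 2, and dependency analysis, which analyzes the dependency relations between the clauses (bunsetsu) of each document, and outputs the resulting linguistic attributes to the document data structure generation unit 4 per unit of analysis. Specifically, morphological analysis identifies the words contained in each document of the set, and dependency analysis analyzes the sentences and clauses in a document and identifies which clauses stand in a dependent-head relation. For example, to analyze the sentence 「ソフトウェアのインストールが正常に実行できない。」 ("The software installation cannot be executed normally."), the language analysis unit 3 performs morphological analysis and then dependency analysis, as shown in FIG. 2. FIG. 2 shows the analysis result for this sentence: "/" marks word boundaries, and above each word 「自」 marks an independent word and 「付」 an attached word. In FIG. 2, clause 1 (the independent word 「ソフトウェア」 with the attached word 「の」) depends on clause 2 (the independent word 「インストール」 with the attached word 「が」); clause 3 (the independent word 「正常」 with the attached word 「に」) depends on clause 4 (the independent word 「実行」 with the two attached words 「でき」 and 「ない」); and clause 2 in turn depends on clause 4.
[0023]
The document data structure generation unit (document data structure generation means) 4 converts each document of the set into a data structure like that of FIG. 3 based on the analysis result of the language analysis unit 3; each constituent holds information like that shown in FIG. 4. The unit also generates a word list, such as the one in FIG. 5, that assigns a unique identifier to each word contained in the document or document set, and uses it to manage the words. In doing so it computes and attaches part-of-speech information and the overall occurrence frequency or the number of documents in which the word occurs.
[0024]
In the data structure of FIG. 3, as FIG. 4 shows, a document manages the list of IDs of the sentences it contains, and a sentence manages its own sentence ID and the list of clauses it contains. A clause manages its own clause ID, the list of IDs of the words it contains, a dependent-clause ID list, and a head-clause ID. A word ID is an ID in the word list of FIG. 5. The dependent-clause ID list holds the IDs of the clauses that depend on the clause in question; because several clauses can depend on one head clause, they are managed as a list. The head-clause ID is the ID of the clause on which the clause in question depends; a dependent clause can have only one head clause.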
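The document, sentence, and clause records described above can be sketched as plain data classes. This is only an illustration under stated assumptions: the class and field names, the `add_dependency` helper, the use of a dictionary keyed by clause ID, and the word IDs are all hypothetical, since the source defines these structures only through FIGS. 3 to 5. The example encodes the sentence from FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Clause:
    clause_id: int
    word_ids: List[int]                                      # IDs into the word list (FIG. 5)
    dependent_ids: List[int] = field(default_factory=list)   # clauses that depend on this one
    head_id: Optional[int] = None                            # the single clause this one depends on


@dataclass
class Sentence:
    sentence_id: int
    clauses: Dict[int, Clause]                               # clause ID -> clause (assumed layout)


@dataclass
class Document:
    doc_id: int
    sentence_ids: List[int]


def add_dependency(sentence: Sentence, dependent_id: int, head_id: int) -> None:
    """Record that clause `dependent_id` depends on clause `head_id`."""
    dependent = sentence.clauses[dependent_id]
    if dependent.head_id is not None:
        raise ValueError("a dependent clause can have only one head clause")
    dependent.head_id = head_id
    sentence.clauses[head_id].dependent_ids.append(dependent_id)


# Hypothetical word IDs for the FIG. 2 sentence.
WORDS = {1: "ソフトウェア", 2: "の", 3: "インストール", 4: "が",
         5: "正常", 6: "に", 7: "実行", 8: "でき", 9: "ない"}

sentence = Sentence(1, {
    1: Clause(1, [1, 2]),       # ソフトウェア/の
    2: Clause(2, [3, 4]),       # インストール/が
    3: Clause(3, [5, 6]),       # 正常/に
    4: Clause(4, [7, 8, 9]),    # 実行/でき/ない
})
for dep, head in [(1, 2), (2, 4), (3, 4)]:
    add_dependency(sentence, dep, head)
document = Document(1, [sentence.sentence_id])
```

Keeping both the dependent list and the single head ID on each clause mirrors the text: a head can collect several dependents, while each dependent points to at most one head.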
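[0025]
As information managed by a clause, the document data structure generation unit 4 can also hold the type of dependency relation, for example whether it is adnominal modification (連体修飾) or adverbial modification (連用修飾), and can describe the relation type by the kind of particle that links the clauses.
[0026]
Furthermore, the document data structure generation unit 4 has a synonym dictionary and can attach representative-notation information to words that have synonyms. This can be implemented by including a synonym representative notation as a field of the word list, as shown in FIG. 5.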
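A word list of the kind FIG. 5 describes (ID, part of speech, occurrence frequency, number of documents, synonym representative notation) might be built as follows. This is a minimal sketch: the function name, the dictionary layout, the part-of-speech labels, and the sample synonym entry are assumptions made for illustration, not details taken from the source.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (surface form, part of speech)


def build_word_list(docs: Dict[int, List[Token]],
                    synonyms: Dict[str, str]) -> Dict[str, dict]:
    """Assign a unique ID to each word and accumulate its statistics."""
    word_list: Dict[str, dict] = {}
    for doc_id, tokens in docs.items():
        for surface, pos in tokens:
            entry = word_list.get(surface)
            if entry is None:
                entry = {
                    "id": len(word_list) + 1,
                    "pos": pos,
                    "freq": 0,
                    "doc_ids": set(),
                    # Words without a synonym entry represent themselves.
                    "representative": synonyms.get(surface, surface),
                }
                word_list[surface] = entry
            entry["freq"] += 1
            entry["doc_ids"].add(doc_id)
    return word_list


# Hypothetical input: two tiny documents and one synonym pair.
docs = {
    1: [("ソフト", "noun"), ("インストール", "noun"), ("ソフト", "noun")],
    2: [("ソフトウェア", "noun"), ("インストール", "noun")],
}
word_list = build_word_list(docs, {"ソフト": "ソフトウェア"})
```

Grouping entries by their `representative` field is one way differently written synonyms could later be treated as the same word.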
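[0027]
The document data structure generation unit 4 also assigns semantic tags, which express additional meaning, to the words or clauses in the document data structure based on, for example, the attached-word expressions within a clause, so that the concept expression specification unit 7, the concept expression extraction unit 6, and the concept expression display unit 8 can use semantic tags as well as words in concept expressions. When an attached word or the like in a clause expresses a specific additional meaning, that meaning is attached to the clause as a tag. For example, the semantic tags "negation" (打消), "request" (要望), "possibility" (可能), and "question" (疑問) are attached to a clause when words such as the following appear in it; a single clause may carry several semantic tags. In the following description, "intention tag" (意図タグ) is synonymous with "semantic tag" (意味タグ).
[0028]
Intention tag ID 1 "negation": auxiliary verb 「ない」, auxiliary verb 「ず」, auxiliary verb 「まい」, subsidiary auxiliary verb 「にくい」, adjective 「ない」
Intention tag ID 2 "request": auxiliary verb 「たい」, verb 「欲しい」, conjunctive particle 「て」 + verb 「欲しい」
Intention tag ID 3 "question": sentence-final particle 「か」, sentence-final particle 「か」 + sentence-final particle 「な」, symbol 「？」
Intention tag ID 4 "possibility": subsidiary verb 「できる」, auxiliary verb 「れる」, auxiliary verb 「られる」
In a concept expression, semantic tags are written in a form such as "(+negation+possibility)". A semantic tag can serve as a concept expression on its own, or it can be attached to a word, as in 「実行」(+possibility+negation).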
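The tag rules listed above amount to a lookup from (part of speech, word) pairs to tags. The sketch below assumes that each clause arrives as a list of such pairs in dictionary form; the part-of-speech labels, the English tag names, and the `assign_tags` function are hypothetical. The two-word triggers (「て」+「欲しい」 and 「か」+「な」) are not modeled separately because, in this simplified version, the single-word triggers 「欲しい」 and 「か」 already yield the same tag.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (part of speech, dictionary form)

# Trigger words per tag, following the four rules in the text.
TAG_RULES = {
    "negation": {("aux_verb", "ない"), ("aux_verb", "ず"), ("aux_verb", "まい"),
                 ("aux_verb", "にくい"), ("adjective", "ない")},
    "request": {("aux_verb", "たい"), ("verb", "欲しい")},
    "question": {("final_particle", "か"), ("symbol", "？")},
    "possibility": {("subsidiary_verb", "できる"), ("aux_verb", "れる"),
                    ("aux_verb", "られる")},
}


def assign_tags(clause: List[Morpheme]) -> List[str]:
    """Return every tag whose trigger words occur in the clause."""
    present = set(clause)
    return [tag for tag, triggers in TAG_RULES.items() if present & triggers]


# 「実行できない」 splits into 実行 / できる / ない.
tags = assign_tags([("noun", "実行"),
                    ("subsidiary_verb", "できる"),
                    ("aux_verb", "ない")])
```

A clause can match several rules at once, which is how a clause ends up carrying more than one tag.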
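[0029]
The document data structure storage unit (document data structure storage means) 5 stores and manages the document data structure generated by the document data structure generation unit 4.
[0030]
The concept expression specification unit (concept expression specification means) 7 has an input unit for characters and symbols and a pointing device such as a mouse, and lets the user specify a concept expression. The concept expression display unit (concept expression display means) 8 is a CRT (cathode ray tube), an LCD (liquid crystal display), or the like, and displays the various data the document processing device 1 needs for document processing, such as the concept expressions extracted by the concept expression extraction unit 6 and the screen for specifying concept expressions in the concept expression specification unit 7 (the concept browser).
[0031]
Concept expressions can be specified in the concept expression specification unit 7 by, for example, a direct-entry method, in which an input dialog like that of FIG. 6 is displayed on the concept expression display unit 8 and the user types the concept expression directly, or a selection method, in which a concept display screen (concept browser) is displayed on the concept expression display unit 8 as shown in FIGS. 7 to 12 and the user selects the concept expression to specify from those displayed, using a pointing device such as a mouse. In the examples of FIGS. 7 and 8, the word list is displayed as the initial state and the word the user wants to specify is selected (「受信」, "receive", in FIG. 7). When a word is selected in FIG. 7 or FIG. 8 and the "Narrow down" (絞り込み) button is operated, concept expression extraction is executed and its result appears on the right. With the selection method, the user can keep exploring concept expressions while selecting, on the concept expression display unit 8, the next concept expression to specify, and can therefore specify concept expressions more efficiently than with the direct-entry method of FIG. 6.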
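[0032]
The concept expression extraction unit (concept expression extraction means) 6 extracts, from each document of the set, the concept expressions (words (independent words) or intention tags) that are strongly related to the concept expression specified in the concept expression specification unit 7, computes their frequencies, and sends the extraction result to the concept expression display unit 8 for display.
[0033]
Next, the operation of this embodiment is described. The document processing device 1 of this embodiment processes documents based on the concept expression the user specifies.
[0034]
First, concept expressions, the basic idea of this embodiment, are explained. The documents handled in this embodiment are assumed to be written in Japanese, and a concept expression is expressed in units of words (independent words). A single word can express a concept, and so can a relation among several words. For example, the following concept expressions are used.
[0035]
1) 検索 (search)
2) 情報⇒検索 (information ⇒ search)
3) 情報⇒検索⇒サービス (information ⇒ search ⇒ service)
4) ソフトウェア⇒インストール (+possibility+negation) (software ⇒ install)
Here "⇒" indicates a strong semantic relation between words, meaning that the words (independent words) appear in the same clause or in a pair of clauses that are in a dependency relation. For example, 「情報⇒検索」 means that 「情報」 and 「検索」 appear in the same clause, or in a pair of clauses in a dependency relation, as shown below.
[0036]
Clause: 「情報検索が」
Dependency clause pair: 「情報の」→「検索が」
The direction of "⇒" represents the order in which the words appear. Because reversing the order can change the meaning, word order matters.
[0037]
In a concept expression, any number of words (independent words) can be chained. Example 3) above chains three words, which means that the three words appear consecutively, strongly related, in the order 「情報」, 「検索」, 「サービス」. Accordingly, A to D below match the concept expression of example 3), while E and F do not.
[0038]
A: 「情報の検索サービス」
B: 「情報を検索するサービス」
C: 「情報検索のサービス」
D: 「情報検索サービス」
E: 「情報検索を自動的に行うサービス」
F: 「検索情報のサービス」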
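Whether phrases A to F match 「情報⇒検索⇒サービス」 can be checked with a small recursive matcher. This is a sketch under explicit assumptions: each phrase is hand-segmented into clauses holding only independent words plus the index of the head clause (a real system would obtain these from the language analysis), and "strong relation" is read as "later in the same clause, or anywhere in the head clause". The names `Clause`, `matches`, and `_match` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence


class Clause(NamedTuple):
    words: List[str]        # independent words, in order of appearance
    head: Optional[int]     # index of the clause this one depends on


def _match(clauses: Sequence[Clause], ci: int, wi: int,
           expr: Sequence[str], ei: int) -> bool:
    """True if expr[ei:] can be matched starting at word wi of clause ci."""
    if clauses[ci].words[wi] != expr[ei]:
        return False
    if ei == len(expr) - 1:
        return True
    # Next word appears later in the same clause ...
    if any(_match(clauses, ci, j, expr, ei + 1)
           for j in range(wi + 1, len(clauses[ci].words))):
        return True
    # ... or in the clause that this clause depends on.
    head = clauses[ci].head
    return head is not None and any(
        _match(clauses, head, j, expr, ei + 1)
        for j in range(len(clauses[head].words)))


def matches(clauses: Sequence[Clause], expr: Sequence[str]) -> bool:
    return any(_match(clauses, ci, wi, expr, 0)
               for ci, clause in enumerate(clauses)
               for wi in range(len(clause.words)))


EXPR = ["情報", "検索", "サービス"]
PHRASES = {
    "A": [Clause(["情報"], 1), Clause(["検索", "サービス"], None)],
    "B": [Clause(["情報"], 1), Clause(["検索"], 2), Clause(["サービス"], None)],
    "C": [Clause(["情報", "検索"], 1), Clause(["サービス"], None)],
    "D": [Clause(["情報", "検索", "サービス"], None)],
    "E": [Clause(["情報", "検索"], 2), Clause(["自動的"], 2),
          Clause(["行う"], 3), Clause(["サービス"], None)],
    "F": [Clause(["検索", "情報"], 1), Clause(["サービス"], None)],
}
results = {name: matches(clauses, EXPR) for name, clauses in PHRASES.items()}
```

E fails because the clause containing 「検索」 depends on 「行う」 rather than on 「サービス」, and F fails because 「検索」 precedes 「情報」.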
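Example 4) above shows a concept expression with semantic tags attached. When an attached word or other expression in a clause carries a specific additional meaning, that meaning is used as a tag. For example, the semantic tags "negation", "request", "possibility", and "question" are attached to a clause when words such as the following appear in it. A single clause may carry several semantic tags.
[0039]
Negation: auxiliary verb 「ない」, auxiliary verb 「ず」, auxiliary verb 「まい」, subsidiary auxiliary verb 「にくい」, adjective 「ない」
Request: auxiliary verb 「たい」, verb 「欲しい」, conjunctive particle 「て」 + verb 「欲しい」
Question: sentence-final particle 「か」, sentence-final particle 「か」 + sentence-final particle 「な」, symbol 「？」
Possibility: subsidiary verb 「できる」, auxiliary verb 「れる」, auxiliary verb 「られる」
In a concept expression, semantic tags are written in a form such as "(+negation+possibility)". A semantic tag can serve as a concept expression on its own, or it can be attached to a word, as in 「実行」(+possibility+negation). For example, the clause 「実行できない」 splits into 「実行／できる／ない」, so the semantic tags "(+possibility+negation)" are attached to it. The concept expression 「実行」(+possibility+negation) means that the word 「実行」 is in a clause to which the semantic tags "negation" and "possibility" are attached.
[0040]
With such concept expressions, the user can express a concept suited to the purpose by combining any number of words (independent words) and semantic tags.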
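[0041]
When, for example, 「FAX⇒受信」(+negation) is specified as the concept expression, the document processing device 1 performs the concept extraction processing in the concept expression extraction unit 6 as shown in FIG. 13.
[0042]
The document processing device 1 starts from the clause with clause ID k = 1 in the sentence with sentence ID s = 1 in the document with document ID d = 1 of the document data structure (step S101), and performs a match check that determines whether the structure from clause k onward matches the specified concept expression (step S102). In the example above, a structure such as the following matches.
[0043]
1) Clause k contains the words 「FAX」 and 「受信」 in this order, and the intention tag "negation" is attached to clause k.
[0044]
2) Clause k contains the word 「FAX」, the head clause k' on which clause k depends contains the word 「受信」, and the intention tag "negation" is attached to head clause k'.
[0045]
If there is no match in step S102, the document processing device 1 increments the clause ID k in sentence s = 1 (k = k + 1), moves to the next clause, and returns to the match check (step S102).
[0046]
If there is a match in step S102, the concept expression extraction unit 6 looks for words (independent words) that are strongly related to, and precede, the concept expression specified in the concept expression specification unit 7, and registers any independent word it finds in the concept expression extraction result list (step S103).
[0047]
In the example above, the following words qualify as such independent words.
[0048]
1) Clause k contains a word (independent word) X before the word 「FAX」.
[0049]
2) There is a dependent clause k' whose head clause is clause k, and clause k' contains a word (independent word) X.
[0050]
When the concept expression extraction unit 6 finds such a word (independent word), it registers in the concept expression extraction result list a new concept expression formed by adding the newly found word to the specified concept expression. For example, if the word 「カラー」 ("color") is newly found for 「FAX」 in the example above, the following concept expression is registered in the list.
[0051]
カラー⇒FAX⇒受信 (+negation)
The concept expression extraction result list also records, in the manner of FIG. 5, the total occurrence frequency of each registered concept expression and the number of documents in which it occurs. If the concept expression to be registered is already in the list, the concept expression extraction unit 6 adds 1 to its occurrence frequency and, if this is its first registration from the document being processed, adds 1 to its document count. When a concept expression is registered in the list for the first time, its occurrence frequency and document count are both set to 1.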
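The counting rule for the extraction result list can be sketched with a small class. The class name `ResultList` and its methods are hypothetical; the sketch only assumes that each registration carries the ID of the document being processed, so that the document count rises once per document while the frequency rises on every registration.

```python
from typing import Dict, List, Tuple


class ResultList:
    """Extraction result list with total frequency and document count."""

    def __init__(self) -> None:
        # expression -> [frequency, set of document IDs seen so far]
        self._entries: Dict[str, List] = {}

    def register(self, expression: str, doc_id: int) -> None:
        entry = self._entries.get(expression)
        if entry is None:
            # First registration: frequency 1, one document.
            self._entries[expression] = [1, {doc_id}]
        else:
            entry[0] += 1           # every registration counts toward frequency
            entry[1].add(doc_id)    # a document is counted only once

    def counts(self, expression: str) -> Tuple[int, int]:
        frequency, doc_ids = self._entries[expression]
        return frequency, len(doc_ids)


results = ResultList()
results.register("カラー⇒FAX⇒受信(+negation)", doc_id=1)
results.register("カラー⇒FAX⇒受信(+negation)", doc_id=1)
results.register("カラー⇒FAX⇒受信(+negation)", doc_id=2)
```

Storing the set of document IDs rather than a bare counter is a design choice of this sketch; it makes the "first registration from this document" test a simple set insertion.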
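[0052]
Next, the concept expression extraction unit 6 looks for words (independent words) that are strongly related to, and follow, the concept expression specified in the concept expression specification unit 7, and registers any independent word it finds in the concept expression extraction result list (step S104).
[0053]
In the example above, the following words qualify as such independent words.
[0054]
3) Clause k contains a word (independent word) X after the word 「FAX」.
[0055]
4) There is a head clause k' on which clause k depends, and clause k' contains a word (independent word) X.
[0056]
When the concept expression extraction unit 6 finds such a word (independent word), it registers in the concept expression extraction result list a new concept expression formed by adding the newly found word to the specified concept expression. For example, if the word 「症状」 ("symptom") is newly found for 「FAX」 in the example above, the following concept expression is registered in the list.
[0057]
FAX⇒受信 (+negation)⇒症状
The concept expression extraction unit 6 also records in the list, in the manner of FIG. 5, the total occurrence frequency of each registered concept expression and the number of documents in which it occurs. If the concept expression to be registered is already in the list, the unit adds 1 to its occurrence frequency and, if this is its first registration from the document being processed, adds 1 to its document count. When a concept expression is registered for the first time, its occurrence frequency and document count are both set to 1.
[0058]
Next, the concept expression extraction unit 6 checks whether the last clause of the document data structure that matches the specified concept expression carries semantic tags that are not part of the specified concept expression, and registers any such semantic tag in the concept expression extraction result list (step S105).
[0059]
For the concept expression 「FAX⇒受信」(+negation) of the example above, if the last clause k' of the matching document data structure carries a semantic tag other than "negation", the concept expression extraction unit 6 extracts that semantic tag X. When such a semantic tag is found, the unit registers in the list a new concept expression formed by adding the newly found semantic tag to the specified concept expression. For example, if the semantic tag "possibility" is newly found in the example above, the following concept expression is registered in the list.
[0060]
FAX⇒受信 (+negation+possibility)
If several semantic tags are extracted, the concept expression extraction unit 6 registers several concept expressions, one for each added tag. The list also records the total occurrence frequency and document count of each registered concept expression. If the concept expression to be registered is already in the list, the unit adds 1 to its occurrence frequency and, if this is its first registration from the document being processed, adds 1 to its document count. When a concept expression is registered for the first time, its occurrence frequency and document count are both set to 1.
[0061]
Next, the concept expression extraction unit 6 checks whether the clause with clause ID k (clause k) is the last clause in the sentence with sentence ID s (sentence s) (step S106). If clause k is not the last clause in sentence s, the unit increments k by 1 (k = k + 1), returns to step S102, and processes the next clause in the same way, starting from the match check (steps S102 to S106).
[0062]
If clause k is the last clause in sentence s in step S106, the concept expression extraction unit 6 checks whether sentence s is the last sentence in the document with document ID d (document d) (step S107). If sentence s is not the last sentence in document d, the unit increments s by 1 (s = s + 1), returns to step S102, and processes the next sentence in the same way, starting from the match check (steps S102 to S107).
[0063]
If sentence s is the last sentence in document d in step S107, the concept expression extraction unit 6 checks whether document d is the last document (step S108). If it is not the last document, the unit increments d by 1 (d = d + 1), returns to step S102, and processes the next document in the same way, starting from the match check (steps S102 to S108).
[0064]
If document d is the last document in step S108, the concept expression extraction unit 6 judges that concept extraction has been completed for all documents and ends the processing.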
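Steps S101 to S108 can be put together in one loop. The sketch below is an interpretation, not the flowchart of FIG. 13 itself: clauses are dictionaries with hypothetical keys `words` (independent words), `tags`, and `head`; the nested `for` loops replace the explicit counters k, s, and d; and following words are looked up relative to the end of the matched expression, which is one reading of conditions 3) and 4). The corpus is invented for the example.

```python
from collections import defaultdict


def match_at(sent, k, words, tags):
    """S102: if the expression starts in clause k, return
    (start position, last clause index, last word position), else None."""
    def walk(ci, wi, ei):
        if sent[ci]["words"][wi] != words[ei]:
            return None
        if ei == len(words) - 1:
            return (ci, wi) if set(tags) <= sent[ci]["tags"] else None
        for j in range(wi + 1, len(sent[ci]["words"])):   # same clause
            end = walk(ci, j, ei + 1)
            if end is not None:
                return end
        head = sent[ci]["head"]
        if head is not None:                               # head clause
            for j in range(len(sent[head]["words"])):
                end = walk(head, j, ei + 1)
                if end is not None:
                    return end
        return None

    for wi in range(len(sent[k]["words"])):
        end = walk(k, wi, 0)
        if end is not None:
            return (wi,) + end
    return None


def extract(corpus, words, tags):
    freq, doc_ids = defaultdict(int), defaultdict(set)

    def register(expression, d):
        freq[expression] += 1
        doc_ids[expression].add(d)

    def fmt(extra_tags=()):
        all_tags = list(tags) + list(extra_tags)
        suffix = "(" + "".join("+" + t for t in all_tags) + ")" if all_tags else ""
        return "⇒".join(words) + suffix

    for d, sentences in corpus.items():                 # documents (S108)
        for sent in sentences:                          # sentences (S107)
            for k in range(len(sent)):                  # clauses (S101, S106)
                m = match_at(sent, k, words, tags)      # S102
                if m is None:
                    continue
                start, last, last_pos = m
                # S103: preceding words in clause k or in its dependents.
                before = sent[k]["words"][:start]
                before += [w for c in sent if c["head"] == k for w in c["words"]]
                for x in before:
                    register(x + "⇒" + fmt(), d)
                # S104: following words in the last clause or in its head.
                after = sent[last]["words"][last_pos + 1:]
                if sent[last]["head"] is not None:
                    after += sent[sent[last]["head"]]["words"]
                for x in after:
                    register(fmt() + "⇒" + x, d)
                # S105: tags on the last clause that were not specified.
                for t in sorted(sent[last]["tags"] - set(tags)):
                    register(fmt([t]), d)
    return {e: (freq[e], len(doc_ids[e])) for e in freq}


corpus = {
    1: [[{"words": ["カラー", "FAX"], "tags": set(), "head": 1},
         {"words": ["受信"], "tags": {"possibility", "negation"}, "head": 2},
         {"words": ["症状"], "tags": set(), "head": 3},
         {"words": ["出る"], "tags": set(), "head": None}]],
    2: [[{"words": ["カラー", "FAX"], "tags": set(), "head": 1},
         {"words": ["受信"], "tags": {"possibility", "negation"}, "head": None}]],
}
result = extract(corpus, ["FAX", "受信"], ["negation"])
```

Each value in `result` is a pair of total frequency and document count, matching the two figures the result list is said to record.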
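[0065]
The document processing device 1 then displays the result extracted by the concept expression extraction unit 6 on the concept expression display unit 8, as shown in FIGS. 7 to 12. First, as shown in FIGS. 7 and 8, it displays the word list generated by the document data structure generation unit 4 (see FIG. 5) on the left, and on the right it displays the result of concept expression extraction performed with the word the user specified in the concept expression specification unit 7 as the specified concept expression. The extraction result is displayed in one of two selectable modes: narrowing by word, shown in FIG. 7, and narrowing by intention tag, shown in FIG. 8. Furthermore, when the user selects a concept expression displayed as an extraction result on the concept expression display unit 8, the document processing device 1 can perform concept expression extraction again with the selected concept expression as the specified concept expression. Concept expression extraction can thus be repeated.
[0066]
As shown in FIGS. 9 and 10, the document processing device 1 also initially displays the intention tag list on the left, and on the right it displays the result of extraction performed with the intention tag the user specified in the concept expression specification unit 7 as the specified intention tag. Here too the result is displayed in one of two selectable modes: narrowing by word, shown in FIG. 9, and narrowing by intention tag, shown in FIG. 10. When the user selects a concept expression displayed as an extraction result, the document processing device 1 can perform concept expression extraction again with the selected concept expression as the specified concept expression, so extraction can be repeated.
[0067]
Furthermore, as shown in FIGS. 11 and 12, the document processing device 1 stores the concept expressions the user has specified and displays them as a list. This list is likewise displayed for words, as in FIG. 11, and for intention tags, as in FIG. 12.
[0068]
In the displays of FIGS. 7 to 12, the device can also accept a notation entered by the user and display only the concept expressions that contain it, so that the user sees only the information needed.
[0069]
In addition, as shown in FIG. 14, the document processing device 1 displays a list of the documents that contain the specified concept expression. That is, the concept expression extraction unit 6 records the documents that contain the concept expression specified in the concept expression specification unit 7, and the list is displayed on the concept expression display unit 8 based on this record.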
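[0070]
As described above, in the document processing device 1 of this embodiment, the language analysis unit 3 performs morphological analysis and dependency analysis on the document (document set) input from the document input unit 2; the document data structure generation unit 4 converts the document, based on the analysis result, into a document data structure that holds its linguistic information; the generated document data structure is stored in the document data structure storage unit 5; and when any concept expression is specified in the concept expression specification unit 7, the concept expression extraction unit 6 extracts, from the stored document data structure, the concept expressions strongly related to the specified concept expression, and the concept expression display unit 8 displays the extraction result.
[0071]
The user can therefore freely explore the concept expressions contained in the documents and discover concept expressions that are needed even though they are not characteristic, which improves usability.
[0072]
In addition, in the document processing device 1 of this embodiment, the language analysis unit 3 performs morphological analysis and dependency analysis on the document input from the document input unit 2; the document data structure generation unit 4 converts the document, based on the analysis result, into a document data structure that holds its linguistic information; the generated document data structure is stored in the document data structure storage unit 5; the concept expressions available for selection are displayed on the concept expression display unit 8 based on the document data structure; and when any concept expression is selected in the concept expression specification unit 7 from those displayed, the concept expression extraction unit 6 extracts, from the stored document data structure, the concept expressions strongly related to the specified concept expression, and the extraction result is displayed on the concept expression display unit 8.
[0073]
The user can therefore select from the displayed concept expressions as appropriate, freely explore the concept expressions contained in the documents, and simply and easily discover concept expressions that are needed even though they are not characteristic, which further improves usability.
[0074]
Furthermore, in the document processing device 1 of this embodiment, the document data structure generation unit 4 generates the document data structure by adding, based on the synonym dictionary, representative-notation information to words that are synonymous but written differently.
[0075]
Words that have the same meaning but different notations can therefore be handled as the same word, so the user can explore the concept expressions contained in the documents still more freely and discover needed but uncharacteristic concept expressions still more easily, which further improves usability.
[0076]
In the document processing device 1 of this embodiment, the semantic tags that the document data structure generation unit 4 assigns to words or clauses in the document data structure, based on the attached-word expressions within a clause and the like, to express additional meaning are also used for specifying concept expressions in the concept expression specification unit 7, extracting them in the concept expression extraction unit 6, and displaying them on the concept expression display unit 8.
[0077]
Meaning can therefore be specified in finer detail and the needed concept expressions can be discovered more appropriately, which further improves usability.
[0078]
Furthermore, the document processing device 1 of this embodiment holds a history of the concept expressions specified in the concept expression specification unit 7 and, based on the history, redisplays on the concept expression display unit 8 the extraction result for a concept expression specified in the past.
[0079]
The user can therefore return at once to an earlier working state, which further improves usability.
[0080]
The document processing device 1 of this embodiment also displays, on the concept expression display unit 8, the document, or a part of the document, that contains the concept expression specified in the concept expression specification unit 7.
[0081]
The user can therefore see how the specified concept expression actually appears in the documents and understand the concept expression better, which further improves usability.
[0082]
The invention made by the present inventor has been described above in concrete terms based on preferred embodiments. The present invention is not limited to these embodiments, and it goes without saying that various modifications are possible without departing from its gist.
[0083]
[Effects of the Invention]
According to the present invention, the user can freely explore the concept expressions contained in documents and discover concept expressions that are needed even though they are not characteristic, which improves usability. In particular, semantic tags that express additional meaning, namely "negation", "request", "question", "possibility", and the like, are assigned to the words or clauses in the document data structure based on the independent words and attached words within a clause, and these semantic tags are used in specifying, extracting, and displaying concept expressions. Meaning can therefore be specified in finer detail and the needed concept expressions can be discovered more appropriately, which further improves usability.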
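[Brief Description of the Drawings]
[FIG. 1] Block diagram of the main parts of a document processing device to which one embodiment of the document processing device, document processing method, and recording medium of the present invention is applied.
[FIG. 2] An example of the language analysis performed by the language analysis unit of FIG. 1.
[FIG. 3] An example of the conversion of each document of a document set into a data structure by the document data structure generation unit of FIG. 1.
[FIG. 4] An example of the information held by each constituent of the data structure of FIG. 3.
[FIG. 5] An example of the ID, part of speech, occurrence frequency, document count, and synonym representative notation assigned to the list of words contained in a document or document set, as generated by the document data structure generation unit of FIG. 1.
[FIG. 6] An example of the screen displayed on the concept expression display unit when the user specifies a concept expression by direct entry.
[FIG. 7] An example of the screen displayed on the concept expression display unit when the user selects a concept expression by word and narrows down by word.
[FIG. 8] An example of the screen displayed on the concept expression display unit when the user selects a concept expression by word and narrows down by intention tag.
[FIG. 9] An example of the screen displayed on the concept expression display unit when the user selects by intention tag and narrows down by word.
[FIG. 10] An example of the screen displayed on the concept expression display unit when the user selects by intention tag and narrows down by intention tag.
[FIG. 11] An example of the screen displayed on the concept expression display unit when the user selects from the history and narrows down by word.
[FIG. 12] An example of the screen displayed on the concept expression display unit when the user selects from the history and narrows down by intention tag.
[FIG. 13] Flowchart of the concept extraction processing performed by the document processing device of FIG. 1.
[FIG. 14] An example of the screen that displays, on the concept expression display unit, the list of documents containing the specified concept expression.
[Explanation of Reference Signs]
1 Document processing device
2 Document input unit
3 Language analysis unit
4 Document data structure generation unit
5 Document data structure storage unit
6 Concept expression extraction unit
7 Concept expression specification unit
8 Concept expression display unit
[0001]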
[Technical field of the invention]
  The present invention relates to a document processing device, and in particular to a document processing device that helps a user find the concept expressions the user needs by exploring, through search and narrowing down, the concept expressions contained in a document or document set.
[0002]
[Prior art]
  Recently, the digitization of information has progressed, and documents conventionally stored on paper have also been digitized. Along with this, large numbers of digitized documents are distributed, and how to manage and easily reuse the collected and accumulated documents has become an important issue. An analysis technique for deriving findings from a large volume of electronic documents is therefore desired. Such a technique grasps the content of each group of documents while grouping a large number of documents by content. Conventionally, two methods have been proposed: a user-driven method in which the user enters group definition expressions and grouping is done with search techniques, and an automatic method that groups documents with a clustering technique based on the co-occurrence of words and phrases in the documents.
[0003]
  As a conventional grouping technique, the "data analysis system" of Patent Document 1, for example, has been proposed. This data analysis system extracts concepts (keywords within clauses) from text and uses a category dictionary (thesaurus) specialized for the purpose to convert the expressions in the documents into labeled data. It also analyzes the dependency relations between concepts, treats combinations of concepts as concepts, and extracts characteristic concepts from concept frequencies and cross tabulations.
[0004]
  The document processing apparatus of Patent Document 2 has also been proposed. It comprises a morphological analysis unit that performs morphological analysis on an input sentence; a specific-expression candidate acquisition unit that takes subsequences of the morpheme sequence as weighted specific-expression candidates; a specific-expression dictionary that stores a number of specific expressions in advance; a specific-expression dictionary search unit that outputs, as the search result, a real number representing how well a morpheme sequence matches the expressions in the dictionary; a discriminant analysis execution unit that calculates a discriminant score for each candidate, using as variables the weight given to the candidate and the candidate's dictionary search result, and excludes candidates whose score falls below a certain value; and an output unit that outputs, as specific expressions, the morpheme strings of the candidates that were not excluded. By calculating the discriminant score, it aims to judge accurately whether to keep each candidate.
[0005]
  That is, this prior art performs word analysis and dependency analysis on text, groups sentences according to the similarity of their sentence structures, and extracts strongly correlated items from the correlation between the numbers of appearances of keywords extracted from the text and the grouped sentences.
[0006]
[Patent Document 1]
JP 2001-75966 A
[Patent Document 2]
JP 2000-172691 A
[0007]
[Problems to be solved by the invention]
  However, such conventional techniques needed improvement in allowing a user to freely explore the concepts contained in a document or document set and so discover the concepts that are important for analyzing the document data.
[0008]
  That is, the prior art described in Patent Document 1 extracts concepts from text, attaches category labels based on a category dictionary, analyzes the dependency relations between concepts, treats combinations of concepts as concepts, and extracts characteristic concepts from concept frequencies and cross tabulations. The category dictionary must therefore be created in advance, and creating and maintaining it is a burden. In addition, because characteristic concepts are extracted from frequency information, a concept cannot be extracted unless it appears often.
[0009]
  Likewise, the prior art described in Patent Document 2 performs word analysis and dependency analysis on text, groups sentences according to the similarity of their sentence structures, and extracts strongly correlated items from the correlation between the numbers of appearances of keywords extracted from the text and the grouped sentences. An item therefore cannot be extracted unless it appears often.
[0010]
  However, in the analysis of document data, especially in the analysis of data such as questionnaires, even a concept that is not characteristic, for example, a concept that is not frequent, may be a concept that a user needs for analysis. In the case of such a concept, whether or not it is necessary depends on the user's intention and purpose, and it is difficult to automatically extract it. Therefore, in order to support the discovery of such a concept, it is necessary to assist the user in freely searching for and discovering the concept included in the document or document set.
[0011]
  Therefore, an object of the present invention is to provide a document processing device that helps a user find the concept expressions the user needs by exploring, through search and narrowing down, the concept expressions contained in a document or document set.
[0012]
[Means for Solving the Problems]
  The invention according to claim 1 is a document processing apparatus comprising: language analysis means that performs morphological analysis and dependency analysis on input document data; document data structure generation means that generates, based on the language analysis result, a document data structure holding the linguistic information of the document data and, when a specific independent word or attached word appears in a clause, assigns to the words or clauses in the document data structure a semantic tag indicating negation, request, question, or possibility according to the word that appeared; document data structure storage means that stores the tagged document data structure; concept expression specification means that accepts a concept expression consisting of a combination of any number of words specified by the user and semantic tags of words or clauses in the document data structure; and concept expression extraction means that extracts, from the stored document data structure, concept expressions related to the accepted concept expression.
[0013]
  The invention of claim 2 is the document processing apparatus according to claim 1, further comprising concept expression display means that displays, based on the document data structure, the concept expressions available for selection, wherein the concept expression specification means accepts the specification of any concept expression from those displayed on the concept expression display means, and the concept expression display means displays the extraction result of the concept expression extraction means.
[0014]
  The invention of claim 3 is the document processing apparatus according to claim 1 or 2, further comprising a synonym dictionary, wherein the document data structure generation means generates the document data structure by adding, based on the synonym dictionary, representative-notation information to words that are synonymous but written differently.
[0015]
  The invention of claim 4 is the document processing apparatus according to claim 2 or 3, further comprising means for holding a history of the concept expressions accepted by the concept expression specification means, wherein, based on the history, the extraction result for a concept expression specified in the past is redisplayed on the concept expression display means.
[0016]
  The invention of claim 5 is the document processing apparatus according to any one of claims 2 to 4, wherein the concept expression display means displays the document data, or a part of the document data, that contains the concept expression accepted by the concept expression specification means.
[0017]
  The invention of claim 6 is the document processing apparatus according to any one of claims 1 to 5, wherein the concept expression extraction means checks whether each document data structure stored in the document data structure storage means matches the concept expression accepted by the concept expression specification means (hereinafter, the specified concept expression), extracts from a matching document data structure the words in a dependency relation with the words of the specified concept expression, extracts the semantic tags not in the specified concept expression, and creates concept expressions containing the extracted words and semantic tags.
[0018]
  Hereinafter, preferred embodiments of the invention are described in detail with reference to the accompanying drawings. Because the embodiments described below are preferred embodiments of the present invention, various technically preferable limitations are attached to them. The scope of the present invention is nevertheless not restricted to these aspects unless the following description specifically states that it limits the invention.
[0019]
  FIGS. 1 to 14 show an embodiment of the document processing apparatus, document processing method, and recording medium according to the present invention. FIG. 1 is a block diagram of a document processing apparatus 1 to which this embodiment is applied.
[0020]
  In FIG. 1, the document processing apparatus 1 includes a document input unit 2, a language analysis unit 3, a document data structure generation unit 4, a document data structure storage unit 5, a concept expression extraction unit 6, a concept expression specification unit 7, a concept expression display unit 8, and so on. It is constructed by, for example, having a computer read and install a recording medium such as a CD-ROM (Compact Disc Read Only Memory) on which the document processing program and the necessary data are recorded.
[0021]
  The document input unit (document input means) 2 inputs the documents to be processed and can accept a plurality of documents. When a plurality of documents is input, an identifier is assigned to each document, and the documents are stored and managed in a storage unit or the like. In the following description, it is assumed that a document set is input from the document input unit 2.
[0022]
  The language analysis unit (language analysis means) 3 executes processing steps such as morphological analysis, which analyzes the morphemes of the document set input from the document input unit 2, and dependency analysis, which analyzes the dependency relations between the clauses of each document, and outputs the linguistic attributes obtained to the document data structure generation unit 4 per unit of analysis. Specifically, morphological analysis identifies the words contained in each document of the set, and dependency analysis analyzes the sentences and clauses contained in a document and identifies which clauses are in a dependent-head relation. For example, when analyzing the sentence 「ソフトウェアのインストールが正常に実行できない。」 ("The software installation cannot be executed normally."), the language analysis unit 3 performs morphological analysis and then dependency analysis, as shown in FIG. 2. FIG. 2 shows the analysis result for this sentence, where "/" marks word boundaries and, above each word, 「自」 marks an independent word and 「付」 an attached word. In FIG. 2, clause 1 (the independent word 「ソフトウェア」 with the attached word 「の」) depends on clause 2 (the independent word 「インストール」 with the attached word 「が」); clause 3 (the independent word 「正常」 with the attached word 「に」) depends on clause 4 (the independent word 「実行」 with the two attached words 「でき」 and 「ない」); and clause 2 also depends on clause 4.
[0023]
  The document data structure generation unit (document data structure generation means) 4 converts each document in the document set into a data structure as shown in FIG. 3, based on the analysis result of the language analysis unit 3, and each constituent holds information as shown in FIG. 4. The unit also generates a word list, as shown in FIG. 5, that assigns a unique identifier to each word contained in the document or document set, and manages the words with it. In doing so it calculates and adds part-of-speech information and the overall appearance frequency or the number of documents in which the word appears.
[0024]
  That is, in the data structure of FIG. 3, as shown in FIG. 4, a document manages the list of IDs of the sentences it contains, and a sentence manages its own sentence ID and the list of clauses it contains. A clause manages its own clause ID, the list of IDs of the words it contains, a dependent-clause ID list, and a head-clause ID. A word ID is an ID in the word list shown in FIG. 5. The dependent-clause ID list holds the IDs of the clauses that depend on the clause in question; since a plurality of clauses can depend on one head clause, they are managed as a list. The head-clause ID is the ID of the clause on which the clause in question depends, and a dependent clause can have only one head clause.
[0025]
  The document data structure generation unit 4 can also hold, as information managed by a clause, the type of dependency relation, for example whether it is adnominal modification or adverbial modification, and can describe the type of relation by the kind of particle that links the clauses.
[0026]
  Further, the document data structure generation unit 4 has a synonym dictionary and can attach representative-notation information to words that have synonyms. This can be realized by having a synonym representative notation as an item of the word list, as shown in FIG. 5.
[0027]
  Further, the document data structure generation unit 4 assigns a semantic tag representing an additional meaning to a word or phrase in the document data structure from an attached word expression or the like in the phrase, and a concept expression specifying unit 7 or a concept expression extraction. Part 6,Conceptual expression display section 8In addition, not only words but also semantic tags can be used as concept expressions. This meaning tag is used to add a meaning to a clause as a tag when an attached word or the like in the clause represents a specific additional meaning. For example, the meaning tags “cancellation”, “request”, “possible”, and “question” are added to the phrase when the following words appear in the phrase, and a plurality of phrases are added to one phrase. Semantic tags may be attached.In the following description, an intention tag is synonymous with a semantic tag.
[0028]
  Intent tag ID 1 “cancellation”: auxiliary verb “none”, auxiliary verb “z”, auxiliary verb “mai”, auxiliary auxiliary verb “difficult”, adjective “none”
  Intent tag ID 2 “Request”: auxiliary verb “I want”, verb “I want”, connected particle “te” + verb “I want”
  Intent tag ID 3 “question”: final particle “ka”, final particle “ka” + final particle “na”, symbol “?”
  Intent tag ID 4 “possible”: auxiliary verb “can”, auxiliary verb “re”, auxiliary verb “re”
In a concept expression, semantic tags are written in a form such as “(+ cancellation + possible)”. Semantic tags can be used alone as a concept expression, or can be attached to a word, as in “execution (+ possible + cancellation)”.
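The rule list above amounts to a lookup from a morpheme to a tag. The following is a rough sketch under stated assumptions: each morpheme is a (part of speech, surface form) pair, only single-morpheme rules are covered (the “te” + verb combination is left out), and the surface forms are copied from the translated list as placeholders rather than being the actual Japanese morphemes. The names `TAG_RULES` and `semantic_tags` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Morpheme = Tuple[str, str]  # (part of speech, surface form)

# Placeholder rules taken from the translated list; not an exhaustive table.
TAG_RULES: Dict[Morpheme, str] = {
    ("auxiliary verb", "none"): "cancellation",
    ("adjective", "none"): "cancellation",
    ("final particle", "ka"): "question",
    ("symbol", "?"): "question",
    ("auxiliary verb", "can"): "possible",
}


def semantic_tags(phrase: Iterable[Morpheme]) -> Set[str]:
    """Collect every tag whose trigger morpheme appears in the phrase."""
    return {TAG_RULES[m] for m in phrase if m in TAG_RULES}
```

Because the result is a set, a phrase that contains triggers for several tags receives all of them, which corresponds to a plurality of semantic tags being attached to one phrase.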
[0029]
  The document data structure storage unit (document data structure storage means) 5 stores and manages the document data structure generated by the document data structure generation unit 4.
[0030]
  The concept expression specifying unit (concept expression specifying means) 7 includes an input unit for characters and symbols, a pointing device such as a mouse, and the like, and is used by the user to specify a concept expression. The concept expression display unit (concept expression display means) 8 is a CRT (Cathode Ray Tube), an LCD (Liquid Crystal Display), or the like, and displays various data that the document processing apparatus 1 needs in order to process documents, such as the concept expressions extracted by the concept expression extraction unit 6 and the concept expression designation screen (concept browser) used with the concept expression specifying unit 7.
[0031]
  As a method of designating the concept expression with the concept expression specifying unit 7, it is possible to use, for example, a direct entry method, in which an input dialog as shown in FIG. 6 is displayed on the concept expression display unit 8 and the user enters the concept expression directly, or a selection method, in which a concept display screen (concept browser) as shown in FIGS. 7 to 12 is displayed on the concept expression display unit 8 and the user selects the concept expression to be designated from the displayed concept expressions with a pointing device such as a mouse. In the examples of FIGS. 7 and 8, the word list information is displayed as the initial state, and the user selects the word to be designated (in FIG. 7, “Receive” is selected). In FIGS. 7 and 8, when a word is selected as the concept expression and the “narrow down” button is operated, concept expression extraction is executed and its result is displayed on the right side. With the selection method, the user can explore concept expressions repeatedly, selecting the concept expression to be designated next from those displayed on the concept expression display unit 8, and can therefore specify concept expressions more efficiently than with the direct entry method of FIG. 6.
[0032]
  The concept expression extraction unit (concept expression extraction means) 6 extracts, from each document in the document set, concept expressions (words (independent words) or intention tags) that have a strong relationship with the concept expression specified by the concept expression specifying unit 7, calculates their frequencies, and sends the concept expression extraction result to the concept expression display unit 8 for display.
[0033]
  Next, the operation of the present embodiment will be described. The document processing apparatus 1 according to the present embodiment performs document processing based on the concept expressions designated by the user.
[0034]
  First, the concept expression, which is the basic idea of this embodiment, will be explained. Documents handled in the present embodiment are basically written in Japanese, and concept expressions are expressed in units of words (independent words). A concept may be represented by a single word or by a plurality of words. For example, the following concept expressions are used.
[0035]
  1) Search
  2) Information ⇒ Search
  3) Information ⇒ Search ⇒ Service
  4) Software ⇒ Install (+ possible + cancel)
Note that “⇒” indicates that there is a strong semantic relationship between words. A strong semantic relationship here means that the words (independent words) appear in the same phrase, or that they appear in a pair of phrases that are in a dependency relationship. For example, “information ⇒ search” means that “information” and “search” appear in the same phrase, or that “information” and “search” appear in a phrase pair that has a dependency relationship, as shown below.
[0036]
  Same phrase: “Information search”
  Phrase pair in a dependency relationship: “Information” → “Search”
Also, the direction of “⇒” represents the order of appearance of words, and if the order of appearance is reversed, the meaning may be different, so the word order is important.
[0037]
  In a concept expression, any number of words (independent words) can be connected. For example, in example 3) above, three words are connected. In this case, the expression means that the three words appear consecutively, in a strong relationship, in the word order “information”, “search”, “service”. Therefore, A to D shown below conform to the concept expression of example 3) above, but E and F do not.
[0038]
  A: "Information search service"
  B: "Information search service"
  C: "Information search service"
  D: "Information search service"
  E: “Service that automatically searches for information”
  F: “Search Information Service”
Example 4) above shows a case in which semantic tags are added. When an expression such as an attached word in a phrase represents a specific additional meaning, the semantic tag carries that meaning as a tag. For example, the semantic tags “cancellation”, “request”, “possible”, and “question” are added to a phrase when the following words appear in it. A plurality of semantic tags may be attached to one phrase.
[0039]
  Cancellation: Auxiliary verb "None", Auxiliary verb "Zu", Auxiliary verb "Mai", Auxiliary auxiliary verb "Hard", Adjective "None"
  Request: auxiliary verb "tai", verb "wanted", connecting particle "te" + verb "wanted"
  Question: Final particle "ka", final particle "ka" + final particle "na", symbol "?"
  Possible: auxiliary verb “can”, auxiliary verb “re”, auxiliary verb “re”
In a concept expression, semantic tags are written in a form such as “(+ cancellation + possible)”. A semantic tag can be a concept expression on its own, or can be attached to a word, as in “execution (+ possible + cancellation)”. For example, the phrase “unexecutable” is divided into “execution / can / not”, and the semantic tags “(+ possible + cancellation)” are added to this phrase; “cancellation” here denotes negation. The concept expression “execution (+ possible + cancellation)” means that the word “execution” appears in a phrase to which the semantic tags “cancellation” and “possible” are added.
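The written notation can be read mechanically. The sketch below is one possible parser, given only as an illustration: it assumes that elements are separated by “⇒” and that tags follow a word in parentheses, each prefixed by “+”. The names `Element` and `parse_concept_expression` are hypothetical.

```python
import re
from typing import FrozenSet, List, NamedTuple, Optional


class Element(NamedTuple):
    word: Optional[str]   # None when the element consists of tags only
    tags: FrozenSet[str]  # semantic tags attached to the element


# An optional word followed by an optional parenthesised tag group.
_ELEMENT = re.compile(r"^\s*(?P<word>[^()]*?)\s*(?:\((?P<tags>[^()]*)\))?\s*$")


def parse_concept_expression(text: str) -> List[Element]:
    """Split a concept expression on '⇒' and read each word and its tags."""
    elements = []
    for part in text.split("⇒"):
        m = _ELEMENT.match(part)
        if m is None:
            raise ValueError(f"malformed element: {part!r}")
        word = m.group("word") or None
        raw = m.group("tags") or ""
        tags = frozenset(t.strip() for t in raw.split("+") if t.strip())
        if word is None and not tags:
            raise ValueError("empty element")
        elements.append(Element(word, tags))
    return elements
```

Tags are stored as a set, so “(+ possible + cancellation)” and “(+ cancellation + possible)” are treated as the same tag combination.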
[0040]
  By using such a concept expression, the user can express a concept expression according to the purpose by a combination of an arbitrary number of words (independent words) and a semantic tag.
[0041]
  Then, for example, when “FAX ⇒ reception (+ cancellation)” is designated as the concept expression, the document processing apparatus 1 performs the concept extraction process in the concept expression extraction unit 6 as shown in FIG. 13.
[0042]
  First, the document processing apparatus 1 starts from the phrase with phrase ID k = 1 in the sentence with sentence ID s = 1 in the document with document ID d = 1 in the document data structure (step S101), and performs a matching determination process that checks whether the structure from phrase k onward matches the designated concept expression (step S102). In the above example, the following structures match.
[0043]
  1) The phrase k includes the words “FAX” and “reception” in this order, and the intention tag “cancellation” is added to the phrase k.
[0044]
  2) The phrase k includes the word “FAX”, the receiving phrase k′ for which the phrase k is a related phrase includes the word “reception”, and the intention tag “cancellation” is added to the receiving phrase k′.
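The two matching structures can be checked with a small recursive search. The following is a sketch under assumptions, not the apparatus's actual procedure: phrases are held in a dictionary keyed by phrase ID, words are stored as strings rather than word IDs, “in this order” is read as “at a later position in the same phrase”, and a tag requirement is checked against the phrase that contains the word. `Phrase` and `match_from` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Phrase:
    words: List[str]
    tags: Set[str] = field(default_factory=set)
    receiving: Optional[int] = None  # ID of the receiving phrase, if any


Expr = List[Tuple[str, Set[str]]]  # [(word, required tags), ...]


def match_from(phrases: Dict[int, Phrase], k: int, expr: Expr) -> Optional[int]:
    """Return the ID of the rearmost matched phrase, or None if no match."""

    def step(pid: int, start: int, idx: int) -> Optional[int]:
        word, tags = expr[idx]
        phrase = phrases[pid]
        for pos in range(start, len(phrase.words)):
            if phrase.words[pos] != word or not tags <= phrase.tags:
                continue
            if idx == len(expr) - 1:
                return pid
            # Case 1: the next word appears later in the same phrase.
            found = step(pid, pos + 1, idx + 1)
            # Case 2: the next word appears in the receiving phrase.
            if found is None and phrase.receiving is not None:
                found = step(phrase.receiving, 0, idx + 1)
            if found is not None:
                return found
        return None

    return step(k, 0, 0)
```

Returning the rearmost matched phrase is convenient because the later steps look for additional words and tags relative to that phrase.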
[0045]
  If the structure does not match in step S102, the document processing apparatus 1 increments the phrase ID k within the sentence ID: s (k = k + 1), moves to the next phrase, and returns to the matching determination process (step S102).
[0046]
  If the structure matches in step S102, the concept expression extraction unit 6 searches for a word (independent word) that has a strong relationship in front of the concept expression specified by the concept expression specifying unit 7 and, when such a word is found, registers it in the concept expression extraction result list (step S103).
[0047]
  For example, in the case of the above example, the following words are suitable as the independent words.
[0048]
  1) The phrase k includes the word (independent word) X before the word “FAX”.
[0049]
  2) There is a related clause k 'in which the clause k is a receiving clause, and a word (independent word) X exists in the clause k'.
[0050]
  When such a word (independent word) is found, the concept expression extraction unit 6 registers it in the concept expression extraction result list. What is registered is a new concept expression formed by adding the newly found word to the designated concept expression. For example, when the word “color” is newly found for “FAX” in the above example, the following concept expression is registered in the concept expression extraction result list.
[0051]
  Color ⇒ FAX ⇒ Receive (+ Cancel)
In the concept expression extraction result list, as shown in FIG. 5, the total appearance frequency and the number of appearing documents of each registered concept expression are also registered. When the concept expression to be registered has already been registered, the concept expression extraction unit 6 adds “1” to its appearance frequency, and also adds “1” to its number of appearing documents when this is the first registration from the document being processed. When the concept expression is registered in the list for the first time, the concept expression extraction unit 6 sets both the appearance frequency and the number of appearing documents to “1”.
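The counting rule can be illustrated with a short sketch. It assumes that concept expressions are stored as strings and that the list remembers which document IDs have already contributed to each expression; the class name `ExtractionResultList` and its attributes are hypothetical.

```python
from typing import Dict, Set


class ExtractionResultList:
    """Tracks appearance frequency and number of appearing documents."""

    def __init__(self) -> None:
        self.frequency: Dict[str, int] = {}
        self.doc_count: Dict[str, int] = {}
        self._docs_seen: Dict[str, Set[int]] = {}

    def register(self, expression: str, doc_id: int) -> None:
        # Every registration adds 1 to the appearance frequency
        # (a first registration therefore starts at 1).
        self.frequency[expression] = self.frequency.get(expression, 0) + 1
        # The document count grows only on the first registration
        # from a given document.
        docs = self._docs_seen.setdefault(expression, set())
        if doc_id not in docs:
            docs.add(doc_id)
            self.doc_count[expression] = self.doc_count.get(expression, 0) + 1
```

For example, registering the same expression twice from document 1 and once from document 2 gives an appearance frequency of 3 and a document count of 2.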
[0052]
  Next, the concept expression extraction unit 6 searches for a word (independent word) that has a strong relationship behind the concept expression specified by the concept expression specifying unit 7 and, when such a word is found, registers it in the concept expression extraction result list (step S104).
[0053]
  For example, in the case of the above example, the following words are suitable as the independent words.
[0054]
  3) The phrase k includes the word (independent word) X after the word “FAX”.
[0055]
  4) There is a receiving clause k 'in which the clause k is a related clause, and a word (independent word) X exists in the clause k'.
[0056]
  When such a word (independent word) is found, the concept expression extraction unit 6 registers it in the concept expression extraction result list. What is registered is a new concept expression formed by adding the newly found word to the designated concept expression. For example, when the word “symptom” is newly found for “FAX” in the above example, the following concept expression is registered in the concept expression extraction result list.
[0057]
  FAX ⇒ reception (+ cancellation) ⇒ symptom
As shown in FIG. 5, the concept expression extraction unit 6 also registers the total appearance frequency and the number of appearing documents of each registered concept expression in the concept expression extraction result list. When the concept expression to be registered has already been registered, the concept expression extraction unit 6 adds “1” to its appearance frequency, and also adds “1” to its number of appearing documents when this is the first registration from the document being processed. When the concept expression is registered in the list for the first time, the concept expression extraction unit 6 sets both the appearance frequency and the number of appearing documents to “1”.
[0058]
  Next, the concept expression extraction unit 6 checks whether a semantic tag that is not in the designated concept expression is attached to the last phrase of the document data structure that matches the designated concept expression, and if such a semantic tag is found, registers it in the concept expression extraction result list (step S105).
[0059]
  For example, in the case of the concept expression “FAX ⇒ reception (+ cancellation)” in the above example, if the rearmost phrase k′ of the document data structure that matches the designated concept expression has a semantic tag X other than “cancellation”, the concept expression extraction unit 6 extracts that semantic tag X. When such a semantic tag is found, the concept expression extraction unit 6 registers in the concept expression extraction result list a new concept expression formed by adding the newly found semantic tag to the designated concept expression. For example, when the semantic tag “possible” is newly found in the above example, the following concept expression is registered in the concept expression extraction result list.
[0060]
  FAX⇒Receive (+ Cancel + Possible)
Further, when a plurality of semantic tags are extracted, the concept expression extraction unit 6 registers a plurality of concept expressions, one for each added tag. In the concept expression extraction result list, the total appearance frequency and the number of appearing documents of each registered concept expression are also registered. When the concept expression to be registered has already been registered, the concept expression extraction unit 6 adds “1” to its appearance frequency, and also adds “1” to its number of appearing documents when this is the first registration from the document being processed. When the concept expression is registered in the list for the first time, the concept expression extraction unit 6 sets both the appearance frequency and the number of appearing documents to “1”.
[0061]
  Next, the concept expression extraction unit 6 checks whether the phrase with phrase ID k (phrase k) is the last phrase in the sentence with sentence ID s (sentence s) (step S106). If phrase k is not the last phrase in sentence s, k is incremented by “1” (k = k + 1), the process returns to step S102, and the next phrase is processed in the same manner, starting from the matching determination process (steps S102 to S106).
[0062]
  When phrase k is the last phrase in sentence s in step S106, the concept expression extraction unit 6 checks whether sentence s is the last sentence in the document with document ID d (document d) (step S107). If sentence s is not the last sentence in document d, s is incremented by “1” (s = s + 1), the process returns to step S102, and the next sentence is processed in the same manner, starting from the matching determination process (steps S102 to S107).
[0063]
  When sentence s is the last sentence in document d in step S107, the concept expression extraction unit 6 checks whether document d is the last document (step S108). If it is not the last document, d is incremented by “1” (d = d + 1), the process returns to step S102, and the next document is processed in the same manner, starting from the matching determination process (steps S102 to S108).
[0064]
  If the document d is the last document in step S108, the concept expression extraction unit 6 determines that the concept extraction process has been completed for all the documents, and ends the process.
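Steps S101 to S108 describe a traversal of every phrase of every sentence of every document. A minimal sketch of that control flow follows; it assumes the document set is given as nested lists (documents, then sentences, then phrases) and that the matching and extraction steps are supplied as callables. The function and parameter names are hypothetical.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # whatever type represents a phrase


def run_extraction(
    documents: List[List[List[P]]],
    matches: Callable[[P], bool],
    extract: Callable[[int, int, int, P], None],
) -> None:
    """Visit phrases in order of document ID d, sentence ID s, phrase ID k."""
    for d, document in enumerate(documents, start=1):       # until S108 ends
        for s, sentence in enumerate(document, start=1):    # until S107 ends
            for k, phrase in enumerate(sentence, start=1):  # until S106 ends
                if matches(phrase):                         # S102
                    extract(d, s, k, phrase)                # S103 to S105
```

The nested loops replace the explicit counter increments of the flowchart; IDs start at 1 to follow the text.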
[0065]
  Then, the document processing apparatus 1 displays the result extracted by the concept expression extraction unit 6 on the concept expression display unit 8, as shown in FIGS. 7 to 12. First, as shown in FIGS. 7 and 8, the document processing apparatus 1 displays the word list (see FIG. 5) generated by the document data structure generation unit 4 on the left side, and displays on the right side the result of concept expression extraction performed with the word that the user specified in the concept expression specifying unit 7 as the designated concept expression. For the display of the extraction result, one of two types is selected: narrowing down by word, shown in FIG. 7, or narrowing down by intention tag, shown in FIG. 8. Further, when the user selects a concept expression displayed as an extraction result on the concept expression display unit 8, the document processing apparatus 1 can perform concept expression extraction again with the selected concept expression as the designated concept expression. In this way, concept expression extraction can be repeated.
[0066]
  Further, as shown in FIGS. 9 and 10, the document processing apparatus 1 displays the intention tag list on the left side, and displays on the right side the result of extraction performed with the intention tag that the user specified in the concept expression specifying unit 7 as the designated intention tag. For the display of the extraction result, one of two types is selected: narrowing down by word, shown in FIG. 9, or narrowing down by intention tag, shown in FIG. 10. Further, when the user selects a concept expression displayed as an extraction result on the concept expression display unit 8, the document processing apparatus 1 can perform concept expression extraction again with the selected concept expression as the designated concept expression. In this way, concept expression extraction can be repeated.
[0067]
  Further, as shown in FIGS. 11 and 12, the document processing apparatus 1 stores the concept expressions designated by the user and displays them as a list. In this case as well, the display is narrowed down either by word, as shown in FIG. 11, or by intention tag, as shown in FIG. 12.
[0068]
  In addition, in the displays of FIGS. 7 to 12, it is possible to accept a notation input from the user and display only the concept expressions that include that notation, so that the user sees only the necessary information.
[0069]
  Further, as shown in FIG. 14, the document processing apparatus 1 displays a list of the documents that include the designated concept expression. That is, the concept expression extraction unit 6 records the documents that include the concept expression specified by the concept expression specifying unit 7, and those documents are displayed on the concept expression display unit 8 based on that record.
[0070]
  As described above, in the document processing apparatus 1 according to the present embodiment, the language analysis unit 3 performs morphological analysis and dependency analysis on the document (document set) input from the document input unit 2, the document data structure generation unit 4 converts the document into a document data structure holding linguistic information based on the language analysis result of the language analysis unit 3, and the generated document data structure is stored in the document data structure storage unit 5. When an arbitrary concept expression is designated through the concept expression specifying unit 7, the concept expression extraction unit 6 extracts, from the document data structure stored in the document data structure storage unit 5, concept expressions that have a strong relationship with the concept expression specified by the concept expression specifying unit 7, and the concept expression extraction result is displayed on the concept expression display unit 8.
[0071]
  Therefore, the user can freely explore the concept expressions included in the document and find concept expressions that are not characteristic but are nevertheless necessary, which improves usability.
[0072]
  Further, in the document processing apparatus 1 of the present embodiment, the language analysis unit 3 performs morphological analysis and dependency analysis on the document input from the document input unit 2, the document data structure generation unit 4 converts the document into a document data structure holding linguistic information based on the language analysis result of the language analysis unit 3, and the document data structure generated by the document data structure generation unit 4 is stored in the document data structure storage unit 5. Based on the document data structure, the concept expressions available for selection are displayed on the concept expression display unit 8. When an arbitrary concept expression is selected from the concept expressions displayed on the concept expression display unit 8 and designated through the concept expression specifying unit 7, the concept expression extraction unit 6 extracts, from the document data structure stored in the document data structure storage unit 5, concept expressions that have a strong relationship with the designated concept expression, and the concept expression extraction result is displayed on the concept expression display unit 8.
[0073]
  Therefore, the user can select appropriately from the displayed concept expressions, can freely explore the concept expressions included in the document, and can find necessary concept expressions that are not characteristic easily and in a short time, which further improves usability.
[0074]
  Furthermore, in the document processing apparatus 1 according to the present embodiment, the document data structure generation unit 4 generates the document data structure by adding representative notation information to synonyms and words with variant notations, based on the synonym dictionary.
[0075]
  Therefore, words that have the same meaning but different notations can be handled as the same word, the user can explore the concept expressions included in the document more freely, and necessary concept expressions that are not characteristic can be found more easily, which further improves usability.
[0076]
  Further, in the document processing apparatus 1 according to the present embodiment, the document data structure generation unit 4 adds semantic tags representing additional meanings to words or phrases in the document data structure, based on attached-word expressions in the phrase, and these semantic tags are used in the designation of concept expressions in the concept expression specifying unit 7, the extraction of concept expressions in the concept expression extraction unit 6, and the display of concept expressions in the concept expression display unit 8.
[0077]
  Therefore, it is possible to specify a more detailed meaning, and to more appropriately find a necessary concept expression, thereby further improving usability.
[0078]
  Furthermore, the document processing apparatus 1 according to the present embodiment holds a history of the concept expressions designated in the concept expression specifying unit 7 and, based on the history, redisplays on the concept expression display unit 8 the result of concept expression extraction for a concept expression designation performed in the past.
[0079]
  Therefore, it is possible to immediately return to the work state performed by the user in the past, and the usability can be further improved.
[0080]
  Further, the document processing apparatus 1 according to the present embodiment displays a document including the concept expression specified by the concept expression specifying unit 7 or a part of the document on the concept expression display unit 8.
[0081]
  Therefore, the user can see how the specified concept expression appears in the actual document, which deepens understanding of the concept expression and further improves usability.
[0082]
  The invention made by the present inventor has been specifically described above based on the preferred embodiments. Needless to say, however, the present invention is not limited to the above, and various modifications can be made without departing from the scope of the invention.
[0083]
[Effect of the invention]
  According to the present invention, a user can freely explore the concept expressions included in a document and find concept expressions that are not characteristic but are necessary, which improves usability. In particular, based on the independent words and ancillary words in a phrase, semantic tags representing additional meanings, namely “cancellation”, “request”, “question”, “possible”, and the like, are given to words or phrases in the document data structure, and these semantic tags are used for specifying, extracting, and displaying concept expressions. A more detailed meaning can therefore be specified, the necessary concept expression can be found more appropriately, and usability is improved.
[Brief description of the drawings]
FIG. 1 is a block diagram of a main part of a document processing apparatus to which an embodiment of a document processing apparatus, a document processing method, and a recording medium of the present invention is applied.
FIG. 2 is a diagram illustrating an example of language analysis in a language analysis unit in FIG. 1;
FIG. 3 is a diagram showing an example of conversion of a document set into a data structure of each document by the document data structure generation unit in FIG. 1;
FIG. 4 is a diagram showing an example of information on each component of each data structure in FIG. 3;
FIG. 5 is a diagram showing an example of ID, part of speech, appearance frequency, number of appearance documents, and synonym representative notation given to a word list included in a document or document set generated by the document data structure generation unit in FIG. 1.
FIG. 6 is a diagram illustrating an example of a screen displayed on a concept expression display unit when a user directly inputs and specifies a concept expression.
FIG. 7 is a diagram illustrating an example of a screen displayed on a concept expression display unit when a user selects and designates a concept expression by word and narrows down by word.
FIG. 8 is a diagram showing an example of a screen displayed on a concept expression display unit when a user selects and designates a concept expression with a word and narrows down with an intention tag.
FIG. 9 is a diagram showing an example of a screen displayed on the concept expression display unit when the user selects and specifies with an intention tag and narrows down with words.
FIG. 10 is a diagram illustrating an example of a screen displayed on a concept expression display unit when a user selects and designates with an intention tag and narrows down with the intention tag.
FIG. 11 is a diagram showing an example of a screen displayed on the conceptual expression display unit when a user selects and specifies a history and narrows down by word.
FIG. 12 is a diagram showing an example of a screen displayed on the concept expression display unit when a user selects and specifies a history and narrows down with an intention tag.
FIG. 13 is a flowchart showing concept extraction processing by the document processing apparatus of FIG. 1;
FIG. 14 is a diagram showing an example of a display screen of a document list including a designated concept expression on a concept expression display unit.
[Explanation of symbols]
    1 Document processing device
    2 Document input unit
    3 Language analysis unit
    4 Document data structure generation unit
    5 Document data structure storage unit
    6 Concept expression extraction unit
    7 Concept expression specifying unit
    8 Concept expression display unit

Claims

Language analysis means for performing morphological analysis and dependency analysis on input document data;
Document data structure generating means for generating, based on the result of language analysis by the language analysis means, a document data structure that holds the linguistic information of the document data, and for giving, when a specific independent word or ancillary word appears in a phrase, a semantic tag indicating cancellation, request, question, or possibility to the document data structure according to the specific independent word or ancillary word that has appeared;
Document data structure storage means for storing the document data structure, generated by the document data structure generating means, to which the semantic tag has been given;
A concept expression specifying means for accepting a concept expression consisting of a combination of an arbitrary number of words specified by the user and a word or phrase semantic tag in the document data structure;
A concept expression extracting means for extracting a concept expression related to the concept expression received by the concept expression specifying means from the document data structure stored in the document data structure storage means;
A document processing apparatus comprising:

Further comprising a concept expression display means for displaying a concept expression to be selected and specified based on the document data structure,
The concept expression specifying means accepts designation of an arbitrary concept expression from the concept expressions displayed on the concept expression display means,
The document processing apparatus according to claim 1, wherein the concept expression display means displays a concept expression extraction result obtained by the concept expression extracting means.

A synonym dictionary
3. The document data structure generating unit generates the document data structure by adding representative notation information to synonym / notation words based on the synonym dictionary. Document processing device.

The document processing apparatus according to claim 2 or 3, further comprising means for holding a history of the concept expressions received by the concept expression specifying means and for redisplaying on the concept expression display means, based on the history, the result of concept expression extraction for a concept expression designation performed in the past.

The document processing apparatus according to any one of claims 2 to 4, wherein the concept expression display means displays the document data containing the concept expression received by the concept expression specifying means, or a part of the document data.

The document processing apparatus according to claim 1, wherein the concept expression extracting means checks whether the document data structure stored in the document data structure storage means matches the concept expression (hereinafter, the designated concept expression) received by the concept expression specifying means, extracts from that document data structure a word having a dependency relationship with a word in the designated concept expression, extracts a semantic tag that is not in the designated concept expression, and creates a concept expression including the extracted word and semantic tag.