JP2004206468A

JP2004206468A - Document management system and document management program

Info

Publication number: JP2004206468A
Application number: JP2002375373A
Authority: JP
Inventors: Mayumi Nishimura; 真由美西村; Yuko Ogasawara; 優子小笠原
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-12-25
Filing date: 2002-12-25
Publication date: 2004-07-22

Abstract

PROBLEM TO BE SOLVED: To provide a document management system that reduces the labor of entering document type information and classifies documents flexibly in accordance with the trend of the stored document information.

SOLUTION: A keyword extracting part 20 extracts keywords from new document information that has been converted into text data by an OCR 2. The extracted keywords of existing document information are read from a document information DB in a storage 7 and compared with those of the new document information, and the document type information of the existing document information with the largest number of matching keywords is set as the document type of the new document information.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、電子化された文書情報を整理保存するための文書管理システム及び文書管理プログラムに関する。
【０００２】
【従来の技術】
近年の社会の情報化の進展に伴い、様々な文書が多量にやり取りされるようになってきており、こうした様々な文書を効率的に整理保存するための文書管理システムが求められている。文書管理システムでは、ネットワークシステムを通して電子化された文書情報を蓄積したり、紙出力された文書をＯＣＲ処理して電子化された文書情報として蓄積しており、電子情報として文書情報を蓄積することで多量の文書を保存することが可能となっている。しかしながら、蓄積された多量の文書情報からいかに効率的に必要な文書情報を抽出するかが問題となる。
【０００３】
蓄積された文書情報には、文書名、登録日、作成者等の書誌情報が付加されるのが一般的であるが、こうした書誌情報だけでは情報のばらつきが大きく、効率よく検索できない。そこで、予め文書情報を分類する文書タイプを決めておき、それに従って文書情報を分類しておき、検索する際にはその分類を基に文書情報を検索すると、より効率的に絞り込めることができる。しかしながら、文書情報を蓄積する際に文書タイプに関する情報をいちいち入力することは手間がかかり、また、入力ミスの可能性もある。
【０００４】
そのため、蓄積された文書情報を自動的に分類する方法が提案されている。例えば、特許文献１には、電子文書が持つ文字属性及び画像属性を分析して文書プロフィールを構築し、その文書プロフィールに基づいて既存のディレクトリの中に分類されて格納される点が記載されている。また、特許文献２には、文書をメディア変換により書式付き文書に変換して、書式付き文書より内容的特徴及び体裁的特徴の抽出を行い、これらの特徴から各カテゴリの内容的特徴及び体裁的特徴を決定し、分類対象の文書から抽出した内容的特徴及び体裁的特徴と比較することで分類する点が記載されている。また、特許文献３には、あらかじめ１以上のキーワードを登録しておき、対象となる文書を単位記事に分割して単位記事中に登録されたキーワードが含まれるか照合し、キーワードが含まれる記事をキーワード単位に設けられた所定の記事格納領域に整理保存する文書処理装置が記載されている。
【０００５】
【特許文献１】
特開２０００−１１２９９３号公報
【特許文献２】
特開２０００−２６８０４０号公報
【特許文献３】
特開２００１−１０９７７２号公報
【０００６】
【発明が解決しようとする課題】
上述したような、自動的に文書情報を分類する方法は、人手による入力が不要になる点でメリットはあるものの、分類を特定するための情報について予め決めておき、その特定情報により文書情報を分類しているため、最初に特定情報がどのように決められるかによって以後の文書情報の検索が効率よく行われるかが決まってくる。しかしながら、文書情報の検索ニーズは、時間の経過と共に変化していくものであり、それを最初から分類に反映させることは困難であるため、分類を見直していくことが必要となるが、上述した従来技術ではこうした問題点への対応が十分なされているとはいえない。
【０００７】
そこで、本発明は、文書タイプ情報の入力作業の省力化を図ると共に、蓄積される文書情報の傾向に合せて柔軟に分類することが可能な文書管理システムを提供することを目的とするものである。
【０００８】
【課題を解決するための手段】
本発明に係る文書管理システムは、入力された文書情報より文書特定情報を抽出する抽出手段と、抽出された文書特定情報を記憶する特定情報記憶手段と、記憶された既存文書情報から生成された文書タイプを判別するための判別情報を記憶する判別情報記憶手段と、新たに入力された新文書情報の文書特定情報と判別情報記憶手段から読み出した判別情報とに基づき新文書情報の文書タイプ情報を設定する文書タイプ判別手段とを備えていることを特徴とする。さらに、前記抽出手段は、文書特定情報として文書情報よりキーワードを抽出し、前記判別情報記憶手段は、判別情報として既存文書情報毎の抽出キーワードを記憶しており、前記文書タイプ判別手段は、新文書情報の抽出キーワードと既存文書情報毎の抽出キーワードとを比較して文書タイプ情報を設定することを特徴とする。さらに、前記抽出手段は、文書特定情報として文書情報より重要文データを抽出し、前記判別情報記憶手段は、判別情報として既存文書情報から生成された文書タイプ毎の文書タイプ特定情報を記憶しており、前記文書タイプ判別手段は、新文書情報の重要文データと文書タイプ毎の文書タイプ特定情報とを比較して文書タイプ情報を設定することを特徴とする。さらに、前記抽出手段は、文書特定情報として文書情報よりキーワードを抽出し、前記判別情報記憶手段は、判別情報として既存文書情報から生成された文書タイプ毎の単語帳データとして記憶しており、前記文書タイプ判別手段は、新文書情報の抽出キーワードと文書タイプ毎の単語帳データとを比較して文書タイプ情報を設定することを特徴とする。さらに、前記判別情報記憶手段は、複数種類の判別情報を記憶しており、前記文書タイプ判別手段が文書タイプを判別する際に用いる判別情報の優先順位を設定する手段を備えていることを特徴とする。さらに、新文書情報に対して、前記文書タイプ判別手段により設定された文書タイプ情報に基づき同一タイプの既存文書情報に関する付加情報と同一の付加情報を設定する手段を備えていることを特徴とする。
【０００９】
また、本発明に係る文書管理プログラムは、文書情報を管理するためにコンピュータを、新た入力された新文書情報からキーワードを抽出する手段、記憶された既存文書情報からキーワードを読み出す手段、新文書情報の抽出キーワードと既存文書情報のキーワードとを比較する手段、及び比較結果に基づいて既存文書の文書タイプ情報を新文書情報の文書タイプ情報として設定する手段、として機能させることを特徴とする。
【００１０】
また、別の文書管理プログラムは、文書情報を管理するためにコンピュータを、新た入力された新文書情報からキーワード及び重要文データを抽出する手段、記憶された既存文書情報からキーワードを読み出す手段、新文書情報の抽出キーワードと既存文書情報のキーワードとを比較する手段、記憶された文書タイプ毎の文書タイプ特定情報を読み出す手段、前記重要文データと前記文書タイプ特定情報とを比較する手段、及び比較結果に基づいて既存文書の文書タイプ情報を新文書情報の文書タイプ情報として設定する手段、として機能させることを特徴とする。
【００１１】
また、別の文書管理プログラムは、文書情報を管理するためにコンピュータを、新た入力された新文書情報からキーワード及び重要文データを抽出する手段、記憶された既存文書情報からキーワードを読み出す手段、新文書情報の抽出キーワードと既存文書情報のキーワードとを比較する手段、記憶された文書タイプ毎の文書タイプ特定情報を読み出す手段、前記重要文データと前記文書タイプ特定情報とを比較する手段、記憶された文書タイプ毎の単語帳データを読み出す手段、新文書情報の抽出キーワードと文書タイプ毎の単語帳データとを比較する手段、及び比較結果に基づいて既存文書の文書タイプ情報を新文書情報の文書タイプ情報として設定する手段、として機能させることを特徴とする。さらに、前記文書管理プログラムは、既存文書の付加情報を新文書情報の付加情報として文書タイプ情報とともに設定する。
【００１２】
また、本発明に係る記録媒体は、前記文書管理プログラムを記録したコンピュータ読取可能な記録媒体からなる。
【００１３】
上記のような構成を有することで、既存文書情報がどのように文書タイプ情報を設定されているかによって新文書情報の文書タイプ情報が自動的に決められていくため、入力作業の省力化が図られるとともに、次々と記憶される既存文書情報から生成される判別情報を基に文書タイプ情報が設定されるため、常に最新の傾向に合せて文書タイプ情報が設定されることになる。そして、既存の文書タイプに入らない文書情報が入力されるときには、人手により文書タイプを設定しておけば、以後の文書情報に対しては、既存文書情報の文書タイプ情報として記憶され、以後の文書情報に対しては自動的にその文書タイプ情報が設定されるようになるから、文書タイプ情報の設定をより柔軟性を持たせて行うことができる。
【００１４】
そして、新文書情報からキーワードを抽出し、判別情報として既存文書情報の抽出キーワードを用いることで、新文書情報と同一のキーワードを含む既存文書情報の文書タイプ情報を文書タイプ判別の際に用いることができる。
【００１５】
また、新文書情報から重要文データを抽出し、判別情報として、判別情報として既存文書情報から生成された文書タイプ毎の文書タイプ特定情報を用いることで、文書タイプの判別をより効率的に正確に行うことができる。
【００１６】
また、新文書情報からキーワードを抽出し、判別情報として既存文書情報から生成された文書タイプ毎の単語帳データを用いることで、より効率的に正確に行うことができる。また、以上のような判別情報により文書判別を行う際にどの判別情報から順番に用いていくか優先順位を設定するようにすれば、文書タイプ情報の設定を利用者のニーズに合せて行うことができる。
【００１７】
また、新文書情報の同一タイプの既存文書情報に関する付加情報は、新文書情報の付加情報として採用できる場合が多いことから、そのまま新文書情報の付加情報として設定すれば、いちいち人手による入力作業を省力化することができる。
【００１８】
【発明の実施の形態】
以下、本発明に係る実施形態について詳しく説明する。図１は、本発明に係る文書管理システムについてその構成の概要を示したものである。システム本体１は、インターフェース１１を介してネットワークに接続されて、文書を作成する端末と文書情報等の情報の送受信を行う。また、インターフェース１２を介してＯＣＲ２及びプリンタ３等の周辺機器と接続されている。ＯＣＲ２からは文書からテキストデータが生成されてシステム本体１に送信される。
【００１９】
システム本体１は、必要なデータを入力するための入力装置５、情報を表示するディスプレイ６、文書情報等の情報を記憶する記憶装置７、情報処理を行う情報処理装置８、情報処理装置８で処理される情報を一時記憶するＲＡＭ９及び本発明に係る文書管理システムとして機能させるためのプログラム等を記憶するＲＯＭ１０を備えている。ＲＯＭ１０に記憶されたプログラムは、ＣＤ−ＲＯＭ、ＦＤ、ＭＯ等のコンピュータで読取可能な周知の記録媒体に記録されて必要に応じてコンピュータにインストールされる。
【００２０】
例えば、ＯＣＲ２において文書からテキストデータを生成して新しい文書情報がシステム本体１に送信されてくると、情報処理装置８は、いったんＲＡＭ９にその新文書情報を記憶し、キーワード抽出部２０において公知の方法により新文書情報からキーワード抽出が行われる。記憶装置７内には、既存文書に関する文書情報が記憶されており、これらの既存文書情報からキーワード抽出部２０においてキーワードを抽出して、キーワード比較部２２において新文書情報のキーワードと既存文書情報のキーワードが比較される。両者のキーワードの比較は、図２に示すように、記憶された既存文書情報１、２、・・・から抽出された「ネットワーク」、「技術」、・・・といったキーワードと新文書情報から抽出された「ネットワーク」、「方法」、・・・といったキーワードを比較する。そして、両者のキーワードの一致する数を文書タイプ判別部２５でチェックしていき、一致する数の多い既存文書情報の文書タイプ情報を読み出してその文書タイプ情報を新文書情報の文書タイプ情報として設定する。この場合一致する数の基準をどの程度のレベルにするかは、記憶される文書情報の類似性により適宜設定すればよい。例えば、類似性が高い場合は一致する数を多くして一致する既存文書の数を小さくすればよく、類似性が低い場合には、一致する数を小さくして一致する既存文書の数がある程度ヒットするようにする。
【００２１】
こうして新文書情報とキーワードが一致する既存文書情報を検索して文書タイプ判別部２５でこれらの文書タイプを読み出したときにすべて１つの文書タイプである場合にはそのまま新文書情報の文書タイプとして設定できるが、２以上の文書タイプが読み出された場合には１つの文書タイプに絞り込むための重み付けを行う必要がある。重み付けを行う手段としては、以下のようなものが挙げられる。
【００２２】
まず、重要文データに基づいて行うやり方がある。重要文データとは、文書構造からみて、文書タイプの特徴が顕著に表れる部分である、「見出し」や「固有単語・フレーズ」、主張文や疑問文等に相当する文を意味し、重要文データを解析することで文書タイプの絞込みを行う。具体的には、ＯＣＲ２で生成されたテキストデータから重要文抽出部２１で新文書情報に関する重要文データを抽出する。図３にその抽出例を示す。ここでは、「ネットワーク試験のご案内」に関する新文書情報が例として挙げられているが、「見出し」に関連する「ネットワーク試験のご案内」、「日時・・・」、「場所・・・」が重要文として抽出される。
【００２３】
そして、この例の場合には、文書タイプとして「試験案内」、「新聞記事」、「特許」が既に設定されており、それぞれの文書タイプに関して、「特定の見出し」、「固有単語・フレーズ」、「準固有単語・フレーズ」等の項目についてその特有の表現を示すデータが設定されており、主張文、疑問文、推定文の使用頻度からその文書タイプに一致する可能性に関するデータ（図３では３段階で示されている）が設定されている。こうした文書タイプを特定する情報は、記憶部７内の「文書構造ＤＢ」、「特定見出しＤＢ」及び「固有単語・フレーズＤＢ」に予め記憶しておく。こうした文書特定情報を基に、例えば、「試験」という固有単語があれば、「試験案内」の文書タイプの可能性が高いと判断し、「請求項」という見出しがあれば、「特許」という文書タイプの可能性が高いと判断できる。これらのＤＢに記憶される文書タイプ特定情報は、既存文書情報に基づいて予め設定してもよいし、既存文書情報が蓄積されるに従い適宜見直してもよい。
【００２４】
新文書情報から抽出された重要文データは、文書構造ＤＢに記憶された情報によりその文書構造が解析され、固有単語・フレーズとして「試験」が、準固有単語・フレーズとして「日時」「場所」が一致しており、主張文等がないことから、「試験案内」の文書タイプの可能性が高くなる。
【００２５】
次に、単語帳データに基づいて行うやり方がある。単語帳は、同一文書タイプの既存文書情報から抽出したキーワードを基にその文書タイプの特徴を表すキーワードを予め収集して生成する。生成された単語帳は、単語帳ＤＢ３４に記憶される。なお、必要であれば、人手により単語帳にキーワードを入力装置５から入力することもできる。図４に単語帳データの一例を示す。この例では、「特許」の文書タイプに関する単語帳として会社名が記憶されており、「新聞記事」の文書タイプに関する単語帳として新聞社名が記憶されている。
【００２６】
そして、新文書情報からの抽出キーワードと一致するキーワードが存在する単語帳を単語帳検索部２４において検索処理する。図４では、抽出キーワードである「理工」について単語帳を検索して同一のキーワードが存在する「特許」の文書タイプの単語帳がヒットする。したがって、「特許」の文書タイプの可能性が高くなる。
【００２７】
上述した「重要文」、「単語帳データ」以外にも新文書情報の「文書名」やその文書に関して作成された「メモ」を用いることもできる。「文書名」や「メモ」には、その文書の特徴が顕出すると考えられるからである。新文書情報の「文書名」やその文書に関して作成された「メモ」から抽出されたキーワードについて既存文書情報との一致をキーワード比較部２２で比較して文書タイプを判別する。
【００２８】
そして、「重要文」、「単語帳データ」、「文書名」及び「メモ」による重み付けをどうゆう順番で行うかは、どのような種類の文書情報を蓄積するかで適宜設定すればよい。
【００２９】
図５に、以上の処理フローの一例を示す。新文書情報として、紙原稿がＯＣＲにより読み込まれて（Ｓ１００）ＯＣＲ処理され（Ｓ１０１）、テキストデータが生成される。そして、公知の方法によりキーワードが抽出され（Ｓ１０２）、既存文書情報の抽出キーワードと比較されて同一キーワードの有無がチェックされる（Ｓ１０３）。比較した結果、同一キーワードが存在する既存文書情報がある場合には、その文書タイプ情報をチェックする（Ｓ１０４）。そして、２種類以上の文書タイプがある場合には、予め設定された重み付けの順番に従い、次の重み付け処理を行って（Ｓ１０５）、再度文書タイプ情報をチェックする。
【００３０】
こうして重み付け処理を行って、文書タイプ情報が１つに絞り込まれたら、その文書タイプ情報を新文書情報の文書タイプとして付加する（Ｓ１０６）。また、既存文書情報に付加されている文書名、単語帳及びメモ等の付加情報のうち必要なものを新文書情報として付加する（Ｓ１０７）。また、既存文書情報に対するアクセス権が設定されているか否かチェックし（Ｓ１０８）、設定されている場合には同一のアクセス権を付加する（Ｓ１０９）。さらに、既存文書情報に関連する関連文書が記憶されているか否かチェックし（Ｓ１１０）、記憶されている関連文書と同一の関連文書を付加する（Ｓ１１１）。
【００３１】
Ｓ１０３において同一のキーワードが存在しない場合には、その旨の表示処理を行い（Ｓ１１２）、人手により文書タイプ情報を入力する（Ｓ１１３）。
【００３２】
したがって、既存文書情報が少ない初期の段階には、人手による入力が必要となるが、既存文書情報が蓄積されるに従いシステムで文書タイプ情報が自動的に設定されるようになる。また、蓄積された既存文書情報に従い設定されるので、蓄積される文書情報の傾向が常に反映されて文書タイプ情報が設定される。
【００３３】
【発明の効果】
以上に説明したとおり、本発明は、既存文書情報がどのように文書タイプ情報を設定されているかによって新文書情報の文書タイプ情報が自動的に決められていくため、入力作業の省力化が図られるとともに、次々と記憶される既存文書情報から生成される判別情報を基に文書タイプ情報が設定されるため、常に最新の傾向に合せて文書タイプ情報が設定されることになる。そして、既存の文書タイプに入らない文書情報が入力されるときには、人手により文書タイプを設定しておけば、以後の文書情報に対しては、既存文書情報の文書タイプ情報として記憶され、以後の文書情報に対しては自動的にその文書タイプ情報が設定されるようになるから、文書タイプ情報の設定をより柔軟性を持たせて行うことができる。
【００３４】
そして、新文書情報からキーワードを抽出し、判別情報として既存文書情報の抽出キーワードを用いることで、新文書情報と同一のキーワードを含む既存文書情報の文書タイプ情報を文書タイプ判別の際に用いることができる。
【００３５】
また、新文書情報から重要文データを抽出し、判別情報として、判別情報として既存文書情報から生成された文書タイプ毎の文書タイプ特定情報を用いることで、文書タイプの判別をより効率的に正確に行うことができる。
【００３６】
また、新文書情報からキーワードを抽出し、判別情報として既存文書情報から生成された文書タイプ毎の単語帳データを用いることで、より効率的に正確に行うことができる。また、以上のような判別情報により文書判別を行う際にどの判別情報から順番に用いていくか優先順位を設定するようにすれば、文書タイプ情報の設定を利用者のニーズに合せて行うことができる。
【００３７】
また、新文書情報の同一タイプの既存文書情報に関する付加情報は、新文書情報の付加情報として採用できる場合が多いことから、そのまま新文書情報の付加情報として設定すれば、いちいち人手による入力作業を省力化することができる。
【図面の簡単な説明】
【図１】本発明に係る実施形態の装置構成のブロック図である。
【図２】本発明に係る実施形態におけるキーワード処理に関する説明図である。
【図３】本発明に係る実施形態における重要文処理に関する説明図である。
【図４】本発明に係る実施形態における単語帳処理に関する説明図である。
【図５】本発明に係る実施形態の処理フロー図である。
【符号の説明】
１・・・システム本体、２・・・ＯＣＲ、３・・・プリンタ、４・・・端末、５・・・入力装置、６・・・ディスプレイ、７・・・記憶装置、８・・・情報処理装置、９・・・ＲＡＭ、10・・・ＲＯＭ、11・・・インターフェース、12・・・インターフェース。
[0001]
[Technical field of the invention]
The present invention relates to a document management system and a document management program for organizing and storing digitized document information.
[0002]
[Prior art]
In recent years, as society has become increasingly computerized, a wide variety of documents have come to be exchanged in large volumes, and there is demand for document management systems that organize and store these documents efficiently. A document management system accumulates digitized document information received over a network, or document information obtained by applying OCR processing to documents output on paper. Storing document information in electronic form makes it possible to keep a large number of documents. The problem, however, is how to efficiently retrieve the necessary document information from the large volume that has been accumulated.
[0003]
Generally, bibliographic information such as a document name, a registration date, and a creator is added to stored document information. However, such bibliographic information alone varies too widely to support efficient searching. If document types for classifying document information are determined in advance, the document information is classified accordingly, and searches are then performed on the basis of that classification, the results can be narrowed down more efficiently. However, entering document type information every time document information is stored is laborious, and input errors may occur.
[0004]
Methods of automatically classifying stored document information have therefore been proposed. For example, Patent Document 1 describes analyzing the character attributes and image attributes of an electronic document to construct a document profile, and classifying and storing the document in an existing directory on the basis of that profile. Patent Document 2 describes converting documents into formatted documents by media conversion, extracting content features and format features from the formatted documents, determining the content features and format features of each category from them, and classifying a target document by comparing those category features with the content features and format features extracted from the target document. Patent Document 3 describes a document processing apparatus in which one or more keywords are registered in advance, a target document is divided into unit articles, each unit article is checked for the registered keywords, and articles containing a keyword are organized and stored in a predetermined article storage area provided for each keyword.
[0005]
[Patent Document 1]
JP 2000-112993 A
[Patent Document 2]
JP 2000-268040 A
[Patent Document 3]
JP 2001-109772 A
[0006]
[Problems to be solved by the invention]
The methods of automatically classifying document information described above have the advantage that manual input is not required. However, the information used to specify the classification is determined in advance, and document information is classified according to that information, so the way the specifying information is initially defined determines whether later searches for document information can be performed efficiently. Search needs for document information change over time, and it is difficult to reflect them in the classification from the outset, so the classification must be reviewed as time goes on. The prior art described above cannot be said to address this problem sufficiently.
[0007]
Accordingly, an object of the present invention is to provide a document management system that reduces the labor of entering document type information and that can classify documents flexibly in accordance with the tendency of the stored document information.
[0008]
[Means for Solving the Problems]
A document management system according to the present invention comprises: extracting means for extracting document specifying information from input document information; specifying information storage means for storing the extracted document specifying information; discrimination information storage means for storing discrimination information, generated from stored existing document information, for discriminating a document type; and document type discriminating means for setting document type information of newly input new document information based on the document specifying information of the new document information and the discrimination information read from the discrimination information storage means. Further, the extracting means extracts keywords from the document information as the document specifying information, the discrimination information storage means stores the extracted keywords of each item of existing document information as the discrimination information, and the document type discriminating means sets the document type information by comparing the extracted keywords of the new document information with the extracted keywords of each item of existing document information. Further, the extracting means extracts important sentence data from the document information as the document specifying information, the discrimination information storage means stores, as the discrimination information, document type specifying information for each document type generated from existing document information, and the document type discriminating means sets the document type information by comparing the important sentence data of the new document information with the document type specifying information for each document type. Further, the extracting means extracts keywords from the document information as the document specifying information, the discrimination information storage means stores, as the discrimination information, word book data for each document type generated from existing document information, and the document type discriminating means sets the document type information by comparing the extracted keywords of the new document information with the word book data for each document type. Further, the discrimination information storage means stores a plurality of kinds of discrimination information, and the system comprises means for setting the priority order of the discrimination information that the document type discriminating means uses when discriminating a document type. Further, the system comprises means for setting, for the new document information, the same additional information as the additional information of existing document information of the same type, based on the document type information set by the document type discriminating means.
[0009]
A document management program according to the present invention causes a computer, in order to manage document information, to function as: means for extracting keywords from newly input new document information; means for reading keywords from stored existing document information; means for comparing the extracted keywords of the new document information with the keywords of the existing document information; and means for setting the document type information of an existing document as the document type information of the new document information based on the comparison result.
[0010]
Another document management program causes a computer, in order to manage document information, to function as: means for extracting keywords and important sentence data from newly input new document information; means for reading keywords from stored existing document information; means for comparing the extracted keywords of the new document information with the keywords of the existing document information; means for reading stored document type specifying information for each document type; means for comparing the important sentence data with the document type specifying information; and means for setting the document type information of an existing document as the document type information of the new document information based on the comparison results.
[0011]
Yet another document management program causes a computer, in order to manage document information, to function as: means for extracting keywords and important sentence data from newly input new document information; means for reading keywords from stored existing document information; means for comparing the extracted keywords of the new document information with the keywords of the existing document information; means for reading stored document type specifying information for each document type; means for comparing the important sentence data with the document type specifying information; means for reading stored word book data for each document type; means for comparing the extracted keywords of the new document information with the word book data for each document type; and means for setting the document type information of an existing document as the document type information of the new document information based on the comparison results. Further, the document management program sets the additional information of the existing document as the additional information of the new document information together with the document type information.
[0012]
Further, a recording medium according to the present invention comprises a computer-readable recording medium recording the document management program.
[0013]
With the above configuration, the document type information of new document information is determined automatically according to how document type information has been set for the existing document information, which reduces input work. In addition, because the document type information is set on the basis of discrimination information generated from existing document information that is stored continually, it is always set in line with the latest tendency. When document information that does not fit any existing document type is input, the document type can be set manually; it is then stored as document type information of existing document information and is set automatically for subsequent document information, so document type information can be set with greater flexibility.
[0014]
By extracting keywords from the new document information and using the extracted keywords of the existing document information as the discrimination information, the document type information of existing document information that contains the same keywords as the new document information can be used in discriminating the document type.
[0015]
By extracting important sentence data from the new document information and using, as the discrimination information, document type specifying information for each document type generated from the existing document information, the document type can be discriminated more efficiently and accurately.
[0016]
By extracting keywords from the new document information and using, as the discrimination information, word book data for each document type generated from the existing document information, the discrimination can be performed more efficiently and accurately. Furthermore, if a priority order is set that determines which discrimination information is used first when discriminating documents, the document type information can be set in line with the needs of the user.
[0017]
The additional information of existing document information of the same type as the new document information can often be adopted as the additional information of the new document information. Setting it as is for the new document information therefore saves the labor of entering it manually each time.
[0018]
[Embodiments of the invention]
Hereinafter, embodiments according to the present invention will be described in detail. FIG. 1 shows an outline of the configuration of a document management system according to the present invention. The system main body 1 is connected to a network via an interface 11, and exchanges information such as document information with terminals on which documents are created. It is also connected to peripheral devices such as the OCR 2 and the printer 3 via the interface 12. The OCR 2 generates text data from a document and transmits it to the system main body 1.
[0019]
The system main body 1 includes an input device 5 for inputting necessary data, a display 6 for displaying information, a storage device 7 for storing information such as document information, an information processing device 8 for performing information processing, a RAM 9 for temporarily storing information processed by the information processing device 8, and a ROM 10 for storing, among other things, a program that causes the system to function as the document management system according to the present invention. The program stored in the ROM 10 is recorded on a well-known computer-readable recording medium such as a CD-ROM, an FD, or an MO, and is installed in the computer as needed.
[0020]
For example, when the OCR 2 generates text data from a document and the new document information is transmitted to the system main body 1, the information processing device 8 temporarily stores the new document information in the RAM 9, and the keyword extraction unit 20 extracts keywords from the new document information by a known method. Document information on existing documents is stored in the storage device 7. The keyword extraction unit 20 extracts keywords from this existing document information, and the keyword comparison unit 22 compares the keywords of the new document information with those of the existing document information. As shown in FIG. 2, keywords such as "network" and "technology" extracted from the stored existing document information 1, 2, and so on are compared with keywords such as "network" and "method" extracted from the new document information. The document type discriminating unit 25 then checks the number of keywords that match, reads out the document type information of the existing document information with a large number of matches, and sets it as the document type information of the new document information. The number of matches required may be set as appropriate according to the similarity of the stored document information. For example, if the similarity is high, the required number of matches may be raised so that fewer existing documents match; if the similarity is low, it may be lowered so that a reasonable number of existing documents are hit.
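The keyword comparison described in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout, the function name `matching_types`, and the `min_matches` parameter (standing in for the adjustable match-count threshold) are assumptions introduced here.

```python
def matching_types(new_keywords, existing_docs, min_matches=2):
    """Return {document type: best match count} for existing documents
    that share at least `min_matches` keywords with the new document.

    `existing_docs` is a hypothetical list of dicts with "type" and
    "keywords" entries; the patent does not specify a storage format.
    """
    new_set = set(new_keywords)
    best = {}
    for doc in existing_docs:
        shared = len(new_set & set(doc["keywords"]))
        if shared >= min_matches:
            best[doc["type"]] = max(best.get(doc["type"], 0), shared)
    return best


existing = [
    {"id": 1, "type": "patent", "keywords": ["network", "technology", "claim"]},
    {"id": 2, "type": "newspaper article", "keywords": ["network", "market"]},
    {"id": 3, "type": "patent", "keywords": ["network", "method", "claim"]},
]

candidates = matching_types(["network", "method", "claim"], existing, min_matches=2)
print(candidates)  # {'patent': 3}
```

Raising `min_matches` corresponds to the high-similarity case in the text (fewer existing documents are hit), and lowering it to the low-similarity case.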
[0021]
In this way, existing document information whose keywords match those of the new document information is retrieved, and the document type discriminating unit 25 reads out the document types of the retrieved documents. If they all share a single document type, that type can be set directly as the document type of the new document information. If two or more document types are read out, weighting must be performed to narrow them down to one. The following are examples of means for performing this weighting.
[0022]
First, there is a method based on important sentence data. Important sentence data means sentences corresponding to headings, unique words and phrases, assertive sentences, interrogative sentences, and the like, that is, the parts of the document structure in which the characteristics of a document type appear prominently. Analyzing the important sentence data narrows down the document type. Specifically, the important sentence extracting unit 21 extracts the important sentence data of the new document information from the text data generated by the OCR 2. FIG. 3 shows an example of the extraction. Here, new document information titled "Notice of Network Test" is used as an example, and "Notice of Network Test", "Date and time ...", and "Place ...", which relate to headings, are extracted as important sentences.
[0023]
In this example, "test guide", "newspaper article", and "patent" have already been set as document types. For each document type, data indicating its characteristic expressions is set for items such as "specific heading", "unique word/phrase", and "quasi-unique word/phrase", and data on the likelihood of matching the document type (shown in three levels in FIG. 3) is set according to the frequency with which assertive, interrogative, and conjectural sentences are used. This information for specifying document types is stored in advance in the "document structure DB", the "specific heading DB", and the "unique word/phrase DB" in the storage device 7. Based on this document type specifying information, for example, if the unique word "test" is present, the document is judged likely to be of the "test guide" type, and if the heading "claims" is present, it is judged likely to be of the "patent" type. The document type specifying information stored in these DBs may be set in advance based on the existing document information, or may be reviewed as appropriate as existing document information accumulates.
[0024]
The document structure of the important sentence data extracted from the new document information is analyzed using the information stored in the document structure DB. In this example, "test" matches as a unique word/phrase, "date and time" and "place" match as quasi-unique words/phrases, and there are no assertive sentences or the like, so the "test guide" document type becomes more likely.
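As a rough illustration of this important-sentence weighting, the sketch below scores each document type by how many of its characteristic terms occur in the extracted important sentences. The profile contents, the numeric weights, and the function name `score_types` are all hypothetical; the patent only states that such per-type data is stored in the DBs, and the three-level sentence-frequency likelihood from FIG. 3 is omitted here.

```python
# Hypothetical per-type profiles standing in for the "specific heading DB"
# and "unique word/phrase DB"; the actual DB contents are not given.
TYPE_PROFILES = {
    "test guide": {
        "headings": {"notice"},
        "unique": {"test"},
        "semi_unique": {"date and time", "place"},
    },
    "patent": {
        "headings": {"claims"},
        "unique": {"invention"},
        "semi_unique": {"embodiment"},
    },
    "newspaper article": {
        "headings": {"editorial"},
        "unique": {"reporter"},
        "semi_unique": {"yesterday"},
    },
}


def score_types(important_sentences, profiles, weights=(3, 2, 1)):
    """Score each document type by counting profile terms that occur in
    the important sentences (simple case-insensitive substring match).
    The weights for heading / unique / quasi-unique hits are assumptions."""
    text = " ".join(s.lower() for s in important_sentences)
    scores = {}
    for doc_type, profile in profiles.items():
        heading_hits = sum(term in text for term in profile["headings"])
        unique_hits = sum(term in text for term in profile["unique"])
        semi_hits = sum(term in text for term in profile["semi_unique"])
        scores[doc_type] = (weights[0] * heading_hits
                            + weights[1] * unique_hits
                            + weights[2] * semi_hits)
    return scores


sentences = ["Notice of Network Test", "Date and time: 10:00", "Place: Room A"]
scores = score_types(sentences, TYPE_PROFILES)
best_type = max(scores, key=scores.get)
print(best_type)  # test guide
```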
[0025]
Next, there is a method based on word book data. A word book is generated in advance by collecting keywords that represent the characteristics of a document type, based on the keywords extracted from existing document information of that document type. The generated word books are stored in the word book DB 34. If necessary, keywords can also be entered into a word book manually from the input device 5. FIG. 4 shows an example of word book data. In this example, company names are stored in the word book for the "patent" document type, and newspaper company names are stored in the word book for the "newspaper article" document type.
[0026]
The word book search unit 24 then searches for word books that contain a keyword matching a keyword extracted from the new document information. In FIG. 4, the word books are searched for the extracted keyword "Riko", and the word book of the "patent" document type, which contains the same keyword, is hit. The "patent" document type therefore becomes more likely.
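The word book generation and lookup can be sketched as below. The rule used to decide which keywords enter a word book (a keyword must appear in at least `min_docs` existing documents of that type) is an assumption made for this example; the patent only says that characteristic keywords are collected per document type. The sample names other than "Riko" are invented.

```python
from collections import Counter, defaultdict


def build_word_books(existing_docs, min_docs=2):
    """Build one word book (a set of keywords) per document type.
    Assumed rule: keep keywords found in >= `min_docs` documents of the type."""
    counts = defaultdict(Counter)
    for doc in existing_docs:
        counts[doc["type"]].update(set(doc["keywords"]))
    return {
        doc_type: {word for word, n in counter.items() if n >= min_docs}
        for doc_type, counter in counts.items()
    }


def types_hit_by_word_books(keywords, word_books):
    """Return the document types whose word book contains any extracted keyword."""
    extracted = set(keywords)
    return {doc_type for doc_type, words in word_books.items() if extracted & words}


existing_docs = [
    {"type": "patent", "keywords": ["Riko", "network"]},
    {"type": "patent", "keywords": ["Riko", "printer"]},
    {"type": "newspaper article", "keywords": ["Example Daily News", "network"]},
    {"type": "newspaper article", "keywords": ["Example Daily News", "economy"]},
]

word_books = build_word_books(existing_docs)
hits = types_hit_by_word_books(["Riko", "network"], word_books)
print(hits)  # {'patent'}
```

Keywords entered manually from the input device could simply be added to the relevant set in `word_books`.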
[0027]
In addition to the above-mentioned "important sentence" and "word book data", "document name" of new document information and "memo" created for the document can also be used. This is because the feature of the document is considered to appear in the “document name” and the “memo”. The keyword comparison unit 22 compares the keyword extracted from the “document name” of the new document information and the “memo” created for the document with the existing document information to determine the document type.
[0028]
The order in which weighting is performed by “important sentence”, “word book data”, “document name”, and “memo” may be appropriately set depending on what kind of document information is stored.
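One way to read this configurable ordering is as a list of narrowing steps applied in priority order until a single document type remains. The sketch below assumes each weighting method can be expressed as a function that takes the new document and the current candidate types and returns a subset; the function `narrow_types`, the step signature, and the sample data are illustrative assumptions.

```python
def narrow_types(candidates, new_doc, weighting_steps):
    """Apply weighting steps in priority order until one type remains.

    Each step is a callable (new_doc, candidate_types) -> subset of types.
    A step that would eliminate every candidate is ignored (an assumption;
    the patent does not say how such a case is handled).
    """
    remaining = set(candidates)
    for step in weighting_steps:
        if len(remaining) <= 1:
            break
        narrowed = step(new_doc, remaining)
        if narrowed:
            remaining = set(narrowed)
    return remaining


# Hypothetical precomputed results of two weighting methods for one document.
new_doc = {
    "important_sentence_types": {"patent", "test guide"},
    "word_book_types": {"patent"},
}

by_important_sentence = lambda doc, types: types & doc["important_sentence_types"]
by_word_book = lambda doc, types: types & doc["word_book_types"]

start = {"patent", "test guide", "newspaper article"}
result = narrow_types(start, new_doc, [by_important_sentence, by_word_book])
print(result)  # {'patent'}
```

Reordering the list passed as `weighting_steps` changes which kind of evidence is consulted first, which is the setting the paragraph above leaves to the operator.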
[0029]
FIG. 5 shows an example of the above processing flow. As new document information, a paper document is read by the OCR (S100) and OCR-processed (S101) to generate text data. Keywords are then extracted by a known method (S102) and compared with the extracted keywords of the existing document information to check whether any keywords are the same (S103). If the comparison finds existing document information containing the same keywords, its document type information is checked (S104). If there are two or more document types, the next weighting process is performed in accordance with the preset weighting order (S105), and the document type information is checked again.
[0030]
Once the weighting process has narrowed the document type information down to one, that document type information is added as the document type of the new document information (S106). In addition, the necessary items of the additional information attached to the existing document information, such as the document name, word book, and memo, are added to the new document information (S107). It is then checked whether an access right is set for the existing document information (S108), and if so, the same access right is added (S109). Further, it is checked whether related documents associated with the existing document information are stored (S110), and if so, the same related documents are added (S111).
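Steps S106 to S111 amount to copying selected metadata from a matching existing document onto the new one. A minimal sketch follows, assuming documents are plain dictionaries; the field names and the `copy_fields` parameter are invented for this example.

```python
def apply_type_and_metadata(new_doc, reference_doc, copy_fields=("memo",)):
    """Return a copy of `new_doc` with the type and selected metadata of
    `reference_doc` applied. Field names are hypothetical."""
    result = dict(new_doc)
    result["type"] = reference_doc["type"]                  # S106
    for field in copy_fields:                               # S107
        if field in reference_doc:
            result[field] = reference_doc[field]
    if reference_doc.get("access_rights"):                  # S108, S109
        result["access_rights"] = list(reference_doc["access_rights"])
    if reference_doc.get("related_docs"):                   # S110, S111
        result["related_docs"] = list(reference_doc["related_docs"])
    return result


existing_doc = {
    "type": "patent",
    "memo": "review by legal",
    "access_rights": ["legal-team"],
    "related_docs": ["doc-7"],
}
new_doc = {"name": "scan-001", "keywords": ["claim"]}

registered = apply_type_and_metadata(new_doc, existing_doc)
print(registered["type"], registered["access_rights"])  # patent ['legal-team']
```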
[0031]
If the same keyword does not exist in S103, display processing to that effect is performed (S112), and the document type information is manually input (S113).
[0032]
Therefore, manual input is required in the initial stage, when little existing document information has been stored, but as existing document information accumulates the system comes to set the document type information automatically. Moreover, because the setting follows the accumulated existing document information, the document type information always reflects the tendency of the stored document information.
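The bootstrapping behaviour described here (manual entry at first, automatic assignment once similar documents exist) can be illustrated with a small registration loop. This is a simplified sketch: it picks the type of the single best-matching existing document instead of running the weighting steps, and `ask_user` is a hypothetical callback standing in for manual input at S113.

```python
def register(new_keywords, store, ask_user):
    """Assign a type to a new document and append it to `store`.

    Falls back to `ask_user` when no existing document shares a keyword
    (S112, S113); otherwise reuses the type of the best-matching document.
    """
    new_set = set(new_keywords)
    scored = [(len(new_set & set(doc["keywords"])), doc) for doc in store]
    scored = [(n, doc) for n, doc in scored if n > 0]
    if scored:
        doc_type = max(scored, key=lambda pair: pair[0])[1]["type"]
        source = "auto"
    else:
        doc_type = ask_user(new_keywords)
        source = "manual"
    store.append({"keywords": list(new_keywords), "type": doc_type})
    return doc_type, source


store = []
first = register(["claim", "invention"], store, ask_user=lambda kws: "patent")
second = register(["claim", "embodiment"], store, ask_user=lambda kws: "unknown")
print(first, second)  # ('patent', 'manual') ('patent', 'auto')
```

Because every registered document is appended to `store`, later classifications draw on whatever has accumulated so far, which is how the stored tendency comes to be reflected.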
[0033]
[Effects of the invention]
As described above, according to the present invention, the document type information of new document information is determined automatically according to how document type information has been set for the existing document information, which reduces input work. In addition, because the document type information is set on the basis of discrimination information generated from existing document information that is stored continually, it is always set in line with the latest tendency. When document information that does not fit any existing document type is input, the document type can be set manually; it is then stored as document type information of existing document information and is set automatically for subsequent document information, so document type information can be set with greater flexibility.
[0034]
By extracting a keyword from the new document information and using the extracted keywords of the existing document information as the discrimination information, the document type information of existing document information that contains the same keywords as the new document information can be used for the document type discrimination.
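A minimal sketch of this keyword comparison is given below. The abstract states that the type of the existing document information with the most matching keywords is adopted; the function name `candidate_types` and the handling of ties (returning every tied type so that a later weighting step can narrow them down) are assumptions of this sketch.

```python
from typing import List, Set, Tuple


def candidate_types(new_keywords: Set[str],
                    existing_docs: List[Tuple[Set[str], str]]) -> List[str]:
    """Return the document types whose documents share the most keywords.

    existing_docs holds one (extracted keywords, document type) pair per
    existing document. An empty list means no keyword matched at all.
    """
    best = 0
    candidates: List[str] = []
    for keywords, doc_type in existing_docs:
        hits = len(new_keywords & keywords)
        if hits == 0:
            continue
        if hits > best:
            best, candidates = hits, [doc_type]
        elif hits == best and doc_type not in candidates:
            candidates.append(doc_type)
    return candidates
```

Returning a list rather than a single type keeps the case of several equally good matches visible to the caller.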
[0035]
Also, by extracting important sentence data from the new document information and using, as the discrimination information, the document type identification information generated for each document type from the existing document information, the document type discrimination can be performed more efficiently and accurately.
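The comparison of important sentence data with per-type identification information might look like the following sketch. It assumes, purely for illustration, that the document type identification information is a list of characteristic phrases per document type and that a phrase matches when it occurs inside an important sentence; the function name and data layout are hypothetical.

```python
from typing import Dict, List


def match_important_sentences(sentences: List[str],
                              type_phrases: Dict[str, List[str]]) -> List[str]:
    """Return the document types whose characteristic phrases match best.

    sentences: important sentences extracted from the new document.
    type_phrases: assumed identification information, one phrase list per type.
    """
    scores: Dict[str, int] = {}
    for doc_type, phrases in type_phrases.items():
        # Count the phrases of this type that occur in any important sentence.
        scores[doc_type] = sum(
            1 for phrase in phrases if any(phrase in s for s in sentences)
        )
    best = max(scores.values(), default=0)
    if best == 0:
        return []  # nothing matched
    return [doc_type for doc_type, score in scores.items() if score == best]
```

Because only one phrase list per document type is consulted, the cost grows with the number of document types rather than with the number of stored documents.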
[0036]
Further, by extracting a keyword from the new document information and using, as the discrimination information, the word book data generated for each document type from the existing document information, the discrimination can be performed more efficiently and accurately. In addition, when performing the document type discrimination based on the above discrimination information, setting a priority order that determines which discrimination information is used first makes it possible to set the document type information according to the needs of the user.
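One way a user-set priority order could drive the discrimination is sketched below. The specification does not spell out how the results of the different kinds of discrimination information are combined, so the narrowing rule used here (each discriminator in priority order filters the candidates left by the previous ones, stopping when one remains) is an assumption, as are all names.

```python
from typing import Callable, Dict, List, Optional

# Each hypothetical discriminator maps a document to a list of candidate types.
Discriminator = Callable[[object], List[str]]


def discriminate(doc: object,
                 discriminators: Dict[str, Discriminator],
                 priority: List[str]) -> Optional[str]:
    """Apply discriminators in the given priority order until one type remains.

    Returns None when the candidates cannot be narrowed down to one,
    in which case manual input would be needed.
    """
    candidates: Optional[List[str]] = None
    for name in priority:
        result = discriminators[name](doc)
        if not result:
            continue  # this discrimination information gave no match
        if candidates is None:
            candidates = result
        else:
            narrowed = [t for t in candidates if t in result]
            if narrowed:  # ignore a lower-priority result that conflicts entirely
                candidates = narrowed
        if len(candidates) == 1:
            return candidates[0]
    return None
```

With the same discriminators, changing only the `priority` list can change the resulting document type, which is how the user's needs enter the decision in this sketch.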
[0037]
In addition, the additional information attached to existing document information of the same type as the new document information can often be adopted as the additional information of the new document information; setting it as is saves the labor of manual input.
[Brief description of the drawings]
FIG. 1 is a block diagram of a device configuration according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram relating to keyword processing in the embodiment according to the present invention.
FIG. 3 is an explanatory diagram relating to important sentence processing in the embodiment according to the present invention.
FIG. 4 is an explanatory diagram related to a word book process in the embodiment according to the present invention.
FIG. 5 is a processing flowchart of the embodiment according to the present invention.
[Explanation of symbols]
1 ... System main body, 2 ... OCR, 3 ... Printer, 4 ... Terminal, 5 ... Input device, 6 ... Display, 7 ... Storage device, 8 ... Information processing device, 9 ... RAM, 10 ... ROM, 11 ... Interface, 12 ... Interface.

Claims

A document management system comprising: extraction means for extracting document identification information from input document information; identification information storage means for storing the extracted document identification information; discrimination information storage means for storing discrimination information, generated from stored existing document information, for discriminating a document type; and document type discriminating means for setting document type information of new document information based on the discrimination information read from the discrimination information storage means.

The document management system according to claim 1, wherein the extraction means extracts a keyword from document information as the document identification information, the discrimination information storage means stores an extracted keyword for each piece of existing document information as the discrimination information, and the document type discriminating means sets the document type information by comparing the extracted keyword of the new document information with the extracted keyword for each piece of existing document information.

The document management system according to claim 1, wherein the extraction means extracts important sentence data from document information as the document identification information, the discrimination information storage means stores, as the discrimination information, document type identification information generated for each document type from existing document information, and the document type discriminating means sets the document type information by comparing the important sentence data of the new document information with the document type identification information for each document type.

The document management system according to claim 1, wherein the extraction means extracts a keyword from document information as the document identification information, the discrimination information storage means stores, as the discrimination information, word book data generated for each document type from existing document information, and the document type discriminating means sets the document type information by comparing the extracted keyword of the new document information with the word book data for each document type.

The document management system according to claim 1, wherein the discrimination information storage means stores a plurality of types of discrimination information, the system further comprising means for setting a priority order of the discrimination information used when the document type discriminating means discriminates a document type.

The document management system according to any one of claims 1 to 5, further comprising means for setting, based on the document type information set by the document type discriminating means, the same additional information as the additional information relating to existing document information of the same type.

A document management program for causing a computer that manages document information to function as:
means for extracting a keyword from newly input new document information;
means for reading a keyword from stored existing document information;
means for comparing the extracted keyword of the new document information with the keyword of the existing document information; and
means for setting the document type information of an existing document as the document type information of the new document information based on the comparison result.

A document management program for causing a computer that manages document information to function as:
means for extracting keywords and important sentence data from newly input new document information;
means for reading a keyword from stored existing document information;
means for comparing the extracted keyword of the new document information with the keyword of the existing document information;
means for reading stored document type identification information for each document type;
means for comparing the important sentence data with the document type identification information; and
means for setting the document type information of an existing document as the document type information of the new document information based on the comparison result.

A document management program for causing a computer that manages document information to function as:
means for extracting keywords and important sentence data from newly input new document information;
means for reading a keyword from stored existing document information;
means for comparing the extracted keyword of the new document information with the keyword of the existing document information;
means for reading stored document type identification information for each document type;
means for comparing the important sentence data with the document type identification information;
means for reading stored word book data for each document type;
means for comparing the extracted keyword of the new document information with the word book data for each document type; and
means for setting the document type information of an existing document as the document type information of the new document information based on the comparison result.

The document management program according to claim 7, wherein the additional information of the existing document is set together with the document type information as the additional information of the new document information.

A computer-readable recording medium recording the document management program according to claim 7.