JP3771047B2

JP3771047B2 - Document classification apparatus, document classification method, and computer-readable recording medium storing a program for causing a computer to execute the method

Info

Publication number: JP3771047B2
Application number: JP11441498A
Authority: JP
Inventors: 哲郎長束; 達生宮地; 敦夫嶋田; 一寿武谷; 栄治剣持; 明子中島; 真湖人山崎; 克彦藤田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-04-10
Filing date: 1998-04-10
Publication date: 2006-04-26
Anticipated expiration: 2018-04-10
Also published as: JPH11296550A

Description

【０００１】
【発明の属する技術分野】
この発明は、文書の内容に基づいて文書を分類する文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
従来、文書分類装置として、たとえば、特開平７−３６８９７号公報記載の文書分類装置には、文書を単語を特徴とする文書ベクトルとみなし、クラスタリング手法を用いてこれらの文書ベクトルを群分けし、文書の自動分類をおこなうものが記載されている。
【０００３】
また、通常、文書データは一般的にデータベース化されており、文書内容だけでなく作成日や作成者などの書誌的項目が付加されていたり、また文書内容自体が複数の項目を含んでいる場合が多い。たとえば、特許公報は、「特許請求の範囲」「発明の詳細な説明」といった複数の項目から構成されている。
【０００４】
【発明が解決しようとする課題】
しかしながら、上記従来技術の文書分類装置は、複数の項目を持つ文書データに対して、操作者が分類対象とする項目を任意に指定することができないことから、分類に悪影響を与えるデータが付加されていたり、また、複数の項目を組み合わせることが出来ないことから、分類に有効なデータが不足したりして、精度の高い分類結果を得ることができないという問題があった。
【０００５】
この発明は、上述した従来例による問題点を解消するため、操作者の意図が反映された精度の高い分類をおこなうことができる文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することを目的とする。
【０００６】
【課題を解決するための手段】
上述した課題を解決し、目的を達成するため、この発明に係る文書分類装置は、文書の内容に基づいて文書の分類をおこなう文書分類装置において、一つまたは複数の項目から構成された文書データを入力する入力手段と、前記入力手段により入力された文書データを構成する前記項目を指定する指定手段と、前記指定手段により指定された項目に対応するデータのみの内容となるように前記文書データを変換する変換手段と、前記変換手段により変換された変換データをもちいて文書を分類する分類手段と、を備えたことを特徴とする。
【０００７】
この発明によれば、文書を分類する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類を効率よくおこなうことが可能である。
【０００８】
また、この発明に係る文書分類装置は、文書の内容に基づいて文書の分類をおこなう文書分類装置において、一つまたは複数の項目から構成された文書データを入力する入力手段と、前記入力手段により入力された文書データを構成する前記項目を指定する指定手段と、前記指定手段により指定された項目に対応するデータのみの内容となるように前記文書データを変換する変換手段と、前記変換手段により変換された変換データをもちいて各文書の特徴ベクトルを生成する文書ベクトル生成手段と、前記文書ベクトル生成手段により生成された各文書の特徴ベクトルをもちいて文書を分類する分類手段と、を備えたことを特徴とする。
【０００９】
この発明によれば、文書を分類するための各文書の特徴ベクトルを生成する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能である。
【００１０】
また、この発明に係る文書分類装置は、上記発明において、前記変換手段が、前記文書データを変換する際、前記各項目のデータが分離可能となるように前記項目のデータ間に所定の記号を挿入することを特徴とする。
【００１１】
この発明によれば、各変換データの間に区切りとなる記号を挿入するので、形態素解析等の解析処理をおこなう際に、各項目に対応するデータをそのまま結合させることにより変換データ全体として全く別の意味が構成されることを回避することが可能である。
【００１２】
また、この発明に係る文書分類方法は、文書の内容に基づいて文書の分類をおこなう文書分類方法において、一つまたは複数の項目から構成された文書データを入力する入力工程と、前記入力工程により入力された文書データを構成する前記項目を指定する指定工程と、前記指定工程により指定された項目に対応するデータのみの内容となるように前記文書データを変換する変換工程と、前記変換工程により変換された変換データをもちいて文書を分類する分類工程とを含んだことを特徴とする。
【００１３】
この発明によれば、文書を分類する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が自分が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能である。
【００１４】
また、この発明に係る文書分類方法は、文書の内容に基づいて文書の分類をおこなう文書分類方法において、一つまたは複数の項目から構成された文書データを入力する入力工程と、前記入力工程により入力された文書データを構成する前記項目を指定する指定工程と、前記指定工程により指定された項目に対応するデータのみの内容となるように前記文書データを変換する変換工程と、前記変換工程により変換された変換データをもちいて各文書の特徴ベクトルを生成する文書ベクトル生成工程と、前記文書ベクトル生成工程により生成された各文書の特徴ベクトルをもちいて文書を分類する分類工程と、を含んだことを特徴とする。
【００１５】
この発明によれば、文書を分類するための各文書の特徴ベクトルを生成する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が自分が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能である。
【００１６】
また、この発明に係る文書分類方法は、上記発明において、前記変換工程が、前記文書データを変換する際、前記各項目のデータが分離可能となるように前記項目のデータ間に所定の記号を挿入することを特徴とする。
【００１７】
この発明によれば、各変換データの間に区切りとなる記号を挿入するので、形態素解析等の解析処理の際に、複数の項目のデータを一つのデータとして混同して扱われることを回避できるとともに、各項目ごとの内容データが瞬時に識別することが可能である。
【００１８】
また、この発明に係る記憶媒体は、上記に記載された方法をコンピュータに実行させるプログラムを記録したことで、そのプログラムを機械読み取り可能となり、これによって、上記の動作をコンピュータによって実現することが可能である。
【００１９】
【発明の実施の形態】
以下に添付図面を参照して、この発明に係る文書分類装置、文書分類方法およびその方法をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体の好適な実施の形態を詳細に説明する。
【００２０】
（実施の形態１）
まず、この発明の実施の形態１による文書分類装置を構成する情報処理システム全体のハードウエア構成を説明する。図１は、実施の形態１による文書分類装置を構成する情報処理システム全体のハードウエア構成を示す説明図である。図１において、実施の形態１による文書分類装置を構成する情報処理システムは、サーバー／クライアント方式で構成されている。すなわち、サーバー１０１と複数のクライアント１０２がネットワーク１０３によって接続されている。
【００２１】
クライアント１０２は、分類データの生成、サーバー１０１への指示、分類結果の表示などをおこなう。一方、クライアント１０２からの指示に従い、サーバー１０１は文書（テキスト）分類に関する処理を膨大な数値演算によりおこない、その処理の結果をクライアント１０２へ送る。より具体的には、サーバー１０１においては、テキスト分類処理がおこなわれ、クライアント１０２においては、分類データ生成、処理実行指示、テキスト分類結果表示等がおこなわれる。
【００２２】
また、サーバー１０１とクライアント１０２との間のデータのやりとりはファイル共有という方法をもちいる。すなわち、分類処理にもちいるファイルをサーバー１０１上の共有フォルダに作成することにより両者はデータのやりとりをおこなう。したがって、クライアント１０２からはサーバー１０１の共有フォルダをネットワーク共有して利用することが可能である。
【００２３】
つぎに、サーバー１０１およびクライアント１０２のハードウエア構成について説明する。図２は、実施の形態１による文書分類装置を構成する情報処理システムにおけるサーバー１０１をハードウエア的に示す説明図である。サーバー１０１は、たとえばワークステーション（ＷＳ）等がもちいられる。
【００２４】
図２において、２０１はサーバー１０１全体を制御するＣＰＵを、２０２はブートプログラム等を記憶したＲＯＭを、２０３はＣＰＵ２０１のワークエリアとして使用されるＲＡＭ２０３を、２０４は通信回線２０５を介してネットワーク１０３に接続され、そのネットワーク１０３と内部のインターフェイスを司るインターフェイス（Ｉ／Ｆ）を、２０６はデータを記憶するディスク装置を示している。２００は上記各部を結合させるためのバスを示している。
【００２５】
そのほか、文書情報、画像情報、機能情報等を表示するディスプレイ２０８や、データを入力するためのキーボード２０９およびマウス２１０等が同様に接続されていてもよい。さらに、ディスク装置２０６には、クライアント１０２との間のデータのやりとりをするための共有フォルダ２０７が設けられている。
【００２６】
また、図３は、実施の形態１による文書分類装置を構成する情報処理システムにおけるクライアント１０２をハードウエア的に示す説明図である。クライアント１０２は、たとえばパーソナルコンピュータ（ＰＣ）等がもちいられる。
【００２７】
図３において、３０１はシステム全体を制御するＣＰＵを、３０２はブートプログラム等を記憶したＲＯＭを、３０３はＣＰＵ３０１のワークエリアとして使用されるＲＡＭを、３０４はＣＰＵ３０１の制御にしたがってＨＤ（ハードディスク）３０５に対するデータのリード／ライトを制御するＨＤＤ（ハードディスクドライブ）を、３０５はＨＤＤ３０４の制御で書き込まれたデータを記憶するＨＤを、３０６はＣＰＵ３０１の制御にしたがってＦＤ（フロッピーディスク）３０７に対するデータのリード／ライトを制御するＦＤＤ（フロッピーディスクドライブ）を、３０７はＦＤＤ３０６の制御で書き込まれたデータを記憶する着脱自在のＦＤを、３０８はドキュメント、画像、機能情報等を表示するディスプレイをそれぞれ示している。
【００２８】
また、３０９は通信回線３１０を介してネットワーク１０３に接続され、そのネットワーク１０３と内部のインターフェイスを司るインターフェイス（Ｉ／Ｆ）を、３１１は文字、数値、各種指示等の入力のためのキーを備えたキーボードを、３１２はカーソルの移動や範囲選択、あるいは表示画面に表示されたアイコンやボタンの押下やウインドウの移動やサイズの変更等をおこなうマウスを、３１３はＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅａｄｅｒ）機能を備えた画像を光学的に読み取るスキャナを、３１４は分類結果を含むデータの内容等を印刷するプリンタを、３１５は上記各部を結合するためのバスをそれぞれ示している。
【００２９】
つぎに、実施の形態１による文書分類装置の機能的構成について説明する。図４は実施の形態１による文書分類装置の構成を機能的に示すブロック図である。図４において、文書分類装置は、入力部４０１と、指定部４０２と、変換部４０３と、変換データ記憶部４０４と、分類部４０５と、分類結果記憶部４０６を含む構成である。
【００３０】
つぎに、各構成部についてその内容を詳細に説明する。なお、入力部４０１、指定部４０２、変換部４０３、変換データ記憶部４０４、分類部４０５、分類結果記憶部４０６は、ＲＯＭ２０２または３０２、ＲＡＭ２０３または３０３、あるいはディスク装置３０６またはハードディスク３１６等の記録媒体に記録されたプログラムに記載された命令に従ってＣＰＵ２０１または３０１等が命令処理を実行することにより、各部の機能を実現するものである。
【００３１】
（入力部４０１）
入力部４０１は、文書データを入力するものであり、たとえば、キーボード２０９または３１１、ＯＣＲ機能を備えたスキャナ３１３、またはネットワーク１０３を経由して文書や文書群を得ることができるＩ／Ｆ２０４または３０９等である。また、入力部４０１は、上記以外に文書データを取得することができるものであれば、それらすべてを含む。
【００３２】
たとえば、文書データがデータベース化されている場合に、そのデータベースが記録された媒体を本実施の形態の文書分類装置に組み入れた場合も文書データの入力とする。さらに、入力した文書データを記憶する図示しない文書データ記憶部を含んでいてもよい。この文書データ記憶部は、たとえば大容量のメモリを有するサーバー１０１のディスク装置２０６等であってもよい。
【００３３】
ここで、文書とは、本実施の形態にあっては、自然言語で記述された一つ以上の文の集まりであり、それが分類対象となる場合はこれを文書という。具体的には、公開特許公報や特定の新聞記事も文書であり、また、請求項や特定の一文を取り出したものであっても、これを文書と見なすものである。
【００３４】
（指定部４０２）
指定部４０２は、文書データの項目を指定するものである。指定部は、具体的には３つの処理から構成される。
【００３５】
まず、入力部４０１により入力された文書データから項目を抽出する（第１処理）。項目を抽出する方法としては、あらかじめ所定の符号（たとえば、「［」「］」等）が付されている項目を検索し、その項目を選択する等の方法がある。
【００３６】
上記第１処理は、指定部４０２でおこなう代わりに、入力部４０１においておこなってもよい。すなわち、入力部４０１が文書データを入力する際に、あわせてその文書データの項目の抽出をおこなう。その抽出結果は文書データと対応付けされて上記文書データ記憶部に記憶される。この場合は、当該抽出結果をもちいることにより、指定部４０２においては上記第１処理は省略されることになる。また、データベースの種類によってはあらかじめ項目に関する情報を有しているものがあり、その項目に関する情報を利用することによっても、上記第１処理は省略される。
【００３７】
つぎに、第１処理による項目の抽出結果、上記文書データ記憶部に記憶された上記抽出結果、または上記項目に関する情報等に基づいて、抽出された各項目がどのような項目であり、その項目に対応する内容はどのようなものであるかの一覧を操作者に提示する（第２処理）。提示の方法としては、ディスプレイ２０８または３０８に項目のみを、あるいは項目とその項目に対応する内容の全部または一部を表示する方法等がある。
【００３８】
項目のみを表示する方法としては、たとえば、項目名を文書データ中の出現順序に基づいて横書で縦一列になるように羅列して表示するといった方法がある。この場合、表示画面上の表示行数よりも項目数が多くなる場合は、折り返して縦二列以上で表示してもよく、また、縦一列で表示して、表示画面を縦方向にスクロールできるようにしてもよい。
【００３９】
項目とその項目に対応する内容の全部または一部を表示する方法としては、たとえば、上述の項目のみを表示する方法と同様に、項目名を文書データ中の出現順序に基づいて横書で縦一列になるように羅列して表示し、さらにその右側に項目名と対応して配置される位置に同じく横書でその内容を表示するといった方法がある。この場合、表示画面上の表示列数よりも内容のデータ量が多くなる場合は、表示画面を横方向にスクロールできるようにしてもよい。
【００４０】
また、項目とその項目に対応する内容の全部または一部を表示する別の方法としては、項目名のみを表示し、項目名が表示されている領域にカーソルを移動させ、所定の操作（マウス２１０または３１２のボタンあるいはキーボード２０９または３１１等の所定キーの押下）により、内容のデータの全部または一部をポップアップして表示するようにしてもよい。
【００４１】
つぎに、操作者の指示に従って、提示（表示）された項目の中から分類処理の対象となる項目を一つまたは二つ以上を同時に指定する（第３処理）。指定の方法としては、キーボード２０９または３１１やマウス２１０または３１２等のポインティングデバイスからの指定に関する指示信号に基づいて、提示されている項目の中から該当する項目を指定する。
【００４２】
この際、項目の指定は一つであってもよく、また、二つ以上を同時に指定してもよい。また、結合の形態を併せて指定することもできる。さらに、指定の順序により、データの変換後の内容データの配列順を指定するようにしてもよい。
【００４３】
（変換部４０３）
変換部４０３は、入力された文書データを前記指定部４０２により指定された項目に対応するデータのみの内容となるように文書データを変換するものである。具体的には、文書データ中の指定された項目に対応するデータだけを抽出し、抽出されたデータのみからなる変換データへ変換するものである。
【００４４】
変換データは、単にもとの文書データにおける指定された項目の順序で各項目に対応するデータを羅列することにより変換される場合のみならず、たとえば指定された項目のデータ内容を文字列として結合して指定された項目のデータ内容だけを含む変換データとすることや、項目の順序をもとの文書データ内における順序と異なる順序に入れ替えてからデータを結合するように変換してもよい。
【００４５】
また、変換部４０３は、変換データにおける各項目のデータが分離可能となるように項目のデータ間に所定の分離記号６０１を挿入する。これにより、各項目に対応するデータの切れ目を瞬時に把握することができる。
【００４６】
また、この分離記号６０１は、形態素解析等の自然言語解析をおこなう場合に特に重要である。各項目に対応するデータが文の体をなしている場合（文の終わりが句点で終わっている場合）は、この分離記号がなくても文と文の切れ目を判断することができるが、各項目に対応するデータが文の体をなしていない場合（箇条書きの文、文の途中で項目が変わる等の場合）は、そのままデータ同士を結合させると、項目によっては、全く別の意味が構成されてしまう場合がある。そのような場合を回避するためにこの分離記号６０１を挿入する。
【００４７】
分離記号６０１は、一般的には、切れ目を表す「／（スラッシュ）」がもちいられるが、変換データ中に「／」が存在する場合には、データの「／」との混同が生じるので、別の記号をもちいることができる。また、この記号を挿入するか否かについてキーボード２０９または３１１に所定のキーを割付け、そのキーを押下するごとに、あるいは表示画面上に所定のアイコンを表示させて、マウス２１０または３１２によりそのアイコンをクリックするごとに、分離記号６０１を挿入する／挿入しないを交互に設定するようにしてもよい。
【００４８】
（変換データ記憶部４０４）
変換データ記憶部４０４は、変換データを記憶する記憶部である。変換データ記憶部４０４としては、たとえば、サーバー１０１のディスク装置２０６またはクライアント側のハードディスク３０５、またはフロッピーディスク３０７等、変換データの容量の違いあるいは用途の違いにより、それぞれ設定することが可能である。
【００４９】
変換データ記憶部４０４には、項目の設定順序等を含む変換データのほか、前記分離記号６０１等も記憶される。変換データ記憶部４０４に記憶された変換データは、別の分類の際に用いる等、活用を図ることができる。
【００５０】
（分類部４０５）
分類部４０５は、変換部４０３により変換された変換データまたは変換データ記憶部４０４に記憶されている変換データの内容にしたがって自動的に分類する。分類部４０５については、たとえば特開平７−３６８９７号公報に開示された「文書分類装置」など従来の文書分類方法を用いて文書を分類することができる。
【００５１】
（分類結果記憶部４０６）
分類結果記憶部４０６は、分類部４０５により分類された結果を記憶する記憶部である。分類結果記憶部４０６としては、変換データ記憶部４０４と同様に、たとえば、サーバー１０１のディスク装置２０６またはクライアント側のハードディスク３０５、またはフロッピーディスク３０７等、変換データの容量の違いあるいは用途の違いにより、それぞれ設定することが可能である。
【００５２】
つぎに、文書データと文書データを変換した変換データの一例について説明する。図５は文書データとその変換データの一例を示す説明図である。図５において、文書群として特許公報群をもちいた場合であり、５０１は文書データの一例であり、５０２は変換データの一例である。
【００５３】
文書データ５０１は、「出願番号」、「出願日」、「発明者」、「発明の名称」、「目的」、「構成」、「請求項１」、「従来技術」、「課題を解決するための手段」、「作用」、「実施例」、「発明の効果」等の項目が含まれている。
【００５４】
従来の文書分類装置では各文書データをひとまとまりとして取り扱うので、複数の項目を含む文書データに対してはすべての項目の内容データが分類処理の対象となり、操作者が望む分類の観点に不必要、あるいは悪影響を与える項目も含まれる場合がある。
【００５５】
本実施の形態においては、分類をおこなう操作者は自分が望む分類の観点に必要と思われる項目を１つ以上指定することができる。たとえば特許公報文書群の分類をおこなう際に、操作者が「発明の課題」に注目したい場合は、「目的」、「課題を解決するための手段」、「作用」、「発明の効果」を指定する。また、解決手段に注目したい場合は、「課題を解決するための手段」および「実施例」を指定することができる。分類の対象となる項目が指定されると、指定された項目に基づいて文書データを変換する。
【００５６】
図５にあっては、操作者が「目的」、「課題を解決するための手段」、「作用」、「発明の効果」の項目を指定した場合において、指定された項目の内容データだけを含むように変換した場合の例である。
【００５７】
変換データ５０２から明らかなように、「目的」の項目に対応するデータである「履歴とともに対応する画面情報を記憶しておき・・・ことを目的とする。」と、「課題を解決するための手段」の項目に対応するデータである「上記目的を達成するために・・・表示する表示手段とを有する。」と、「作用」の項目に対応するデータである「以上の構成において、入力手続きより・・・表示するように動作する。」と、「発明の効果」の項目に対応するデータである「以上説明したように本発明によれば・・・再現できる効果がある。」とが結合して一つの文書を構成している。
【００５８】
また、図６は、同一の文書データをもちいて、操作者が「目的」、「課題を解決するための手段」、「作用」、「発明の効果」を指定した場合において、指定された項目の内容データだけを含み、各項目のデータ間に分離記号６０１（「／」）を挿入するように変換した場合の例である。
【００５９】
つぎに、実施の形態１による文書分類装置の一連の処理の手順について説明する。図７は実施の形態１による文書分類装置の一連の処理の手順を示すフローチャートである。
【００６０】
図７のフローチャートにおいて、まず、入力部４０１は文書データの入力をおこなう（ステップＳ７１０）。また、指定部４０２は項目の指定をおこなう（ステップＳ７２０）。
【００６１】
変換部４０３は、ステップＳ７１０において入力された文書データをステップＳ７２０において指定された項目の内容になるように変換データへ変換する（ステップＳ７３０）。また、必要に応じて分離記号６０１を項目に対応するデータ間に挿入する（ステップＳ７４０）。変換された変換データは、分離記号データとともに変換データ記憶部４０４により記憶される（ステップＳ７５０）。
【００６２】
上記ステップＳ７３０において変換された変換データあるいは上記ステップＳ７５０において変換データ記憶部４０４によって記憶された変換データに基づいて、分類部４０５は文書の分類をおこなう（ステップＳ７８０）。分類処理が終了後、分類処理の結果は分類結果記憶部４０６により記憶され（ステップＳ７９０）、すべての処理は終了する。
【００６３】
以上説明したように、実施の形態１によれば、指定された項目により文書データが変換データへ変換され、その変換データに基づいて文書の分類をおこなうので、その他の不要な項目の内容による分類結果への影響を抑制することができる。また、分離記号６０１の挿入により、変換データにおける結合された項目ごとのデータの識別ができ、かつ、項目間のデータの結合による内容の混同を回避することができる。
【００６４】
（実施の形態２）
さて、実施の形態１では、変換データをもちいて文書を分類したが、以下に説明する実施の形態２のように、変換データを用いて文書の特徴ベクトルを生成し、その特徴ベクトルを用いて文書を分類するようにしてもよい。
【００６５】
まず、実施の形態２による文書分類装置の機能的構成について説明する。図８は、実施の形態２による文書分類装置の構成を機能的に示すブロック図である。図８において、実施の形態１の図４と同一のものに関しては同じ番号を付して、その説明を省略する。
【００６６】
図８において、文書分類装置は、入力部４０１と、指定部４０２と、変換部４０３と、変換データ記憶部４０４と、分類部４０５と、分類結果記憶部４０６のほかに、文書ベクトル生成部８０１と、文書ベクトル記憶部８０２とを含む構成である。
【００６７】
なお、文書ベクトル生成部８０１と文書ベクトル記憶部８０２は、他の構成部と同様に、ＲＯＭ２０２または３０２、ＲＡＭ２０３または３０３、あるいはディスク装置３０６またはハードディスク３１６等の記録媒体に記録されたプログラムに記載された命令に従ってＣＰＵ２０１または３０１等が命令処理を実行することにより、各部の機能を実現するものである。
【００６８】
（文書ベクトル生成部８０１）
文書ベクトル生成部８０１は、各文書の特徴ベクトルを生成する。文書の特徴ベクトルを生成するためには、文書データに対して形態素解析等の自然言語解析処理をおこなう必要がある。この自然言語解析処理は、図示しない文書解析部によって、各文書データについて各項目ごとおこなわれる。形態素解析は従来の形態素解析手法を用いることができる。
【００６９】
文書ベクトル生成部８０１では各文書データに対して前記文書解析部によって得られた解析結果を用いて文書ベクトルを生成するものである。この際に指定部４０２によって指定された項目に関する解析結果のみに基づいて文書ベクトルの生成をおこなう。たとえば各文書データに対して指定部４０２で指定された項目の内容データから得られる特徴ベクトルだけを加算して文書ベクトルを生成することで、指定部４０２で指定された項目の内容データだけを反映した文書ベクトルを生成することができる。
【００７０】
（文書ベクトル記憶部８０２）
文書ベクトル記憶部８０２は、文書ベクトル生成部８０１によって生成された各文書の特徴ベクトルを記憶する記憶部である。文書ベクトル記憶部８０２においては同一文書であっても指定部４０２により指定される項目によっては、その文書の特徴ベクトルが異なってくるので、指定ごとにそれぞれ文書の特徴ベクトルを記憶する。分類部４０５による文書の分類をおこなう際には、あらかじめ文書ベクトル記憶部８０２によって記憶された上記文書の特徴ベクトルをもちいるので、効率よく文書の分類をおこなうことができる。
【００７１】
文書ベクトル記憶部８０２としては、たとえば、サーバー１０１のディスク装置２０６またはクライアント側のハードディスク３０５、またはフロッピーディスク３０７等を、変換データの容量の違いあるいは用途の違いにより、それぞれ設定することが可能である。
【００７２】
（分類部４０５）
分類部４０５は、変換部４０３により変換された各文書の特徴ベクトル間の類似度に基づいて文書を分類するものである。具体的には、生成された分類対象データに対して、カイ自乗法の手法、判別分析の手法、およびクラスタ分析の手法等の分類手法を適用することで、文書分類をおこなうことができる。ここではベクトルデータが適用できる分類手法であれば、その手法は問わない。
【００７３】
つぎに、実施の形態２による文書分類装置の一連の処理の手順について説明する。図９は実施の形態２による文書分類装置の一連の処理の手順を示すフローチャートである。図９のフローチャートにおいて、ステップＳ７１０〜Ｓ７５０までは、実施の形態１の図７のフローチャートと同一ステップなので、同一ステップ番号を付して、その説明は省略する。
【００７４】
上記ステップＳ７３０において変換された変換データあるいは上記ステップＳ７５０において記憶された変換データに基づいて、文書ベクトル生成部８０１は各文書の特徴ベクトルの生成をおこなう（ステップＳ７６０）。生成された各文書の特徴ベクトルは文書ベクトル記憶部８０２により記憶される（ステップＳ７７０）。
【００７５】
上記ステップＳ７６０において変換された変換データあるいはステップＳ７７０において記憶された変換データに基づいて、分類部４０５は文書の分類をおこなう（ステップＳ７８０）。分類処理が終了後、分類処理の結果は分類結果記憶部４０６により記憶され（ステップＳ７９０）、すべての処理は終了する。
【００７６】
以上、実施の形態２によれば、指定された項目により文書データが変換データへ変換され、変換データに基づいて、各文書の特徴ベクトルの生成をおこなうので、操作者の意図をより反映した文書の特徴ベクトルを用いて文書の分類をおこなうことができ、その他の不要な項目の内容による分類結果への影響を抑制することができる。
【００７７】
【発明の効果】
以上説明したように、この発明によれば、文書を分類する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類を効率よくおこなうことが可能な文書分類装置が得られるという効果を奏する。
【００７８】
また、この発明によれば、文書を分類するための各文書の特徴ベクトルを生成する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能な文書分類装置が得られるという効果を奏する。
【００７９】
また、この発明によれば、各変換データの間に区切りとなる記号を挿入するので、形態素解析等の解析処理の際に、複数の項目のデータを一つのデータとして混同して扱われることを回避できるとともに、各項目ごとの内容データが瞬時に識別することが可能な文書分類装置が得られるという効果を奏する。
【００８０】
また、この発明によれば、文書を分類する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が自分が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能な文書分類方法が得られるという効果を奏する。
【００８１】
また、この発明によれば、文書を分類するための各文書の特徴ベクトルを生成する際に、指定された項目の内容データだけが用いられるので、その他の項目の内容による分類結果への影響を防ぐことができる。そのため、操作者が自分が期待する分類の観点に必要と思われる文書データの項目を指定することにより、操作者が望む分類により近い精度の高い分類をおこなうことが可能な文書分類方法が得られるという効果を奏する。
【００８２】
また、この発明によれば、各変換データの間に区切りとなる記号を挿入するので、形態素解析等の解析処理の際に、複数の項目のデータを一つのデータとして混同して扱われることを回避できるとともに、各項目ごとの内容データが瞬時に識別することが可能な文書分類方法が得られるという効果を奏する。
【００８３】
また、この発明によれば、上記のいずれか一つに記載された方法をコンピュータに実行させるプログラムを記録したことで、そのプログラムを機械読み取り可能となり、これによって、上記の動作をコンピュータによって実現することが可能な記録媒体が得られるという効果を奏する。
【図面の簡単な説明】
【図１】この発明の実施の形態１による文書分類装置を構成する情報処理システム全体のハードウエア構成を示す説明図である。
【図２】実施の形態１による文書分類装置を構成する情報処理システムにおけるサーバーをハードウエア的に示す説明図である。
【図３】実施の形態１による文書分類装置を構成する情報処理システムにおけるクライアントをハードウエア的に示す説明図である。
【図４】実施の形態１による文書分類装置の構成を機能的に示すブロック図である。
【図５】実施の形態１による文書分類装置における文書データおよび変換データの内容の一例を示す説明図である。
【図６】実施の形態１による文書分類装置における文書データおよび変換データの内容の別の一例を示す説明図である。
【図７】実施の形態１による文書分類装置の一連の処理の手順を示すフローチャートである。
【図８】この発明の実施の形態２による文書分類装置の構成を機能的に示すブロック図である。
【図９】実施の形態２による文書分類装置の一連の処理の手順を示すフローチャートである。
【符号の説明】
１０１サーバー
１０２クライアント
１０３ネットワーク
２０１ＣＰＵ
２０４Ｉ／Ｆ
２０６ディスク装置
３０１ＣＰＵ
３０６ハードディスク
３０８ディスプレイ
３０９Ｉ／Ｆ
３１１キーボード
３１２マウス
３１３スキャナ
４０１入力部
４０２指定部
４０３変換部
４０４変換データ記憶部
４０５分類部
４０６分類結果記憶部
５０１文書データ
５０２変換データ
６０１分離記号
８０１文書ベクトル生成部
８０２文書ベクトル記憶部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a document classification device that classifies documents based on the contents of the document, a document classification method, and a computer-readable recording medium that records a program that causes a computer to execute the method.
[0002]
[Prior art]
Conventionally, JP-A-7-36897, for example, describes a document classification device that regards each document as a document vector whose features are words, groups these document vectors using a clustering method, and thereby classifies documents automatically.
[0003]
In addition, document data is generally stored in a database, and in many cases bibliographic items such as the creation date and the creator are added to the document content, or the document content itself contains multiple items. For example, a patent publication is composed of a plurality of items such as “claims” and “detailed description of the invention”.
[0004]
[Problems to be solved by the invention]
However, with the prior-art document classification device described above, the operator cannot arbitrarily designate which items of multi-item document data are to be classified, so data that adversely affects classification may be included. Moreover, because multiple items cannot be combined, data useful for classification may be insufficient. As a result, a highly accurate classification result cannot be obtained.
[0005]
To eliminate the problems of the conventional example described above, an object of the present invention is to provide a document classification device, a document classification method, and a computer-readable recording medium recording a program that causes a computer to execute the method, each capable of highly accurate classification that reflects the operator's intention.
[0006]
[Means for Solving the Problems]
In order to solve the above-mentioned problems and achieve the object, the document classification device according to the present invention is a document classification device that classifies documents based on their content, comprising: input means for inputting document data composed of one or more items; designation means for designating the items constituting the document data input by the input means; conversion means for converting the document data so that it contains only the data corresponding to the items designated by the designation means; and classification means for classifying documents using the converted data produced by the conversion means.
[0007]
According to the present invention, only the content data of the designated items is used when classifying documents, so the content of the other items can be prevented from affecting the classification result. Therefore, by designating the items of document data that the operator considers necessary for the expected viewpoint of classification, highly accurate classification closer to the classification the operator wants can be performed efficiently.
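The conversion described above can be sketched in a few lines. This is only an illustration under assumptions: a document is modelled as a Python dict mapping item names to text, and the function name `convert_document` and the sample item names are hypothetical, not taken from the patent.

```python
from typing import Dict, List


def convert_document(document: Dict[str, str], designated: List[str]) -> str:
    """Return converted data holding only the designated items.

    Items are kept in the order they appear in the document
    (dicts preserve insertion order in Python 3.7+).
    """
    kept = [text for name, text in document.items() if name in designated]
    return "".join(kept)


# Hypothetical sample document with three items.
doc = {
    "Application number": "JP11441498A",
    "Purpose": "Classify documents accurately.",
    "Effect": "Unwanted items no longer skew the result.",
}
converted = convert_document(doc, ["Purpose", "Effect"])
print(converted)
```

Only the converted text would then be handed to whatever classifier is used, so the bibliographic item never reaches it.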
[0008]
Further, the document classification device according to the present invention is a document classification device that classifies documents based on their content, comprising: input means for inputting document data composed of one or more items; designation means for designating the items constituting the document data input by the input means; conversion means for converting the document data so that it contains only the data corresponding to the items designated by the designation means; document vector generation means for generating a feature vector of each document using the converted data produced by the conversion means; and classification means for classifying documents using the feature vectors of the documents generated by the document vector generation means.
[0009]
According to the present invention, only the content data of the designated items is used when generating the feature vector of each document for classification, so the content of the other items can be prevented from affecting the classification result. Therefore, by designating the items of document data that the operator considers necessary for the expected viewpoint of classification, highly accurate classification closer to the classification the operator wants can be performed.
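As a rough sketch of this idea, the feature vector below is a plain word-count vector built only from the designated items. Several points are assumptions made for illustration: whitespace splitting stands in for the morphological analysis the description mentions, cosine similarity is one possible way to compare vectors, and the names `tokenize`, `document_vector`, and `cosine` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Stand-in for morphological analysis: lowercase whitespace split.
    return text.lower().split()


def document_vector(document: Dict[str, str], designated: List[str]) -> Counter:
    """Add up word counts from the designated items only."""
    vector: Counter = Counter()
    for name in designated:
        if name in document:
            vector.update(tokenize(document[name]))
    return vector


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


doc = {
    "Inventor": "printer expert",
    "Purpose": "sort printer manuals",
    "Effect": "manuals sorted quickly",
}
vec = document_vector(doc, ["Purpose", "Effect"])
print(vec["manuals"], vec["expert"])
```

Because the "Inventor" item is not designated, its words contribute nothing to the vector, which is the effect the paragraph above describes.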
[0010]
Further, in the document classification device according to the present invention, in the above invention, when converting the document data, the conversion means inserts a predetermined symbol between the data of the items so that the data of each item can be separated.
[0011]
According to the present invention, a delimiter symbol is inserted between the pieces of converted data. This avoids a situation in which, during analysis processing such as morphological analysis, the data corresponding to the items are joined as they are and the converted data as a whole takes on an entirely different meaning.
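A small sketch may make the delimiter idea concrete. The description later names “/” as the usual separator and allows a different symbol when “/” already occurs in the data; the fallback candidates `|` and `#` and the function name `join_items` below are assumptions for illustration only.

```python
from typing import List


def join_items(item_texts: List[str], candidates: str = "/|#") -> str:
    """Join item texts with the first candidate symbol found in none of them."""
    for symbol in candidates:
        if all(symbol not in text for text in item_texts):
            return symbol.join(item_texts)
    raise ValueError("no unused separator symbol among the candidates")


# Joined directly, "paper" + "clip" would read as the unrelated word "paperclip".
print(join_items(["paper", "clip"]))
# "/" already occurs in the data, so the next candidate is used instead.
print(join_items(["read/write unit", "display"]))
```

Raising an error when every candidate is taken is a design choice of this sketch; a real system might instead escape the symbol or let the operator pick one.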
[0012]
Further, the document classification method according to the present invention is a document classification method for classifying documents based on their content, comprising: an input step of inputting document data composed of one or more items; a designation step of designating the items constituting the document data input in the input step; a conversion step of converting the document data so that it contains only the data corresponding to the items designated in the designation step; and a classification step of classifying documents using the converted data produced in the conversion step.
[0013]
According to the present invention, only the content data of the designated items is used when classifying documents, so the content of the other items can be prevented from affecting the classification result. Therefore, by designating the items of document data that the operator considers necessary for the expected viewpoint of classification, highly accurate classification closer to the classification the operator wants can be performed.
[0014]
Further, the document classification method according to the present invention is a document classification method for classifying documents based on their content, comprising: an input step of inputting document data composed of one or more items; a designation step of designating the items constituting the document data input in the input step; a conversion step of converting the document data so that it contains only the data corresponding to the items designated in the designation step; a document vector generation step of generating a feature vector of each document using the converted data produced in the conversion step; and a classification step of classifying documents using the feature vectors of the documents generated in the document vector generation step.
[0015]
According to the present invention, only the content data of the designated items is used when generating the feature vector of each document for classification, so the content of the other items can be prevented from affecting the classification result. Therefore, by designating the items of document data that the operator considers necessary for the expected viewpoint of classification, highly accurate classification closer to the classification the operator wants can be performed.
[0016]
Further, in the document classification method according to the present invention, in the above invention, when converting the document data, the conversion step inserts a predetermined symbol between the data of the items so that the data of each item can be separated.
[0017]
According to the present invention, a delimiter symbol is inserted between the pieces of converted data. This prevents the data of multiple items from being confused and handled as a single piece of data during analysis processing such as morphological analysis, and allows the content data of each item to be identified instantly.
[0018]
Further, the storage medium according to the present invention records a program that causes a computer to execute the method described above, so that the program is machine-readable and the above operations can thereby be realized by a computer.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of a document classification device, a document classification method, and a computer-readable recording medium recording a program that causes a computer to execute the method will be described below in detail with reference to the accompanying drawings.
[0020]
(Embodiment 1)
First, the hardware configuration of the entire information processing system constituting the document classification apparatus according to Embodiment 1 of the present invention will be described. FIG. 1 is an explanatory diagram showing the hardware configuration of the entire information processing system constituting the document classification apparatus according to the first embodiment. In FIG. 1, this information processing system is configured as a server/client system. That is, a server 101 and a plurality of clients 102 are connected by a network 103.
[0021]
The client 102 generates classification data, issues instructions to the server 101, displays classification results, and so on. In accordance with instructions from the client 102, the server 101 performs the processing for document (text) classification, which involves an enormous amount of numerical computation, and sends the result to the client 102. More specifically, the server 101 performs text classification processing, and the client 102 performs classification data generation, processing execution instructions, display of text classification results, and the like.
[0022]
Data is exchanged between the server 101 and the client 102 by file sharing. That is, the two exchange data by creating the files used for classification processing in a shared folder on the server 101. The client 102 can therefore use the shared folder of the server 101 as a network share.
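The file-based exchange can be pictured with the following minimal sketch. Everything specific in it is an assumption: a temporary directory stands in for the shared folder, and the JSON format, the file names, and the functions `client_submit` and `server_process` are hypothetical, since the description does not specify a file layout.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def client_submit(shared: Path, job_id: str, items: List[str]) -> Path:
    """Client side: drop a request file into the shared folder."""
    request = shared / f"{job_id}.request.json"
    request.write_text(json.dumps({"items": items}), encoding="utf-8")
    return request


def server_process(shared: Path, job_id: str) -> Path:
    """Server side: read the request and write a result file next to it."""
    request_path = shared / f"{job_id}.request.json"
    request = json.loads(request_path.read_text(encoding="utf-8"))
    result = shared / f"{job_id}.result.json"
    # Placeholder for the real classification: echo the designated items.
    result.write_text(json.dumps({"classified_on": request["items"]}), encoding="utf-8")
    return result


with tempfile.TemporaryDirectory() as folder:
    shared = Path(folder)  # stands in for the shared folder on the server
    client_submit(shared, "job1", ["Purpose", "Effect"])
    result_path = server_process(shared, "job1")
    outcome = json.loads(result_path.read_text(encoding="utf-8"))

print(outcome)
```

Exchanging plain files keeps the two sides loosely coupled: the client only needs read and write access to the share, not a dedicated protocol.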
[0023]
Next, the hardware configuration of the server 101 and the client 102 will be described. FIG. 2 is an explanatory diagram illustrating the hardware of the server 101 in the information processing system constituting the document classification device according to the first embodiment. As the server 101, for example, a workstation (WS) is used.
[0024]
In FIG. 2, 201 denotes a CPU that controls the entire server 101; 202 a ROM that stores a boot program and the like; 203 a RAM used as a work area for the CPU 201; 204 an interface (I/F) that is connected to the network 103 via a communication line 205 and serves as the interface between the network 103 and the internal components; and 206 a disk device that stores data. Reference numeral 200 denotes a bus that connects these units.
[0025]
In addition, a display 208 that displays document information, image information, function information, and the like, a keyboard 209 and a mouse 210 for inputting data, and the like may be similarly connected. Further, the disk device 206 is provided with a shared folder 207 for exchanging data with the client 102.
[0026]
FIG. 3 is an explanatory diagram showing the client 102 in hardware in the information processing system constituting the document classification device according to the first embodiment. As the client 102, for example, a personal computer (PC) or the like is used.
[0027]
In FIG. 3, 301 denotes a CPU that controls the entire system; 302 a ROM that stores a boot program and the like; 303 a RAM used as a work area for the CPU 301; 304 an HDD (hard disk drive) that controls reading and writing of data on an HD (hard disk) 305 under the control of the CPU 301; 305 the HD that stores data written under the control of the HDD 304; 306 an FDD (floppy disk drive) that controls reading and writing of data on an FD (floppy disk) 307 under the control of the CPU 301; 307 the removable FD that stores data written under the control of the FDD 306; and 308 a display that displays documents, images, function information, and the like.
[0028]
Further, 309 denotes an interface (I/F) that is connected to the network 103 via a communication line 310 and serves as the interface between the network 103 and the internal components; 311 a keyboard with keys for inputting characters, numerical values, various instructions, and the like; 312 a mouse for moving the cursor, selecting ranges, pressing icons and buttons displayed on the screen, and moving and resizing windows; 313 a scanner that optically reads images and has an OCR (Optical Character Reader) function; 314 a printer that prints data contents including classification results; and 315 a bus that connects these units.
[0029]
Next, a functional configuration of the document classification device according to Embodiment 1 will be described. FIG. 4 is a block diagram functionally showing the configuration of the document classification apparatus according to the first embodiment. In FIG. 4, the document classification apparatus includes an input unit 401, a designation unit 402, a conversion unit 403, a conversion data storage unit 404, a classification unit 405, and a classification result storage unit 406.
[0030]
Next, each component will be described in detail. The functions of the input unit 401, the designation unit 402, the conversion unit 403, the conversion data storage unit 404, the classification unit 405, and the classification result storage unit 406 are realized by the CPU 201 or 301 executing instructions in accordance with a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.
[0031]
(Input unit 401)
The input unit 401 inputs document data and is, for example, the keyboard 209 or 311, the scanner 313 with an OCR function, or the I/F 204 or 309, which can obtain a document or a group of documents via the network 103. The input unit 401 also includes anything else capable of acquiring document data.
[0032]
For example, when the document data is stored in a database, incorporating the medium on which that database is recorded into the document classification apparatus of this embodiment also counts as input of document data. The input unit may further include a document data storage unit (not shown) that stores the input document data. This document data storage unit may be, for example, the disk device 206 of the server 101, which has a large-capacity memory.
[0033]
Here, in the present embodiment, a document is a collection of one or more sentences written in a natural language; when such a collection is a classification target, it is called a document. Specifically, a published unexamined patent application or a particular newspaper article is a document, and a claim or a single extracted sentence is likewise regarded as a document.
[0034]
(Designation unit 402)
The designation unit 402 designates an item of document data. Specifically, the designation unit is composed of three processes.
[0035]
First, an item is extracted from the document data input by the input unit 401 (first process). As a method of extracting an item, there is a method of searching for an item to which a predetermined code (for example, “[” “]” or the like) is added in advance and selecting the item.
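The marker-based extraction can be sketched with a regular expression. This is an illustration under assumptions: item names are taken to be enclosed in ASCII square brackets, bodies are assumed not to contain brackets themselves, and the name `extract_items` and the sample text are hypothetical.

```python
import re
from typing import Dict

# An item header is any run of non-bracket characters enclosed in [ ].
ITEM_HEADER = re.compile(r"\[([^\[\]]+)\]")


def extract_items(text: str) -> Dict[str, str]:
    """Split raw document text into {item name: item content}."""
    parts = ITEM_HEADER.split(text)
    # parts = [preamble, name1, body1, name2, body2, ...]
    names, bodies = parts[1::2], parts[2::2]
    return {name.strip(): body.strip() for name, body in zip(names, bodies)}


sample = "[Purpose] Reproduce past screens. [Effect] Screens can be reproduced."
items = extract_items(sample)
print(list(items))
```

If the same item name occurred twice, this sketch would keep only the last occurrence; a fuller version would need a policy for repeated names.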
[0036]
The first process may be performed by the input unit 401 instead of the designation unit 402. That is, when the input unit 401 inputs document data, the document data items are also extracted. The extraction result is stored in the document data storage unit in association with the document data. In this case, the first process is omitted in the designation unit 402 by using the extraction result. Some types of databases have information about items in advance, and the first processing is omitted by using information about the items.
[0037]
Next, based on the result of item extraction by the first process, the extraction result stored in the document data storage unit, the information about the items, or the like, a list showing what each extracted item is and what content corresponds to it is presented to the operator (second process). Presentation methods include displaying only the items on the display 208 or 308, or displaying the items together with all or part of their corresponding content.
[0038]
As a method of displaying only the items, for example, the item names may be written horizontally and arranged in a single vertical column in their order of appearance in the document data. In this case, if the number of items exceeds the number of display lines on the display screen, the list may be wrapped into two or more vertical columns, or it may be displayed in a single vertical column with the display screen scrollable in the vertical direction.
[0039]
As a method of displaying the items together with all or part of the corresponding contents, for example, the item names may be written horizontally and arranged in a single vertical column in their order of appearance in the document data, as in the method of displaying only the items, with the contents also written horizontally to the right of the corresponding item name. In this case, if the amount of content data exceeds the number of display columns on the display screen, the display screen may be made scrollable in the horizontal direction.
[0040]
As another method of displaying the items together with all or part of the corresponding contents, only the item names may be displayed, and all or part of the content data may be displayed in a pop-up when the cursor is moved to the area where an item name is displayed and a predetermined operation is performed (for example, pressing a button of the mouse 210 or 312, or a predetermined key of the keyboard 209 or 311).
[0041]
Next, in accordance with an instruction from the operator, one or more items to be classified are designated from among the presented (displayed) items (third process). As a designation method, the corresponding items are designated from among the presented items based on a designation signal from the keyboard 209 or 311, or from a pointing device such as the mouse 210 or 312.
[0042]
At this time, a single item may be designated, or two or more items may be designated at the same time. The manner in which the designated items are to be combined can also be specified. Furthermore, the order in which the content data is arranged after the data conversion may be determined by the order of designation.
[0043]
(Conversion unit 403)
The conversion unit 403 converts the input document data so that it contains only the data corresponding to the items designated by the designation unit 402. Specifically, only the data corresponding to the designated items is extracted from the document data and converted into conversion data consisting solely of the extracted data.
[0044]
The conversion data need not simply list the data corresponding to each designated item in the order in which the items appear in the original document data. For example, the data contents of the designated items may be concatenated as a character string to form conversion data containing only those contents, or the items may be rearranged into an order different from that in the original document data before being combined.
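The conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the document data is represented as a list of (item name, content) pairs, and the function names `convert` and `combine` are hypothetical.

```python
def convert(items, designated):
    """Return conversion data that holds only the designated items.

    `items` is a list of (item name, content) pairs in document order;
    `designated` lists item names in the order chosen by the operator,
    which may differ from the order in the original document data.
    """
    content_by_name = dict(items)
    return [(name, content_by_name[name])
            for name in designated if name in content_by_name]

def combine(converted):
    """Concatenate the contents of the converted items as one character string."""
    return "".join(content for _, content in converted)

items = [("purpose", "A."), ("configuration", "B."), ("action", "C.")]
converted = convert(items, ["action", "purpose"])
combined = combine(converted)
```

Here the item “configuration” is dropped because it was not designated, and the remaining contents follow the designation order rather than the document order.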
[0045]
The conversion unit 403 also inserts a predetermined separation symbol 601 between the data of the items so that the data of each item in the conversion data can be told apart. This makes the boundaries between the data of the items immediately apparent.
[0046]
The separation symbol 601 is particularly important when natural language analysis such as morphological analysis is performed. If the data corresponding to each item takes the form of sentences (that is, each sentence ends with a punctuation mark), the boundary between sentences can be determined without the separation symbol. If, however, the data corresponding to an item does not take the form of a sentence (for example, bulleted text, or a case in which the item changes in the middle of a sentence), joining the data as it is may produce text whose meaning is completely different from that of the individual items. The separation symbol 601 is inserted to avoid such cases.
[0047]
In general, “/” (slash), which indicates a break, is used as the separation symbol 601. If “/” already occurs in the conversion data, however, it would be confused with the “/” in the data, so another symbol may be used instead. Whether or not the separation symbol 601 is inserted may also be made switchable, for example by assigning a predetermined key of the keyboard 209 or 311 and toggling the setting each time the key is pressed, or by displaying a predetermined icon on the display screen and toggling the setting each time the icon is clicked.
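The handling of the separation symbol can be sketched as below. The sketch follows one possible reading, in which the symbol is inserted only after item data that does not end like a sentence; the candidate symbols other than “/”, the set of sentence-ending marks, and the function names are assumptions made for illustration.

```python
SENTENCE_ENDINGS = ("。", ".", "!", "?")      # assumed sentence-final marks
CANDIDATE_SEPARATORS = ("/", "|", "#")        # "/" first; the others are assumptions

def choose_separator(contents, candidates=CANDIDATE_SEPARATORS):
    """Pick the first candidate symbol that does not occur in any item data."""
    for symbol in candidates:
        if all(symbol not in content for content in contents):
            return symbol
    raise ValueError("every candidate separator occurs in the data")

def join_items(contents, insert_separator=True):
    """Join item data, separating items whose data does not end as a sentence."""
    if not insert_separator or not contents:
        return "".join(contents)
    separator = choose_separator(contents)
    parts = []
    for content in contents[:-1]:
        parts.append(content)
        # Sentence-final punctuation already marks the break, so the
        # separator is added only when it is missing.
        if not content.rstrip().endswith(SENTENCE_ENDINGS):
            parts.append(separator)
    parts.append(contents[-1])
    return "".join(parts)

joined_a = join_items(["display means", "input/output unit", "It works."])
joined_b = join_items(["It works.", "bulleted item", "end"])
joined_c = join_items(["a", "bc"], insert_separator=False)
```

The `insert_separator` flag plays the role of the operator's on/off setting; in the first example “|” is chosen because “/” already occurs in the data.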
[0048]
(Conversion data storage unit 404)
The conversion data storage unit 404 is a storage unit that stores the conversion data. Depending on the volume of the conversion data or its intended use, the disk device 206 of the server 101, the client-side hard disk 305, the floppy disk 307, or the like can be used as the conversion data storage unit 404.
[0049]
The conversion data storage unit 404 stores the conversion data, including the order in which the items were designated, together with the separation symbol 601 and the like. The conversion data stored in the conversion data storage unit 404 can be reused, for example, in another classification.
[0050]
(Classification unit 405)
The classification unit 405 automatically classifies documents according to the content of the conversion data converted by the conversion unit 403 or of the conversion data stored in the conversion data storage unit 404. The classification unit 405 can classify documents using a conventional document classification method, such as that of the “document classification device” disclosed in Japanese Patent Application Laid-Open No. 7-36897.
[0051]
(Classification result storage unit 406)
The classification result storage unit 406 is a storage unit that stores the results of classification by the classification unit 405. As with the conversion data storage unit 404, the disk device 206 of the server 101, the client-side hard disk 305, the floppy disk 307, or the like can be used as the classification result storage unit 406.
[0052]
Next, an example of document data and of the conversion data obtained by converting it will be described. FIG. 5 is an explanatory diagram showing an example of document data and its conversion data. In FIG. 5, a group of patent publications is used as the document group, 501 is an example of document data, and 502 is an example of conversion data.
[0053]
The document data 501 includes items such as “application number”, “application date”, “inventor”, “title of the invention”, “purpose”, “configuration”, “claim 1”, “prior art”, “means for solving the problem”, “action”, “example”, and “effect of the invention”.
[0054]
Since conventional document classification devices handle each piece of document data as a single unit, the content data of all items is subjected to classification processing even when the document data contains multiple items. The data may therefore include items that are unnecessary for, or that adversely affect, classification from the viewpoint desired by the operator.
[0055]
In the present embodiment, the operator who performs the classification can designate one or more items considered necessary for the desired classification viewpoint. For example, when classifying patent gazette documents, an operator who wants to focus on the problem addressed by the invention designates “purpose”, “means for solving the problem”, “action”, and “effect of the invention”. An operator who wants to focus on the means of solution can designate “means for solving the problem” and “example”. When the items to be classified are designated, the document data is converted based on the designated items.
[0056]
FIG. 5 shows an example in which the operator designates the items “purpose”, “means for solving the problem”, “action”, and “effect of the invention”, and the document data is converted so as to include only the content data of the designated items.
[0057]
As is apparent from the conversion data 502, the data corresponding to the item “purpose” (“to store screen information corresponding to the history and so on”), the data corresponding to the item “means for solving the problem” (“having display means for displaying to achieve the above-mentioned object”), the data corresponding to the item “action” (“in the above configuration, from the input procedure, it operates so as to display”), and the data corresponding to the item “effect of the invention” (“According to the present invention as described above … there is an effect that can be reproduced”) are combined to form one document.
[0058]
FIG. 6 shows an example in which, using the same document data, the operator designates “purpose”, “means for solving the problem”, “action”, and “effect of the invention”, and the conversion is performed so that the separation symbol 601 (“/”) is inserted between the data of the designated items.
[0059]
Next, a series of processing procedures of the document classification apparatus according to the first embodiment will be described. FIG. 7 is a flowchart showing a series of processing procedures of the document classification apparatus according to the first embodiment.
[0060]
In the flowchart of FIG. 7, first, the input unit 401 inputs document data (step S710). The designation unit 402 then designates items (step S720).
[0061]
The conversion unit 403 converts the document data input in step S710 into conversion data containing only the contents of the items designated in step S720 (step S730). The separation symbol 601 is inserted between the data corresponding to the items as necessary (step S740). The resulting conversion data is stored in the conversion data storage unit 404 together with the separation symbol data (step S750).
[0062]
Based on the converted data converted in step S730 or the converted data stored in the converted data storage unit 404 in step S750, the classification unit 405 performs document classification (step S780). After the classification process is completed, the result of the classification process is stored in the classification result storage unit 406 (step S790), and all the processes are completed.
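The flow of FIG. 7 can be outlined in a short sketch. It assumes that each document is already available as a mapping from item name to content, that the classifier is supplied from outside as a function, and that a plain dictionary stands in for the storage units; the function name `run_classification` and the toy classifier are hypothetical.

```python
def run_classification(documents, designated, classify, separator="/"):
    """Outline of steps S710 to S790 for documents given as {item name: content}."""
    store = {}
    # S730 and S740: keep only the designated items and join them with the separator.
    converted = [
        separator.join(doc[name] for name in designated if name in doc)
        for doc in documents
    ]
    store["conversion_data"] = converted          # S750
    result = classify(converted)                  # S780
    store["classification_result"] = result       # S790
    return store

documents = [
    {"purpose": "print fast", "example": "x"},
    {"purpose": "scan pages", "example": "y"},
    {"purpose": "print quietly", "example": "z"},
]
# Toy classifier (assumption): label each document by its first word.
outcome = run_classification(documents, ["purpose"],
                             classify=lambda texts: [t.split()[0] for t in texts])
```

Passing the classifier as a parameter mirrors the statement that any conventional document classification method can be used in the classification unit 405.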
[0063]
As described above, according to the first embodiment, the document data is converted into conversion data in accordance with the designated items, and the documents are classified based on the conversion data. The influence that the contents of other, unnecessary items would have on the classification result can therefore be suppressed. In addition, inserting the separation symbol 601 makes it possible to identify the data of each combined item in the conversion data and to avoid confusion of content caused by joining the data of different items.
[0064]
(Embodiment 2)
In the first embodiment, the documents are classified using the conversion data. Alternatively, as in the second embodiment described below, a feature vector of each document may be generated using the conversion data, and the documents may be classified using the feature vectors.
[0065]
First, a functional configuration of the document classification apparatus according to the second embodiment will be described. FIG. 8 is a block diagram functionally showing the configuration of the document classification apparatus according to the second embodiment. In FIG. 8, the same components as those in FIG. 4 of the first embodiment are denoted by the same reference numerals, and the description thereof is omitted.
[0066]
In FIG. 8, the document classification apparatus includes a document vector generation unit 801 and a document vector storage unit 802 in addition to the input unit 401, the designation unit 402, the conversion unit 403, the conversion data storage unit 404, the classification unit 405, and the classification result storage unit 406.
[0067]
As with the other components, the functions of the document vector generation unit 801 and the document vector storage unit 802 are realized by the CPU 201 or 301 executing instruction processing in accordance with instructions described in a program recorded on a recording medium such as the ROM 202 or 302, the RAM 203 or 303, the disk device 306, or the hard disk 316.
[0068]
(Document vector generation unit 801)
The document vector generation unit 801 generates a feature vector for each document. To generate the feature vector of a document, natural language analysis processing such as morphological analysis must be performed on the document data. This natural language analysis processing is performed for each item of each piece of document data by a document analysis unit (not shown). A conventional morphological analysis method can be used for the morphological analysis.
[0069]
The document vector generation unit 801 generates a document vector for each piece of document data using the analysis results obtained by the document analysis unit. At this time, the document vector is generated based only on the analysis results for the items designated by the designation unit 402. For example, by accumulating, for each piece of document data, only the feature data obtained from the content data of the items designated by the designation unit 402, a document vector that reflects only the content data of those items can be generated.
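One simple way to picture such a document vector is a word-count vector built only from the designated items. The sketch below makes several assumptions: whitespace tokenization stands in for morphological analysis, word counts stand in for the feature data, and the names `tokenize` and `build_vectors` are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Stand-in for morphological analysis (assumption): lowercase
    whitespace-separated tokens with surrounding punctuation trimmed."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]

def build_vectors(documents, designated):
    """Build one count vector per document from the designated items only."""
    counts = []
    for doc in documents:
        counter = Counter()
        for name in designated:
            counter.update(tokenize(doc.get(name, "")))
        counts.append(counter)
    # Shared vocabulary so that every vector has the same dimensions.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors

documents = [
    {"purpose": "Store screen information.", "inventor": "Taro"},
    {"purpose": "Display screen history.", "inventor": "Hanako"},
]
vocabulary, vectors = build_vectors(documents, ["purpose"])
```

Because “inventor” is not designated, the inventor names contribute nothing to the vocabulary or to the vectors.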
[0070]
(Document vector storage unit 802)
The document vector storage unit 802 is a storage unit that stores the feature vector of each document generated by the document vector generation unit 801. Even for the same document, the feature vector differs depending on the items designated by the designation unit 402, so the document vector storage unit 802 stores the feature vector of the document separately for each designation. When the classification unit 405 classifies documents, it uses the feature vectors stored in advance in the document vector storage unit 802, so the documents can be classified efficiently.
[0071]
Depending on the volume of the data or its intended use, the disk device 206 of the server 101, the client-side hard disk 305, the floppy disk 307, or the like can be used as the document vector storage unit 802.
[0072]
(Classification unit 405)
The classification unit 405 classifies documents based on the similarity between the feature vectors of the documents, which are generated by the document vector generation unit 801 from the data converted by the conversion unit 403. Specifically, the documents can be classified by applying a classification method such as the chi-square method, discriminant analysis, or cluster analysis to the generated classification target data. Any classification method can be used as long as it can be applied to vector data.
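As a rough illustration of classification by similarity between feature vectors, the following sketch groups documents whose cosine similarity to a group's first member reaches a threshold. This simple single-pass grouping is a stand-in chosen for brevity, not one of the methods named above; the threshold value and function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_similarity(vectors, threshold=0.5):
    """Assign each vector to the first group whose representative is similar enough.

    Each group is a list of document indices; its first member acts as
    the representative. A vector that matches no group starts a new one.
    """
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine(vector, vectors[group[0]]) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups

vectors = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [2, 1, 0]]
groups = group_by_similarity(vectors)
```

Any method that operates on vector data could replace `group_by_similarity` without changing how the vectors are produced.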
[0073]
Next, a series of processing procedures of the document classification apparatus according to the second embodiment will be described. FIG. 9 is a flowchart showing a series of processing procedures of the document classification apparatus according to the second embodiment. In the flowchart of FIG. 9, steps S710 to S750 are the same as those in the flowchart of FIG. 7 of the first embodiment, and therefore the same step numbers are given and description thereof is omitted.
[0074]
Based on the converted data converted in step S730 or the converted data stored in step S750, the document vector generation unit 801 generates a feature vector of each document (step S760). The generated feature vector of each document is stored in the document vector storage unit 802 (step S770).
[0075]
Based on the feature vectors generated in step S760 or the feature vectors stored in step S770, the classification unit 405 performs document classification (step S780). After the classification process is completed, the result of the classification process is stored in the classification result storage unit 406 (step S790), and all the processes are completed.
[0076]
As described above, according to the second embodiment, the document data is converted into conversion data in accordance with the designated items, and the feature vector of each document is generated based on the conversion data. The documents can therefore be classified using feature vectors that better reflect the operator's intention, and the influence that the contents of other, unnecessary items would have on the classification result can be suppressed.
[0077]
[Effects of the invention]
As explained above, according to the present invention, only the content data of the designated items is used when classifying documents, so the contents of other items can be prevented from affecting the classification result. Consequently, by designating the items of the document data considered necessary for the classification viewpoint the operator expects, a document classification apparatus is obtained that can efficiently perform highly accurate classification close to the classification the operator desires.
[0078]
Also, according to the present invention, only the content data of the designated items is used when generating the feature vector of each document for classification, so the contents of other items can be prevented from affecting the classification result. Consequently, by designating the items of the document data considered necessary for the classification viewpoint the operator expects, a document classification apparatus is obtained that can perform highly accurate classification close to the classification the operator desires.
[0079]
Also, according to the present invention, since a separation symbol is inserted between the data of the items in the conversion data, the data of a plurality of items can be prevented from being mistaken for a single piece of data in analysis processing such as morphological analysis. In addition, a document classification apparatus is obtained in which the content data of each item can be identified instantly.
[0080]
Also, according to the present invention, only the content data of the designated items is used when classifying documents, so the contents of other items can be prevented from affecting the classification result. Consequently, by designating the items of the document data considered necessary for the classification viewpoint the operator expects, a document classification method is obtained that can perform highly accurate classification close to the classification the operator desires.
[0081]
Also, according to the present invention, only the content data of the designated items is used when generating the feature vector of each document for classification, so the contents of other items can be prevented from affecting the classification result. Consequently, by designating the items of the document data considered necessary for the classification viewpoint the operator expects, a document classification method is obtained that can perform highly accurate classification close to the classification the operator desires.
[0082]
Also, according to the present invention, since a separation symbol is inserted between the data of the items in the conversion data, the data of a plurality of items can be prevented from being mistaken for a single piece of data in analysis processing such as morphological analysis. In addition, a document classification method is obtained in which the content data of each item can be identified instantly.
[0083]
Also, according to the present invention, recording a program that causes a computer to execute any one of the methods described above makes the program machine-readable, so that a recording medium is obtained with which the operations described above can be realized by a computer.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram showing the hardware configuration of the entire information processing system that constitutes the document classification device according to Embodiment 1 of the present invention.
FIG. 2 is an explanatory diagram showing the hardware configuration of a server in the information processing system constituting the document classification device according to the first embodiment.
FIG. 3 is an explanatory diagram showing the hardware configuration of a client in the information processing system constituting the document classification device according to the first embodiment.
FIG. 4 is a block diagram functionally showing the configuration of the document classification device according to the first embodiment.
FIG. 5 is an explanatory diagram showing an example of the contents of document data and conversion data in the document classification device according to the first embodiment.
FIG. 6 is an explanatory diagram showing another example of the contents of document data and conversion data in the document classification device according to the first embodiment.
FIG. 7 is a flowchart showing a series of processing procedures of the document classification device according to the first embodiment.
FIG. 8 is a block diagram functionally showing the configuration of a document classification apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing a sequence of processing steps of the document classification device according to the second embodiment.
[Explanation of symbols]
101 Server
102 Client
103 Network
201 CPU
204 I/F
206 Disk device
301 CPU
306 Hard disk
308 Display
309 I/F
311 Keyboard
312 Mouse
313 Scanner
401 Input unit
402 Designation unit
403 Conversion unit
404 Conversion data storage unit
405 Classification unit
406 Classification result storage unit
501 Document data
502 Conversion data
601 Separation symbol
801 Document vector generation unit
802 Document vector storage unit

Claims

A document classification apparatus that classifies documents based on the contents of the documents, the document classification apparatus comprising:
means for acquiring document data;
means for searching the acquired document data for a predetermined code and extracting one or more items;
means for causing at least one of the extracted items to be designated;
means for converting the document data so that it contains only the data corresponding to the designated item; and
means for classifying the documents using the converted data,
wherein the means for converting inserts a separation symbol between the data of the items when the data corresponding to the designated items does not form a sentence.

A document classification apparatus that classifies documents based on the contents of the documents, the document classification apparatus comprising:
means for acquiring document data;
means for searching the acquired document data for a predetermined code and extracting one or more items;
means for causing at least one of the extracted items to be designated;
means for converting the document data so that it contains only the data corresponding to the designated item;
means for generating a feature vector of each document using the converted data; and
means for classifying the documents using the generated feature vector of each document,
wherein the means for converting inserts a separation symbol between the data of the items when the data corresponding to the designated items does not form a sentence.

3. The document classification apparatus according to claim 1, wherein a symbol that does not exist in the conversion data is used as the separation symbol.

The document classification apparatus according to claim 1, wherein whether or not the means for converting inserts the separation symbol can be set by an operator.

The document classification apparatus according to claim 1, wherein the document data has one or more predetermined codes.

The document classification apparatus according to claim 1, wherein the means for causing designation comprises:
display means for displaying the items extracted by the means for extracting; and
means for designating at least one item from among the items presented on the display means.

A document classification method in a document classification apparatus that classifies documents based on the contents of the documents, the method being performed by the document classification apparatus and comprising:
a step in which acquisition means included in the document classification apparatus acquires document data composed of one or more items;
a step in which extraction means included in the document classification apparatus searches the acquired document data for the items and extracts one or more items;
a step in which designation means included in the document classification apparatus designates at least one of the extracted items;
a step in which conversion means included in the document classification apparatus converts the document data so that it contains only the data corresponding to the designated item; and
a step in which classification means included in the document classification apparatus classifies the documents using the converted data,
wherein, in the converting step, a separation symbol is inserted between the data of the items when the data corresponding to the designated items does not form a sentence.

A document classification method in a document classification apparatus that classifies documents based on the contents of the documents, the method being performed by the document classification apparatus and comprising:
a step in which acquisition means included in the document classification apparatus acquires document data composed of one or more items;
a step in which extraction means included in the document classification apparatus searches the acquired document data for the items and extracts one or more items;
a step in which designation means included in the document classification apparatus designates at least one of the extracted items;
a step in which conversion means included in the document classification apparatus converts the document data so that it contains only the data corresponding to the designated item;
a step in which generation means included in the document classification apparatus generates a feature vector of each document using the converted data; and
a step in which classification means included in the document classification apparatus classifies the documents using the generated feature vector of each document,
wherein, in the converting step, a separation symbol is inserted between the data of the items when the data corresponding to the designated items does not form a sentence.

9. The document classification method according to claim 7, wherein a symbol that does not exist in the conversion data is used as the separation symbol.

10. The document classification method according to claim 7, wherein whether or not the separation symbol is inserted in the converting step can be set by a user.

A computer-readable recording medium having recorded thereon a program for causing the document classification apparatus to execute each step of the method according to any one of claims 7 to 10.