JP2004054303A

JP2004054303A - System for making electronic dictionary for document classification and system using it for classifying document

Info

Publication number: JP2004054303A
Application number: JP2002206549A
Authority: JP
Inventors: Masami Hara; 原　正巳; Otoya Shirotsuka; 城塚　音也; Yoshio Eriguchi; 江里口　善生; Toru Takagi; 高木　徹
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2002-07-16
Filing date: 2002-07-16
Publication date: 2004-02-19

Abstract

PROBLEM TO BE SOLVED: To improve the certainty that an appropriate category is assigned to a document to be classified.

SOLUTION: A system dictionary 13 and user dictionaries 11A, 11B, ... are prepared, each listing character strings and, for each character string, the one or more categories to which it maps. A text classification device 1 acquires a classification target text 19 and determines candidate categories to assign to it on the basis of the system dictionary 13 and the user dictionaries 11A, 11B, .... The text classification device 1 displays the reason the category became a candidate, a question asking whether the category may be assigned, and a user interface for receiving an answer to that question. When an affirmative answer to the question is received, the text classification device 1 assigns the category to the classification target text 19.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、複数のサンプル文書に基づいて文書分類用の電子的な辞書を作成するための新規な技術、及び、その電子的な辞書を利用した新規な文書分類技術に関する。
【０００２】
【従来の技術】
従来、文字のみからなる文書（以下、「テキスト」と言う）を自動分類するテキスト分類システムが知られている。従来のテキスト分類システムでは、以下のようにしてテキストの自動分類が行われている。
【０００３】
すなわち、従来のテキスト分類システムでは、テキストについての複数のカテゴリの各々について、そのカテゴリが付与されている多数の学習用テキスト（サンプル的なテキスト）から成る学習用テキスト群が用意されている。テキスト分類システムは、各カテゴリについて、学習用テキスト群から各単語を抽出して統計的な処理を行うことで、単語統計データ（例えば、そのカテゴリにおける各単語の分布を示すデータ）を作成する。そして、テキスト分類システムは、分類対象のテキスト（以下、分類対象テキスト）が入力されたときは、入力された分類対象テキストから各単語を抽出し、抽出された各単語と、カテゴリ別に作成された単語統計データとに基づいて分類対象テキストにカテゴリを付与することで、分類対象テキストの自動分類が行われる。
【０００４】
【発明が解決しようとする課題】
従来のテキスト分類システムには例えば以下のような問題点がある。
【０００５】
すなわち、統計データに基づく分類手法であるため、例えば、（１）辞書構築及び分類判定に膨大な計算を要するという問題点、（２）統計データは単なる数値でしかないので、分類の特徴を的確に表すのが困難であるという問題点、（３）人間が分類結果を正しいかどうか判断するための情報をシステムから得ることができないという問題点、及び、（４）分類対象テキストの内容を正確に記憶しておく必要があり、そのため、分類対象テキストの内容を知らない人間では付与カテゴリ適否判断は全く行うことができないという問題点がある。
【０００６】
また、従来は、カテゴリ別に作成された統計データに基づいて上述の自動分類が行われるにすぎないため、分類対象テキストに最終的に人間が付与したカテゴリをフィードバックして辞書を再構築するには膨大な処理時間が必要となるため実用上は困難である。
【０００７】
以上のような問題点は、テキストに限らず、画像や線画等を含んだ他の種類の文書についても存在すると考えられる。
【０００８】
従って、本発明の目的は、分類対象の文書に対して適切なカテゴリが付与することの確実性を高められるようにすることにある。
【０００９】
本発明の別の目的は、分類対象の文書に付与されるカテゴリが適切か否かの判断を容易に行うことができるようにすることにある。
【００１０】
本発明のまた別の目的は、分類対象の文書に対して人間が判断したカテゴリの結果に基づいて、文書の自動分類を行うシステムに対して適宜にフィードバックをかけることができるようにすることにある。
【００１１】
【課題を解決するための手段】
本発明の第１の側面に従うシステムは、関連度合判定手段と辞書作成手段とを備える。関連度合判定手段は、文書分類用の電子的な辞書を作成するためのシステムであって、文書についての複数のカテゴリの各々に属する複数のサンプル文書を入力し、各カテゴリについて、上記複数のサンプル文書に含まれている多数の文字列の各々について、その文字列とそのカテゴリとの関連度合の判定を行う。辞書作成手段は、その判定の結果を基に、上記多数の文字列の中から、上記関連度合が所定度合以上である各特定文字列を選択し、上記選択された各特定文字列とそれに対応した１以上のカテゴリとが記録された電子的な辞書を作成する。
【００１２】
ここで、「文書」とは、文字を含んだ文書、例えば、文字のみの文書や、文字の他に画像或いは線画等が含まれた文書である。
【００１３】
また、「カテゴリ」は、文書を分類可能にせしめるものであれば何でも良く、例えば、人間が理解できる言葉（例えば、小説、随筆など）であっても良いし、数字や英字等から成るコード群であっても良い。
【００１４】
また、「文字列」とは、例えば、文法上の単語（すなわち言語の最小単位）や、連続する単語（例えば「文書作成システム」という文字列）や、助詞を介して連なる複数の単語（例えば「私の物」という文字列）や、複数の助詞と複数の単語から成る文字列である。
【００１５】
本発明の第１の側面に従うシステムの好適な実施形態では、上記関連度合判定手段は、上記多数の文字列の各々について、（１）同一のカテゴリに属する上記複数のサンプル文書に含まれているその文字列の総数と、（２）存在するサンプル文書の総数、又は、上記存在するサンプル文書に含まれているその文字列の総数とに基づいて、その文字列に関する上記関連度合を判定する。
【００１６】
本発明の第２の側面に従うシステムは、文書を分類するためのシステムであって、辞書取得手段と、文書取得手段と、比較照合手段と、文書分類手段とを備える。辞書取得手段は、文書についての複数のカテゴリの各々に対応した１以上の文字列（例えば、そのカテゴリに対して特徴的な文字列）が記載されている電子的な辞書を取得する。文書入力手段は、分類対象の文書を取得する（例えば、ユーザから分類対象の文書を受ける、或いは、ユーザの要求に応じた文書を作成することで、分類対象の文書を取得する）。比較照合手段は、取得された分類対象の文書に含まれている各文字列と、上記電子的な辞書に記載されている各文字列（例えば上述の特徴的な文字列）との比較照合を行う。文書分類手段は、その比較照合の結果、互いに一致した文字列が存在する場合、上記電子的な辞書を参照して上記一致した文字列に対応したカテゴリを把握し、上記取得された文書に対して上記把握されたカテゴリである付与対象カテゴリを付与することで上記取得された文書を分類する。
【００１７】
本発明の第２の側面に従うシステムの好適な実施形態では、システムは、根拠文字列抽出手段と、報知手段と、問い出力手段と、回答受け手段とを更に備える。根拠文字列抽出手段は、上記付与対象カテゴリが付与されることとなる根拠となった上記一致した文字列又はその文字列を含んだ第１の長い文字列を上記取得された文書から抽出する。報知手段と、上記抽出された上記一致した文字列又は上記第１の長い文字列と、上記付与対象カテゴリとを、上記文書を入力したユーザに報知する。問い出力手段は、上記取得された文書に対し上記報知された付与対象カテゴリを付与して良いか否かの問いを上記ユーザに出力する。回答受け手段は、上記出力された問いに対する回答を上記ユーザから受ける。
【００１８】
この場合、上記文書分類手段は、上記回答受け手段が肯定的な回答を受けたときにのみ、上記取得された文書に対し上記付与対象カテゴリを付与する。
【００１９】
これの更に好適な実施形態では、上記根拠文字列抽出手段は、上記複数のカテゴリの各々に属する複数のサンプル文書に含まれている、上記一致した文字列を含んだ特定のサンプル文書から、上記一致した文字列を含んだ第２の長い文字列を上記特定のサンプル文書から抽出し、上記報知手段は、上記第１の長い文字列と、上記付与対象カテゴリと、上記第２の長い文字列と、上記特定のサンプル文書が属するカテゴリとを上記ユーザに報知する（例えば、上記第１の長い文字列と上記付与対象カテゴリとの組合せと、上記第２の長い文字列とそれが含まれているサンプル文書が属するカテゴリとの組合せとを並べて所定の画面に表示する）。
【００２０】
また、好適な実施形態では、電子的な辞書は複数個存在し、複数個の電子的な辞書には、各ユーザによって作成された各ユーザ専用のユーザ辞書が含まれており、システムは、ユーザからの要求に応答してそのユーザに専用のユーザ辞書を編集する辞書編集手段を更に備える。
【００２１】
本発明の第３の側面に従う方法は、文書分類用の電子的な辞書を作成するための方法であって、文書についての複数のカテゴリの各々に属する複数のサンプル文書を入力し、各カテゴリについて、上記複数のサンプル文書に含まれている多数の文字列の各々について、その文字列とそのカテゴリとの関連度合の判定を行うステップと、上記判定の結果を基に、上記多数の文字列の中から、上記関連度合が所定度合以上である各特定文字列を選択し、上記選択された各特定文字列とそれに対応した１以上のカテゴリとが記録された電子的な辞書を作成するステップとを有する。
【００２２】
本発明の第４の側面に従う方法は、文書を分類するための方法であって、文書についての複数のカテゴリの各々に対応した文字列であってそのカテゴリに対して特徴的な文字列が２個以上記載されている電子的な辞書を取得するステップと、分類対象の文書を取得するステップと、上記取得された文書に含まれている各文字列と、上記電子的な辞書に記載されている各特徴的な文字列との比較照合を行うステップと、上記比較照合の結果、互いに一致した文字列が存在する場合、上記電子的な辞書を参照して上記一致した文字列に対応したカテゴリを把握し、上記取得された文書に対して上記把握されたカテゴリである付与対象カテゴリを付与することで上記取得された文書を分類するステップとを有する。
【００２３】
本発明の第５の側面に従うコンピュータプログラムは、文書分類用の電子的な辞書を作成するためのコンピュータプログラムであって、文書についての複数のカテゴリの各々に属する複数のサンプル文書を入力し、各カテゴリについて、上記複数のサンプル文書に含まれている多数の文字列の各々について、その文字列とそのカテゴリとの関連度合の判定を行うステップと、上記判定の結果を基に、上記多数の文字列の中から、上記関連度合が所定度合以上である各特定文字列を選択し、上記選択された各特定文字列とそれに対応した１以上のカテゴリとが記録された電子的な辞書を作成するステップとをコンピュータに実行させるためのものである。
【００２４】
本発明の第６の側面に従うコンピュータプログラムは、文書を分類するためのコンピュータプログラムであって、文書についての複数のカテゴリの各々に対応した文字列が含まれている（例えば、そのカテゴリに対して特徴的な文字列が２個以上記載されている）電子的な辞書を取得するステップと、分類対象の文書を取得するステップと、上記取得された文書に含まれている各文字列と、上記電子的な辞書に記載されている各特徴的な文字列との比較照合を行うステップと、上記比較照合の結果、互いに一致した文字列が存在する場合、上記電子的な辞書を参照して上記一致した文字列に対応したカテゴリを把握し、上記取得された文書に対して上記把握されたカテゴリである付与対象カテゴリを付与することで上記取得された文書を分類するステップとをコンピュータに実行させるためのものである。
【００２５】
本発明の第７の側面に従うデータは、文書分類用の電子的な辞書であって、文書についての複数のカテゴリの各々に対応した、そのカテゴリに対して特徴的であるとみなされた又は推定された１又は複数の文字列を含んでいる。各文字列は、１つのカテゴリに対して１個以上が対応付けられている。例えば、２種類の文字列にそれぞれ同一のカテゴリが対応付けられている場合もあれば、同一の文字列に、２種類のカテゴリが対応付けられている場合もある。前者の場合は、２種類の文字列のうちいずれが出現しても、その同一のカテゴリを付与することができ、後者の場合は、その同一の文字列が出現したら、必ず２種類のカテゴリの両方を付与することができる。
【００２６】
上述した本発明の各システムに備えられる複数の手段は、１台のコンピュータシステムに搭載することもできるし、分散された複数台のコンピュータシステムに分けて搭載することもできる。
【００２７】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態を説明する。
【００２８】
図１は、本発明の一実施形態に係るシステムの全体構成を示す。
【００２９】
この実施形態では、学習用テキスト群保存部３と、辞書作成装置５と、テキスト分類装置１とが備えられる。なお、辞書作成装置５とテキスト分類装置１は、それぞれが１台のコンピュータマシンであって物理的に独立したものであっても良いし、それぞれが１つのアプリケーションプログラムであって１台のコンピュータマシンに搭載されても良いし、２台のコンピュータマシンに分けて搭載されても良い。
【００３０】
学習用テキスト群保存部３には、例えば図２に示すように、複数の学習用テキスト群２１Ａ、２１Ｂ、…が保存されている。各学習用テキスト群、例えば学習用テキスト群２１Ａは、テキストについての同一のカテゴリ（例えば「カテゴリＡ」）が付与されている複数の学習用テキスト２１Ａ_１、２１Ａ_２、２１Ａ_３、…から成っている（これについては、他の学習用テキスト群２１Ｂ・・・についても同様である）。その結果、各学習用テキスト群は、複数のカテゴリのうちのいずれか１つに対応付けられている。なお、「学習用テキスト」とは、記述した通り、既にカテゴリが付与されているテキストであってサンプル的なもの（例えば、過去にテキスト分類がされたテキスト）である。学習用テキストは、２以上のカテゴリが付与されている場合は、それら２以上のカテゴリに対応した各学習用テキスト群に存在することとなる。
【００３１】
辞書作成装置５は、学習用テキスト群保存部３に保存されている各学習用テキスト群から各文字列を抽出し、抽出された各文字列のそれに対応したカテゴリにおける重要度を算出し、算出された重要度の結果に基づいて、テキスト分類用の電子的な辞書を作成する。作成された辞書は、後述のシステム辞書１３として出力される（このため、その電子的な辞書を以下「システム辞書」と称する）。
【００３２】
テキスト分類装置１は、複数人のユーザが利用することができる。テキスト分類装置１は、分類対象のテキスト（以下、分類対象テキスト）１９を特定の方法で取得し、辞書作成装置５によって作成されたシステム辞書１３と、複数のユーザによって作成された電子的な辞書（以下、ユーザ辞書）１１Ａ、１１Ｂ、…とに基づいて、取得された分類対象テキスト１９に対し、上述した複数のカテゴリのうちの少なくとも１つを付与する。テキスト分類装置１は、システム辞書記憶部７と、ユーザ辞書記憶部９と、テキスト分類部１５と、ユーザ辞書更新部１７とを有する。
【００３３】
システム辞書記憶部７は、辞書作成装置５によって作成されて出力されたシステム辞書１３を記憶する（記憶されているシステム辞書１３は、例えばハッシュ化されている）。システム辞書１３は、例えば図３に示すように、各特徴的文字列（図３には「語句」と表記）と、各特徴的文字列に対応した１又は２以上のカテゴリとが記載されている（「特徴的文字列」とは、後述の方法で、辞書作成装置５によって、そのカテゴリにおいて特徴的であると推定された文字列である）。このシステム辞書１３は、特定の方法で（例えば、フロッピー（登録商標）ディスクやＭＯディスク等の可搬記録媒体を介して、或いは、有線又は無線の通信システムを利用して）、辞書作成装置５からテキスト分類装置１のシステム辞書記憶部７に取り込むことができる。
【００３４】
ユーザ辞書記憶部９は、例えば図４に示すように、複数のユーザの各々が作成したユーザ辞書（具体的には例えば、ユーザＡのユーザ辞書１１Ａや、ユーザＢのユーザ辞書１１Ｂや、・・・）を記憶する（記憶されている各ユーザ辞書１１Ａ、１１Ｂ、…は、例えばハッシュ化されている）。各ユーザ辞書、例えばユーザ辞書１１Ａには、ユーザＡの１以上の所望の文字列（図４には「語句」と表記、以下、ユーザ所望文字列）と、各ユーザ所望文字列に対応した１又は２以上のカテゴリとが記載されている（他のユーザ辞書１１Ｂ、…の構成も同様である）。各ユーザ辞書１１Ａ、１１Ｂ、…は、テキスト分類装置１に搭載されている所定のアプリケーションプログラムを利用して作成することもできるし、図示しない外部機器（例えば、パーソナルコンピュータや、それから出力されたユーザ辞書を記憶した可搬記録媒体など）からテキスト分類装置１に入力することでユーザ辞書記憶部９に記憶させることもできる。
【００３５】
テキスト分類部１５は、分類対象テキスト１９を特定の方法で取得し、システム辞書記憶部７が記憶しているシステム辞書１３と、ユーザ辞書記憶部９が記憶しているユーザ辞書１１Ａ、１１Ｂ、…とに基づいて、取得された分類対象テキスト１９に対して付与の対象となるカテゴリ（以下、付与対象カテゴリ）を上述した複数のカテゴリの中から把握する。そして、テキスト分類部１５は、把握された付与対象カテゴリを分類対象テキスト１９に付与して良いか否かの問い（以下、付与適否問い）等を表示する。テキスト分類部１５は、付与適否問いに対して肯定的な回答を受けた場合は、分類対象テキスト１９に上記の付与対象カテゴリを付与し、一方、付与適否問いに対して否定的な回答を受けた場合は、所定の処理（例えば、付与対象となり得る別のカテゴリを探す処理）を実行する。
【００３６】
ユーザ辞書更新部１７は、複数のユーザのうちの或るユーザからの要求を受け、そのユーザの要求に応答して、ユーザ辞書記憶部９が記憶している１又は複数のユーザ辞書１１Ａ、１１Ｂ、…のうちそのユーザに対応したユーザ辞書を編集して更新する。
【００３７】
以上が、本実施形態に係るシステムの概要である。以下、このシステムにおいての処理流れ、換言すれば、辞書作成装置５、テキスト分類部１５、及びユーザ辞書更新部１７が行う処理内容について詳細に説明する。
【００３８】
図５は、辞書作成装置５がシステム辞書１３を作成する際の処理流れを示す。なお、図５以降及び以下の説明では、「ステップ」を「Ｓ」と略記する。
【００３９】
辞書作成装置５は、所定のイベントが発生したときに（例えばユーザから要求を受けたときに）起動する。辞書作成装置５は、まず、学習用テキスト群保存部３内の各学習用テキスト群２１Ａ、２１Ｂ、…に含まれている各学習用テキストから、各文字列を抽出する（Ｓ１）。ここで、「文字列」は、文法上の単語（すなわち言語の最小単位）であっても良いし、連続する単語であっても良いし（例えば図３に示したように「自動焦点調節」の如くであっても良いし）や、助詞を介して連なる複数の単語であっても良い（例えば「私の物」の如くであっても良いし）、複数の助詞と複数の単語から成る文字列であっても良い（例えば「私は特許出願をした」の如くであっても良い）。
【００４０】
次に、辞書作成装置５は、抽出された各文字列について、その文字列が含まれていた学習用テキストが属するカテゴリ（以下、「該当カテゴリ」と言う）においての重要度、換言すれば、抽出された各文字列とそれの該当カテゴリとの関連度合を算出する。具体的には、例えば、辞書作成装置５は、抽出された各文字列について、全体に対して該当カテゴリで出現するその文字列の出現比率（以下、「文字列出現比率」と言う）を算出する（Ｓ２）。辞書作成装置５は、以下の（１）式
文字列出現比率＝該当カテゴリに対応した学習用テキスト群での出現頻度／全ての学習用テキスト群での出現頻度・・・（１）
を用いて、上記抽出された各文字列についての文字列出現比率を算出する。ここで、「出現頻度」とは、出現した回数のことである（なお、上記（１）式は、一例であり、具体的な記載はしないが、他の算出方法もある）。
【００４１】
辞書作成装置５は、上記抽出された各文字列について、算出された文字列出現比率が所定の閾値を超えているか否かの判定を行う。その判定の結果、辞書作成装置５は、肯定的な結果が得られた文字列（すなわち、文字列出現比率が所定の閾値を超えている文字列）を、システム辞書に登録する候補とする（Ｓ３）（以下、その文字列を「登録候補文字列」と言う）。
【００４２】
ここで、複数の登録候補文字列の文字列出現比率の中には、文字列出現比率の算出の際に母数（つまり上記（１）式の分母の値）が所定値以下であったために、上記閾値を超えることとなった文字列出現比率がある可能性がある。そこで、辞書作成装置５は、各登録候補文字列について、それの文字列出現比率の算出の際に母数が所定数以下であった登録候補文字列に対しては、母比率検定を行うことによって、その登録候補文字列がシステム辞書１３に登録されることが妥当であるか否かの判定を行う（Ｓ４）。
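Steps S3 and S4 above (threshold check, then a population proportion test for candidates whose denominator in formula (1) is small) could be sketched as follows. The description does not say which test is used or how it is parameterized, so the one-sided exact binomial test, the function names, and the `min_total` and `alpha` values here are all assumptions made for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def should_register(in_category, total, threshold, min_total=10, alpha=0.05):
    """Decide whether a character string is registered in the system dictionary.

    in_category: occurrences in the learning text group of the category
    total:       occurrences in all learning text groups (denominator of (1))
    min_total, alpha: hypothetical parameters, not taken from the description
    """
    ratio = in_category / total
    if ratio <= threshold:        # S3: the ratio must exceed the threshold
        return False
    if total > min_total:         # S4 is skipped when the denominator is large
        return True
    # S4 (assumed form): keep the string only if the observed count would be
    # unlikely were the true proportion no higher than the threshold.
    return binomial_upper_tail(in_category, total, threshold) < alpha

print(should_register(2, 2, threshold=0.8))     # tiny sample
print(should_register(90, 100, threshold=0.8))  # large sample
```

With these assumed settings, a string seen only twice is rejected even though its ratio is 1.0, which is the situation paragraph [0042] is guarding against.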
【００４３】
その後、辞書作成装置５は、Ｓ４の判定において妥当であると判断された登録候補文字列、及び、Ｓ４の判定が行われなかった登録候補文字列（すなわち、Ｓ４の判定を行う必要性がなかった登録候補文字列）と、それに対応した該当カテゴリとを、システム辞書１３となる所定のファイルに記録する（Ｓ５）。
【００４４】
以上の処理流れが完了することで、各カテゴリにおいて重要度が一定値以上である文字列（換言すれば、各カテゴリにおいて特徴的であると推定された文字列）とそれに対応したカテゴリとのみが記録されたシステム辞書１３が作成される。このシステム辞書１３に記載されている複数の文字列（以下、「特徴的文字列」と言う）と、上述したユーザ辞書１１Ａ、１１Ｂ、…に記載されている複数のユーザ所望文字列とに基づいて、テキスト分類部１５によって分類対象テキスト１９の分類が行われる。
【００４５】
図６は、テキスト分類部１５が分類対象テキスト１９を分類する際の処理流れを示す。
【００４６】
テキスト分類装置１には、ユーザによって特定の方法で分類対象テキスト１９が入力される（Ｓ１０）。ここで言う「特定の方法」とは、例えば、ユーザの要求に応答してテキスト分類装置１内で新たな分類対象テキストが作成されることや、フロッピー（登録商標）ディスク等の可搬記録媒体や有線又は無線の通信システムを介すること等である。
【００４７】
分類対象テキスト１９が入力されたら、テキスト分類部１５は、その分類対象テキスト１９に記載されている各文字列を抽出する（Ｓ１１）。
【００４８】
そして、テキスト分類部１５は、ユーザ辞書記憶部９にアクセスして、分類対象テキスト１９を入力したユーザ（以下、そのユーザを「ユーザＡ」とする）のユーザ辞書１１Ａを参照し、Ｓ１１で抽出された各文字列（以下、その文字列を「抽出文字列」と言う）と、ユーザ辞書１１Ａに記載されている１以上のユーザ所望文字列との比較照合（以下、「ユーザ辞書比較照合」と言う）を実行する（Ｓ１２）。なお、複数のユーザ辞書１１Ａ、１１Ｂ、…のうちどれがユーザＡのユーザ辞書であるかは、所定の方法で判別することができる（例えば、各ユーザ辞書１１Ａ、１１Ｂ、…にユーザ識別コードを対応付けておき、ユーザＡがユーザＡのユーザ識別コードを入力すれば、入力されたユーザ識別コードと、各ユーザ辞書１１Ａ、１１Ｂ、…に対応付けられているユーザ識別コードとを比較照合することでユーザ辞書１１Ａを判別することができる）。また、Ｓ１２では、テキスト分類部１５は、例えば、複数のユーザからなるグループで辞書を共有、あるいはそのグループのユーザが利用可能なグループ辞書を作成し、共有の辞書及びグループ辞書の少なくとも１つを用いて上述のユーザ辞書比較照合を実行してもよい。
【００４９】
Ｓ１２の後、テキスト分類部１５は、システム辞書記憶部７にアクセスしてシステム辞書１３を参照し、Ｓ１１で抽出された各文字列と、システム辞書１３に記載されている１以上のユーザ所望文字列との比較照合（以下、「システム辞書比較照合」と言う）を実行する（Ｓ１３）。
【００５０】
テキスト分類部１５は、Ｓ１２のユーザ辞書比較照合の結果と、Ｓ１３のシステム辞書比較照合の結果とのうち、ユーザ辞書比較照合の結果を優先的に用いて、分類対象テキスト１９に付与するカテゴリの候補を取得する（なお、必ずしもユーザ辞書比較照合の結果を優先的に用いなければならないわけではなく、システム辞書比較照合の結果を優先的に用いても良いし、優劣つけることなく双方の結果を利用しても良い）。
【００５１】
具体的には、例えば、テキスト分類部１５は、Ｓ１２のユーザ辞書比較照合の結果、複数の抽出文字列の中に、ユーザ辞書１１Ａ上のユーザ所望文字列と一致する抽出文字列がある場合は、ユーザ辞書１１Ａを参照して、その抽出文字列と一致したユーザ所望文字列に対応付けられているカテゴリを識別し、そのカテゴリを、分類対象テキスト１９に付与する候補とする（なお、この処理は、Ｓ１２の実行後、Ｓ１３の実行前に行っても良い）。
【００５２】
また、テキスト分類部１５は、Ｓ１２のユーザ辞書比較照合の結果、複数の抽出文字列の中に、ユーザ辞書１１Ａ上のユーザ所望文字列と一致する抽出文字列がなく、且つ、Ｓ１３のシステム辞書比較照合の結果、複数の抽出文字列の中に、システム辞書１３上の特徴的文字列と一致する抽出文字列がある場合は、システム辞書１３を参照して、その抽出文字列と一致した特徴的文字列に対応付けられているカテゴリを識別し、そのカテゴリを、分類対象テキスト１９に付与する候補として決定する。
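The preference described in paragraphs [0050] to [0052] (use the user dictionary result when it yields a match, otherwise fall back to the system dictionary) might look like the sketch below. The function name and the dictionary layout (character string mapped to a set of categories) are assumptions; the description also allows the opposite priority or no priority at all.

```python
def candidate_categories(extracted, user_dictionary, system_dictionary):
    """Return {category: [basis strings]} from the first dictionary that matches.

    extracted: character strings extracted from the classification target text
    Both dictionaries map a character string to a set of categories.
    """
    for dictionary in (user_dictionary, system_dictionary):  # user dictionary first
        found = {}
        for string in extracted:
            for category in dictionary.get(string, ()):
                found.setdefault(category, []).append(string)
        if found:
            return found
    return {}  # no candidate: the text could not be classified automatically

user_dictionary = {"inkjet printer": {"category B"}}
system_dictionary = {"inkjet printer": {"category C"}, "toner": {"category G"}}
print(candidate_categories(["toner", "inkjet printer"],
                           user_dictionary, system_dictionary))
```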
【００５３】
以上の処理によって、付与される候補となったカテゴリ（以下、「付与候補カテゴリ」と言う）が決定したら、テキスト分類部１５は、その付与候補カテゴリと、付与候補カテゴリを分類対象テキスト１９に付与しても良いか否かの付与適否問いと等が所定のディスプレイ画面（図示せず）に表示されるための処理を実行する（Ｓ１４）。
【００５４】
具体的には、例えば、テキスト分類部１５は、Ｓ１２のユーザ辞書比較照合の結果のみに基づいて１以上の付与候補カテゴリを決定した場合には、各付与候補カテゴリが付与の候補となった根拠の文字列（以下、「根拠文字列」と言う）を含んだ長い文字列（例えば、その根拠の文字列の前後５０字を含んだ文字列）を分類対象テキスト１９から抽出し、抽出された長い文字列と、各付与候補カテゴリと、付与適否問いと、付与適否問いに対する回答を受けるためのユーザインタフェースとが表示されるための処理を実行する。
【００５５】
また、例えば、テキスト分類部１５は、Ｓ１３のシステム辞書比較照合の結果に基づいて１以上の付与候補カテゴリ（例えば、「カテゴリＧ」というカテゴリ）を決定した場合には、図７に示すように、付与候補カテゴリ「カテゴリＧ」についての根拠文字列（図７の例では「静電写真用」）を含んだ長い文字列７０を、分類対象テキスト１９から抽出する。さらに、テキスト分類部１５は、学習用テキスト群保存部３にアクセスし、付与候補カテゴリ「カテゴリＧ」に対応した学習用テキスト群内の、上記根拠文字列「静電写真用」を含んだ学習用テキスト２１Ｇ_１、２１Ｇ_３を識別する。また、テキスト分類部１５は、各学習用テキスト２１Ｇ_１、２１Ｇ_３から、上記根拠文字列「静電写真用」を含んだ長い文字列５０、６０を抽出する。そして、テキスト分類部１５は、抽出された長い文字列５０、６０、７０に基づいて、付与適否回答受付画面１００を生成してそれが表示されるための処理を実行する（勿論、このような処理は、Ｓ１２のユーザ辞書比較照合の結果のみに基づいて１以上の付与候補カテゴリを決定したときに、その付与候補カテゴリに属する学習用テキストがある場合に行なっても良い）。
【００５６】
付与適否回答受付画面１００には、分類対象テキスト１９から抽出された長い文字列７０と、各学習用テキスト２１Ｇ_１、２１Ｇ_３から抽出された長い文字列５０、６０とが、左右に並べて表示される。また、付与適否回答受付画面１００には、各学習用テキスト２１Ｇ_１、２１Ｇ_３に対応したカテゴリ「カテゴリＧ」と、今回の付与候補カテゴリ「カテゴリＧ」も表示される。更に、付与適否回答受付画面１００には、付与適否問い１１０と、付与適否問い１１０に対して肯定的な回答をする場合に操作する「ＯＫ」ボタン１１１と、付与適否問い１１０に対して否定的な回答をする場合に操作する「変更する」ボタン１１３と、ユーザ辞書１１Ａを編集したい場合に操作する「ユーザ辞書更新」ボタン１１５とが表示される。
【００５７】
この画面構成により、ユーザＡは、分類対象テキスト１９に対して「カテゴリＧ」が付与候補となった原因７０と、既存の学習用テキスト２１Ｇ_１、２１Ｇ_３に対して「カテゴリＧ」が付与された原因５０、６０とを比較することができるので、「カテゴリＧ」が分類対象テキスト１９に付与されることが適切か否かを容易に判断することができる。
【００５８】
ユーザＡは、付与適否回答受付画面１００を見て、「カテゴリＧ」が分類対象テキスト１９に付与されることが適切であると判断したならば、「ＯＫ」ボタン１１１を操作する。テキスト分類部１５は、「ＯＫ」ボタン１１１が操作されたならば（図６のＳ１５でＹ）、「カテゴリＧ」が分類対象テキスト１９に付与することで（Ｓ１７）、分類対象テキスト１９の分類を終える。このとき、テキスト分類部１５は、ユーザ辞書１１Ａ又はシステム辞書１３に対し、文字列「静電写真用」を含んだ分類対象テキスト１９に対して「カテゴリＧ」が付与された旨をフィードバックしても良い。
【００５９】
ユーザＡは、付与適否回答受付画面１００を見て、「カテゴリＧ」が分類対象テキスト１９に付与されることが適切でないと判断したならば、「変更する」ボタン１１３を操作する。テキスト分類部１５は、「変更する」ボタン１１３が操作されたならば（図６のＳ１５でＮ）、所定の操作、例えば、別の付与候補カテゴリを探す処理を実行する（Ｓ１６）。また、このとき、テキスト分類部１５は、ユーザ辞書１１Ａ又はシステム辞書１３に対し、分類対象テキスト１９が文字列「静電写真用」を含んでいたにもかかわらず「カテゴリＧ」が付与されなかった旨をフィードバックしても良い。
【００６０】
ユーザＡは、付与適否回答受付画面１００を見て、ユーザ辞書１１Ａを編集したいならば、「ユーザ辞書更新」ボタン１１５を操作する。この場合は、その操作に応答してユーザ辞書更新部（図１参照）１７が起動し、以降、ユーザＡの要求に応じてユーザ辞書１１Ａのみを編集し更新する（他人のユーザ辞書１１Ｂ、…は編集できないようにする）。すなわち、分類対象テキスト１９の分類の結果に基づいて、ユーザＡが自分で自分のユーザ辞書１１Ａに対してフィードバックをかける。
【００６１】
また、図６に示した流れにおいて、Ｓ１２とＳ１３の処理の結果、分類対象テキスト１９に含まれている複数の文字列のどれも、ユーザ辞書１１Ａにもシステム辞書１３にも全く記載されていなければ、分類対象テキスト１９に対して分類候補カテゴリが取得されることはない。その場合、テキスト分類部１５は、分類対象テキスト１９を分類することができなかった旨を表示するための処理を実行する。その後、ユーザＡからテキスト分類装置１に対して、ユーザＡ自らが分類対象テキスト１９の分類を行う旨の要求があった場合には、以下のような処理が行われる。
【００６２】
すなわち、図８に示すように、何のカテゴリも付与されなかった分類対象テキスト１９´が表示され、ユーザＡの操作に応答して、分類対象テキスト１９´中のユーザ所望の文字列（図８の例では「インクジェットプリンタ」）が識別される（例えば、ユーザＡがマウスを操作して網掛けした文字列が識別される）。その後、ユーザＡからテキスト分類装置１に対して、分類対象テキスト１９´に「カテゴリＢ」を付与する旨が入力された場合には、ユーザ辞書更新部１７が、ユーザＡのユーザ辞書１１Ａに、新たなレコードとして、ユーザＡ所望の文字列「インクジェットプリンタ」と、それに対応したカテゴリとして「カテゴリＢ」とを登録する。なお、ここでユーザ辞書１１Ａに登録できる文字列とカテゴリとの組合せは１つに限られない。また、１つの文字列に対応付けられるカテゴリの数も１つに限られない。
【００６３】
以上が、本実施形態についての説明である。上述の実施形態は、テキストに限らず画像等を含んだ他の種類の文書の分類にも利用可能であるし、また、種々の目的に応じた文書分類に利用することもできる。具体的には、例えば、上記実施形態において、学習用テキストを、既に国際特許分類が付与されている特許明細書とし、カテゴリを国際特許分類とし、分類対象テキストを国際特許分類が付与されていない特許明細書とすれば、特許明細書に対して国際特許分類を自動的に付与するが行われることとなる。
【００６４】
以上、上述した実施形態によれば、システム辞書１３やユーザ辞書１１Ａ、１１Ｂ、…を用いて、分類対象テキスト１９に対してカテゴリが付与される。システム辞書１３やユーザ辞書１１Ａ、１１Ｂ、…は、単なる統計データとは異なり、各文字列と、その文字列の写像となる１以上のカテゴリとが記載されているものである。このため、分類対象テキスト１９中に、システム辞書１３或いはユーザ辞書１１Ａ、１１Ｂ、…上の文字列と同一の文字列があれば、従来よりも高い確率で、適切なカテゴリを自動的に分類対象テキスト１９に対して付与されるようになる。
【００６５】
また、上述した実施形態によれば、図７に例示した付与適否回答受付画面１００が生成されて表示されるので、分類対象テキスト１９に対して或るカテゴリが付与候補となった原因７０と、既存の或る学習用テキストに対して或るカテゴリが付与された原因５０、６０とを比較して、その或るカテゴリが分類対象テキスト１９に付与されることが適切か否かを容易に判断することができる。それにより、分類対象テキスト１９に対して付与されるカテゴリが適切であることの確実性を高めることができ、かつ、ユーザサイドに立ったシステム構築が容易となる。
【００６６】
また、上述した実施形態によれば、ユーザは、適宜に、自分のユーザ辞書を編集することができる。このことは、ユーザ辞書のカスタマイズの支援環境としても有効である。
【００６７】
以上、本発明の好適な実施形態を説明したが、これは本発明の説明のための例示であって、本発明の範囲をこの実施形態にのみ限定する趣旨ではない。本発明は、他の種々の形態でも実施することが可能である。
【００６８】
例えば、学習用テキスト群保存部１３内を定期的に又は不定期に更新し、それに応じて、辞書作成装置５が、図５に示した処理流れを実行することにより、システム辞書１３を定期的に又は不定期に更新しても良い。
【００６９】
また、例えば、辞書作成装置５は、学習用テキスト上の複数のエリアの中から所定のエリアを判別し（特許明細書を例に言うと、例えば、「発明の名称」という文字列と「特許請求の範囲」という文字列との間のエリアを判別し）、判別されたエリア内のみから文字列を抽出して、抽出された文字列を特徴的な文字列としても良い。
【００７０】
また、例えば、ユーザは、自分で１以上のサンプル的なテキストを用意し（以下、ユーザに用意されたサンプル的なテキストを「ユーザテキスト」と言う）、そのユーザテキスト（例えば、ユーザに作成されたテキスト文書や所望のサイトからダウンロードされたＷｅｂページ等の、ユーザ辞書の更新の際に参照したテキスト）を、カテゴリ別に学習用テキスト群保存部３内に格納しても良い。これにより、分類対象テキスト１９をカテゴリ別に分類する際には、予め用意されている学習用テキストのみならず、ユーザが用意したユーザテキストからも、分類候補カテゴリの決定の根拠となった言葉（以下、「根拠ワード」と言う）が含まれている文字列が抜き出されて、その文字列と、分類対象テキスト１９上の根拠ワードが含まれている文字列とが、付与適否回答受付画面１００上に並べて表示される。このため、より一層、容易且つ正確に、分類対象テキスト１９に付与されるカテゴリを決定することができる。なお、その際、学習用テキストとユーザテキストとの両方或いは片方だけが用いられても良いし、どちらのテキストを優先的に用いるかの優先度を予めユーザが設定しておいて、その優先度に基づいて学習用テキスト及びユーザテキストのいずれかが自動で選択的に用いられても良い。後者の場合は、例えば、先に用いられた一方のテキスト（例えばユーザテキスト）上に根拠ワードが存在しなければ、自動的に、他方のテキスト（例えば学習用テキスト）上から根拠ワードが検索されても良い。
【図面の簡単な説明】
【図１】本発明の一実施形態に係るシステムの全体構成を示すブロック図。
【図２】学習用テキスト群保存部３内の様子を示す図。
【図３】システム辞書１３の構成を示す図。
【図４】ユーザ辞書１１Ａ、１１Ｂ、…の構成を示す図。
【図５】辞書作成装置５がシステム辞書１３を作成する際の処理流れを示す図。
【図６】テキスト分類部１５が分類対象テキスト１９を分類する際の処理流れを示す図。
【図７】Ｓ１３のシステム辞書比較照合の結果に基づいて１以上の付与候補カテゴリを決定した場合に、付与適否問い等が表示される際に行われる処理の内容を説明するための図。
【図８】分類対象テキスト１９に対して付与候補カテゴリが選ばれなかった場合にユーザ辞書が更新されるときのことを説明するための図。
【符号の説明】
１　テキスト分類装置
３　学習用テキスト群保存部
５　辞書作成装置
７　システム辞書記憶部
９　ユーザ辞書記憶部
１１Ａ、１１Ｂ、…　ユーザ辞書
１３　システム辞書
１５　テキスト分類部
１７　ユーザ辞書更新部
１９　分類対象テキスト
[0001]
[Technical field of the invention]
The present invention relates to a novel technique for creating an electronic dictionary for document classification based on a plurality of sample documents, and a novel document classification technique using the electronic dictionary.
[0002]
[Prior art]
Text classification systems that automatically classify documents consisting only of characters (hereinafter referred to as “text”) are known. A conventional text classification system classifies text automatically as follows.
[0003]
That is, a conventional text classification system is provided, for each of a plurality of text categories, with a learning text group consisting of many learning texts (sample texts) to which that category has been assigned. For each category, the system extracts the words from the learning text group and processes them statistically to create word statistical data (for example, data indicating the distribution of each word in that category). When a text to be classified (hereinafter referred to as a classification target text) is input, the system extracts the words from it and assigns a category to the classification target text on the basis of the extracted words and the word statistical data created for each category, thereby classifying the text automatically.
[0004]
[Problems to be solved by the invention]
The conventional text classification system has the following problems, for example.
[0005]
That is, because classification is based on statistical data, the following problems arise, for example: (1) constructing the dictionary and determining the classification require an enormous amount of computation; (2) statistical data are merely numerical values, so it is difficult to express the features of a classification accurately; (3) a human cannot obtain from the system the information needed to judge whether a classification result is correct; and (4) the content of the classification target text must be remembered accurately, so a person who does not know that content cannot judge at all whether the assigned category is appropriate.
[0006]
Furthermore, because conventional automatic classification is performed solely on the basis of statistical data created for each category, feeding back the category that a human finally assigned to a classification target text and rebuilding the dictionary would take an enormous amount of processing time, and is therefore difficult in practice.
[0007]
It is considered that the above problems exist not only in text but also in other types of documents including images and line drawings.
[0008]
Therefore, an object of the present invention is to improve the certainty that an appropriate category is assigned to a document to be classified.
[0009]
Another object of the present invention is to make it possible to easily determine whether a category assigned to a document to be classified is appropriate.
[0010]
Still another object of the present invention is to make it possible to feed back, as appropriate, the category that a human has determined for a document to be classified to a system that classifies documents automatically.
[0011]
[Means for Solving the Problems]
A system according to a first aspect of the present invention is a system for creating an electronic dictionary for document classification, and includes relevance determination means and dictionary creation means. The relevance determination means receives a plurality of sample documents belonging to each of a plurality of document categories and, for each category, determines the degree of association between that category and each of the many character strings contained in the sample documents. On the basis of that determination, the dictionary creation means selects from the many character strings each specific character string whose degree of association is at or above a predetermined degree, and creates an electronic dictionary in which each selected specific character string and its one or more corresponding categories are recorded.
[0012]
Here, the “document” is a document including characters, for example, a document including only characters, or a document including an image or a line drawing in addition to characters.
[0013]
A "category" may be anything that allows documents to be classified: for example, a word that humans can understand (such as "novel" or "essay") or a code made up of digits, letters, and the like.
[0014]
A “character string” is, for example, a grammatical word (that is, a minimum unit of language), a sequence of consecutive words (for example, the character string “document creation system”), a plurality of words joined by a particle (for example, the character string 「私の物」, “my thing”), or a character string made up of a plurality of particles and a plurality of words.
[0015]
In a preferred embodiment of the system according to the first aspect of the present invention, the relevance determination means determines the degree of association for each of the many character strings on the basis of (1) the total number of occurrences of that character string in the sample documents belonging to the same category and (2) either the total number of existing sample documents or the total number of occurrences of that character string in the existing sample documents.
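As a rough illustration of the first aspect, the sketch below computes a degree of association as the share of a character string's occurrences that fall in one category, then keeps the strings at or above a threshold. It assumes the second option in (2), the total occurrences across all sample documents, and it assumes the documents have already been split into character strings. The function names, the sample data, and the threshold values are hypothetical.

```python
from collections import Counter

def occurrence_ratios(groups):
    """Return {(string, category): ratio}.

    groups: {category: [document, ...]}, each document a list of strings.
    ratio = occurrences in the category / occurrences in all categories.
    """
    per_category = {cat: Counter(s for doc in docs for s in doc)
                    for cat, docs in groups.items()}
    total = Counter()
    for counts in per_category.values():
        total.update(counts)
    return {(s, cat): n / total[s]
            for cat, counts in per_category.items()
            for s, n in counts.items()}

def build_dictionary(groups, threshold):
    """Return {string: set of categories} for ratios at or above threshold."""
    dictionary = {}
    for (s, cat), ratio in occurrence_ratios(groups).items():
        if ratio >= threshold:
            dictionary.setdefault(s, set()).add(cat)
    return dictionary

groups = {
    "A": [["autofocus", "lens", "device"], ["autofocus", "device"]],
    "B": [["toner", "device"], ["toner", "lens"]],
}
print(build_dictionary(groups, threshold=0.6))
```

In this toy data "lens" is split evenly between the two categories, so it is dropped at a threshold of 0.6 and mapped to both categories at 0.5.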
[0016]
A system according to a second aspect of the present invention is a system for classifying documents, and includes dictionary acquisition means, document acquisition means, comparison and collation means, and document classification means. The dictionary acquisition means acquires an electronic dictionary in which one or more character strings corresponding to each of a plurality of document categories (for example, character strings characteristic of that category) are recorded. The document acquisition means acquires a document to be classified (for example, by receiving it from a user or by creating a document in response to a user's request). The comparison and collation means compares each character string contained in the acquired document with each character string recorded in the electronic dictionary (for example, the characteristic character strings mentioned above). If the comparison finds matching character strings, the document classification means refers to the electronic dictionary to identify the category corresponding to the matched character string, and classifies the acquired document by assigning that category to it as the assignment target category.
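The comparison and classification steps of the second aspect can be sketched as a dictionary lookup. This is a minimal sketch that assumes the dictionary maps each character string to a set of categories and that matching is a plain substring search; the function name and sample data are hypothetical.

```python
def classify(text, dictionary):
    """Return {category: [matched strings]} for a document.

    dictionary: {character string: set of categories}
    A category becomes an assignment target when any of its strings
    occurs in the text.
    """
    result = {}
    for string, categories in dictionary.items():
        if string in text:                      # comparison and collation
            for category in categories:
                result.setdefault(category, []).append(string)
    return result

dictionary = {
    "autofocus": {"A"},
    "toner": {"B"},
    "lens": {"A", "C"},
}
print(classify("an autofocus lens unit", dictionary))
```

Keeping the matched strings alongside each category makes it possible to show the user why a category was proposed.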
[0017]
In a preferred embodiment of the system according to the second aspect of the present invention, the system further includes basis character string extraction means, notification means, question output means, and answer receiving means. The basis character string extraction means extracts from the acquired document the matched character string that is the basis for assigning the assignment target category, or a first long character string containing that matched character string. The notification means notifies the user who input the document of the extracted matched character string or first long character string, together with the assignment target category. The question output means outputs to the user a question asking whether the notified assignment target category may be assigned to the acquired document. The answer receiving means receives the user's answer to that question.
[0018]
In this case, the document classification means assigns the assignment target category to the acquired document only when the answer receiving means has received an affirmative answer.
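The confirm-before-assign behaviour could be sketched as below. The width of 50 characters on each side follows the example given in the description of the embodiment; the function names and the `ask` callback, which stands in for the question and answer user interface, are hypothetical.

```python
def extract_context(text, matched, width=50):
    """Return the matched string with up to `width` characters on each side."""
    i = text.find(matched)
    if i < 0:
        return None
    return text[max(0, i - width): i + len(matched) + width]

def assign_if_confirmed(text, matched, category, ask):
    """Return the category only if the user answers the question affirmatively.

    ask: callable taking the question text and returning True or False
         (a stand-in for the real user interface).
    """
    context = extract_context(text, matched)
    if context is None:
        return None
    question = f"Assign '{category}' to this document? Basis: ...{context}..."
    return category if ask(question) else None

text = "A developer for electrostatic photography with improved toner."
print(assign_if_confirmed(text, "toner", "category G", ask=lambda q: True))
```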
[0019]
In a further preferred embodiment, the basis character string extraction means also extracts a second long character string containing the matched character string from a specific sample document, that is, one of the sample documents belonging to the categories that contains the matched character string. The notification means then notifies the user of the first long character string, the assignment target category, the second long character string, and the category to which the specific sample document belongs (for example, by displaying side by side on a predetermined screen the combination of the first long character string and the assignment target category and the combination of the second long character string and the category of the sample document that contains it).
[0020]
In a preferred embodiment, there are a plurality of electronic dictionaries, including user dictionaries that are created by individual users and dedicated to those users, and the system further includes dictionary editing means that edits the user dictionary dedicated to a user in response to a request from that user.
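One way to picture per-user dictionaries with owner-only editing is the small store below. The class, its method names, and the use of a user identifier as the key are assumptions for illustration, not details given in the text.

```python
class UserDictionaryStore:
    """Holds one dictionary per user; only the owner may edit it."""

    def __init__(self):
        self._dictionaries = {}  # user id -> {string: set of categories}

    def dictionary_for(self, user_id):
        """Return the user's dictionary, creating an empty one if needed."""
        return self._dictionaries.setdefault(user_id, {})

    def register(self, requesting_user, owner, string, category):
        """Add a string-to-category record to the owner's dictionary."""
        if requesting_user != owner:
            raise PermissionError("a user may edit only their own dictionary")
        self.dictionary_for(owner).setdefault(string, set()).add(category)

store = UserDictionaryStore()
store.register("userA", "userA", "inkjet printer", "category B")
print(store.dictionary_for("userA"))
```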
[0021]
A method according to a third aspect of the present invention is a method for creating an electronic dictionary for document classification, and has the steps of: receiving a plurality of sample documents belonging to each of a plurality of document categories and determining, for each category, the degree of association between that category and each of the many character strings contained in the sample documents; and, on the basis of that determination, selecting from the many character strings each specific character string whose degree of association is at or above a predetermined degree and creating an electronic dictionary in which each selected specific character string and its one or more corresponding categories are recorded.
[0022]
A method according to a fourth aspect of the present invention is a method for classifying documents, and has the steps of: acquiring an electronic dictionary in which two or more character strings characteristic of each of a plurality of document categories are recorded in correspondence with that category; acquiring a document to be classified; comparing each character string contained in the acquired document with each characteristic character string recorded in the electronic dictionary; and, if the comparison finds matching character strings, referring to the electronic dictionary to identify the category corresponding to the matched character string and classifying the acquired document by assigning that category to it as the assignment target category.
[0023]
A computer program according to a fifth aspect of the present invention is a computer program for creating an electronic dictionary for document classification, and causes a computer to execute the steps of: receiving a plurality of sample documents belonging to each of a plurality of document categories and determining, for each category, the degree of association between that category and each of the many character strings contained in the sample documents; and, on the basis of that determination, selecting from the many character strings each specific character string whose degree of association is at or above a predetermined degree and creating an electronic dictionary in which each selected specific character string and its one or more corresponding categories are recorded.
[0024]
A computer program according to a sixth aspect of the present invention is a computer program for classifying documents, and causes a computer to execute the steps of: acquiring an electronic dictionary that contains character strings corresponding to each of a plurality of document categories (for example, one in which two or more character strings characteristic of each category are recorded); acquiring a document to be classified; comparing each character string contained in the acquired document with each characteristic character string recorded in the electronic dictionary; and, if the comparison finds matching character strings, referring to the electronic dictionary to identify the category corresponding to the matched character string and classifying the acquired document by assigning that category to it as the assignment target category.
[0025]
The data according to a seventh aspect of the present invention is an electronic dictionary for document classification that contains, for each of a plurality of document categories, one or more character strings regarded or estimated to be characteristic of that category. One or more character strings are associated with each category. For example, two different character strings may each be associated with the same category, or one character string may be associated with two different categories. In the former case, that category can be assigned whichever of the two character strings appears; in the latter case, both categories can be assigned whenever that character string appears.
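The two cases in this paragraph can be made concrete with a small mapping. The layout (character string mapped to a set of categories) and the sample entries are assumptions chosen for illustration.

```python
# Two strings share one category; one string carries two categories.
dictionary = {
    "autofocus": {"category A"},
    "focus adjustment": {"category A"},
    "lens": {"category A", "category B"},
}

def categories_for(found_strings, dictionary):
    """Union of the categories of every dictionary string that was found."""
    categories = set()
    for string in found_strings:
        categories |= dictionary.get(string, set())
    return categories

def strings_by_category(dictionary):
    """Invert the mapping: {category: set of characteristic strings}."""
    inverted = {}
    for string, categories in dictionary.items():
        for category in categories:
            inverted.setdefault(category, set()).add(string)
    return inverted

print(categories_for(["focus adjustment"], dictionary))
print(strings_by_category(dictionary)["category B"])
```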
[0026]
The plurality of means provided in each system of the present invention described above can be mounted on one computer system, or can be mounted separately on a plurality of distributed computer systems.
[0027]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0028]
FIG. 1 shows an overall configuration of a system according to an embodiment of the present invention.
[0029]
In this embodiment, a learning text group storage unit 3, a dictionary creation device 5, and a text classification device 1 are provided. Note that the dictionary creation device 5 and the text classification device 1 may each be a physically independent computer, or may each be an application program, in which case both may be installed on one computer or installed separately on two computers.
[0030]
The learning text group storage unit 3 stores a plurality of learning text groups 21A, 21B, ..., as shown for example in FIG. 2. Each learning text group, for example the learning text group 21A, consists of a plurality of learning texts 21A_1, 21A_2, 21A_3, ... to which the same text category (for example, “category A”) has been assigned (the same applies to the other learning text groups 21B, ...). As a result, each learning text group is associated with exactly one of the plurality of categories. As already described, a “learning text” is a sample text to which a category has already been assigned (for example, a text that was classified in the past). A learning text to which two or more categories have been assigned is present in each of the learning text groups corresponding to those categories.
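The grouping described here, including the rule that a text carrying two categories belongs to both groups, could be sketched as follows. The function name and the input shape (pairs of text and category list) are assumptions.

```python
def group_by_category(labelled_texts):
    """Build {category: [texts]} from (text, categories) pairs.

    A text with several categories is placed in every matching group.
    """
    groups = {}
    for text, categories in labelled_texts:
        for category in categories:
            groups.setdefault(category, []).append(text)
    return groups

labelled_texts = [
    ("text 1", ["category A"]),
    ("text 2", ["category A", "category B"]),
]
print(group_by_category(labelled_texts))
```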
[0031]
The dictionary creation device 5 extracts each character string from each learning text group stored in the learning text group storage unit 3, calculates the importance of each extracted character string in the category corresponding to it, and creates an electronic dictionary for text classification on the basis of the calculated importance. The created dictionary is output as a system dictionary 13 described later (for this reason, the electronic dictionary is hereinafter referred to as a “system dictionary”).
[0032]
The text classification device 1 can be used by a plurality of users. The text classification device 1 acquires a text to be classified (hereinafter, “classification target text”) 19 by a specific method and, based on the system dictionary 13 created by the dictionary creation device 5 and the electronic dictionaries created by the plurality of users (hereinafter, “user dictionaries”) 11A, 11B, ..., assigns at least one of the plurality of categories described above to the acquired classification target text 19. The text classification device 1 includes a system dictionary storage unit 7, a user dictionary storage unit 9, a text classification unit 15, and a user dictionary update unit 17.
[0033]
The system dictionary storage unit 7 stores the system dictionary 13 created and output by the dictionary creation device 5 (the stored system dictionary 13 is hashed, for example). For example, as shown in FIG. 3, the system dictionary 13 describes each characteristic character string (indicated as “phrase” in FIG. 3) and the one or more categories corresponding to it (a “characteristic character string” is a character string that the dictionary creation device 5 has estimated, by a method described later, to be characteristic of the category). The system dictionary 13 can be loaded into the system dictionary storage unit 7 of the text classification device 1 by a specific method (for example, via a portable recording medium such as a floppy (registered trademark) disk or MO disk, or over a wired or wireless communication system).
[0034]
As shown in FIG. 4, for example, the user dictionary storage unit 9 stores user dictionaries created by a plurality of users (specifically, for example, the user dictionary 11A of user A, the user dictionary 11B of user B, ...) (the stored user dictionaries 11A, 11B, ... are, for example, hashed). Each user dictionary, for example the user dictionary 11A, describes one or more character strings desired by user A (indicated as “words” in FIG. 4; hereinafter, “user desired character strings”) and the one or more categories corresponding to each user desired character string (the other user dictionaries 11B, ... have the same configuration). Each of the user dictionaries 11A, 11B, ... can be created by using a predetermined application program installed in the text classification device 1, or can be stored in the user dictionary storage unit 9 by being input to the text classification device 1 from an external source (not shown), for example a personal computer or a portable storage medium storing a user dictionary output from it.
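The dictionary layout described above can be pictured with a minimal sketch, assuming each dictionary is held as a mapping from a character string to a set of categories. The helper name `make_dictionary`, the user keys, and the entries marked hypothetical are illustrative assumptions; the actual contents appear only in FIG. 3 and FIG. 4.

```python
from collections import defaultdict


def make_dictionary(records):
    """Build a mapping: character string -> set of categories.

    records is an iterable of (character_string, category) pairs, so one
    string may map to several categories and one category to several strings.
    """
    dictionary = defaultdict(set)
    for character_string, category in records:
        dictionary[character_string].add(category)
    return dict(dictionary)


# System dictionary 13: characteristic character strings and their categories.
system_dictionary = make_dictionary([
    ("automatic focus adjustment", "category A"),  # category is hypothetical
    ("for electrostatic photography", "category G"),
])

# User dictionaries 11A, 11B, ...: one dictionary per user.
user_dictionaries = {
    "user A": make_dictionary([
        ("inkjet printer", "category B"),
        ("inkjet printer", "category C"),  # hypothetical second category
    ]),
    "user B": make_dictionary([]),
}
```

A plain Python `dict` is already a hash table, which is one possible reading of the remark that the stored dictionaries are hashed.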
[0035]
The text classification unit 15 acquires the classification target text 19 by a specific method and, based on the system dictionary 13 stored in the system dictionary storage unit 7 and the user dictionaries 11A, 11B, ... stored in the user dictionary storage unit 9, identifies from among the plurality of categories described above a category to be assigned to the acquired classification target text 19 (hereinafter, “assignment target category”). The text classification unit 15 then displays a question asking whether the identified assignment target category may be assigned to the classification target text 19 (hereinafter, “assignment suitability question”). When the text classification unit 15 receives an affirmative answer to the assignment suitability question, it assigns the assignment target category to the classification target text 19; when it receives a negative answer, it executes a predetermined process (for example, a process of searching for another category that can be an assignment target).
[0036]
The user dictionary updating unit 17 receives a request from one of the plurality of users and, in response to that request, edits and updates the user dictionary corresponding to that user among the one or more user dictionaries 11A, 11B, ... stored in the user dictionary storage unit 9.
[0037]
The above is the outline of the system according to the present embodiment. Hereinafter, the processing flow in this system, in other words, the processing contents performed by the dictionary creation device 5, the text classification unit 15, and the user dictionary update unit 17 will be described in detail.
[0038]
FIG. 5 shows a processing flow when the dictionary creation device 5 creates the system dictionary 13. In addition, in FIG. 5 and subsequent drawings and the following description, “step” is abbreviated as “S”.
[0039]
The dictionary creation device 5 is activated when a predetermined event occurs (for example, when a request is received from a user). First, the dictionary creation device 5 extracts each character string from each learning text included in each learning text group 21A, 21B, ... in the learning text group storage unit 3 (S1). Here, a “character string” may be a grammatical word (that is, a minimum unit of language), a sequence of consecutive words (for example, “automatic focus adjustment” as shown in FIG. 3), a plurality of words connected via a particle (for example, “mine”), or a character string made up of a plurality of particles and a plurality of words (for example, “I filed a patent application”).
[0040]
Next, the dictionary creation device 5 calculates, for each of the extracted character strings, its importance in the category to which the learning text containing the character string belongs (hereinafter, “corresponding category”), in other words, the degree of association between each extracted character string and its corresponding category. Specifically, for example, for each extracted character string, the dictionary creation device 5 calculates the ratio of the character string's appearances in the corresponding category to its appearances overall (hereinafter, “character string appearance ratio”) (S2), using the following equation (1):
Character string appearance ratio = (appearance frequency in the learning text group corresponding to the corresponding category) / (appearance frequency in all learning text groups) (1)
Here, “appearance frequency” refers to the number of appearances. Equation (1) is only an example; other calculation methods, not specifically described here, may also be used.
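Equation (1) can be illustrated with a short sketch, assuming the character strings of each learning text have already been extracted in S1 and are supplied as lists. The function name `appearance_ratios` and the input layout are assumptions made for this illustration, not part of the embodiment.

```python
from collections import Counter


def appearance_ratios(learning_groups):
    """Compute the character string appearance ratio of equation (1).

    learning_groups maps a category to its learning text group, where each
    learning text is a list of already-extracted character strings.
    Returns {(character_string, category): ratio}.
    """
    # Numerator: appearances of each string inside one category's group.
    per_category = {
        category: Counter(s for text in texts for s in text)
        for category, texts in learning_groups.items()
    }
    # Denominator: appearances of each string across all groups.
    total = Counter()
    for counts in per_category.values():
        total.update(counts)
    return {
        (s, category): count / total[s]
        for category, counts in per_category.items()
        for s, count in counts.items()
    }


groups = {
    "category A": [["lens", "automatic focus adjustment"], ["lens"]],
    "category B": [["lens"], ["inkjet printer"]],
}
ratios = appearance_ratios(groups)
```

In this toy input, “lens” appears twice in category A and once in category B, so its ratio for category A is 2/3, while a string that appears in only one category has a ratio of 1.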
[0041]
The dictionary creation device 5 determines, for each of the extracted character strings, whether the calculated character string appearance ratio exceeds a predetermined threshold. The dictionary creation device 5 then sets each character string for which the determination is positive (that is, each character string whose character string appearance ratio exceeds the predetermined threshold) as a candidate to be registered in the system dictionary (S3) (hereinafter, such a character string is referred to as a “registration candidate character string”).
[0042]
Here, among the registration candidate character strings, there may be some whose character string appearance ratio exceeded the threshold even though the population used in calculating it (that is, the value of the denominator in equation (1) above) was no more than a predetermined number. Therefore, for each registration candidate character string whose population was equal to or less than the predetermined number when its character string appearance ratio was calculated, the dictionary creation device 5 performs a population ratio test and determines whether it is appropriate to register the registration candidate character string in the system dictionary 13 (S4).
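The text does not specify which population ratio test is used in S4. As one possible sketch, assuming an exact one-sided binomial test in which the null hypothesis is that the true ratio equals the threshold, S3 and S4 could be combined as follows. The function names, the `min_population` parameter, and the significance level `alpha` are all assumptions for this illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """Return P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def is_valid_candidate(k, n, threshold, min_population, alpha=0.05):
    """Decide whether a character string should be registered.

    k: appearances in the corresponding category (numerator of equation (1)).
    n: appearances in all learning text groups (denominator); n >= 1.
    """
    if k / n <= threshold:
        return False  # S3: the ratio must exceed the threshold
    if n > min_population:
        return True  # large population: the S4 test is not needed
    # S4 (assumed form): keep the candidate only if so many appearances
    # would be unlikely when the true ratio is just the threshold.
    return binomial_upper_tail(k, n, threshold) < alpha
```

Under these assumptions, a string seen 3 times out of 3 with a threshold of 0.8 is rejected, because 0.8 cubed is 0.512 and such a result is not unusual, whereas a string seen 30 times out of 30 skips the test when `min_population` is 10.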
[0043]
Thereafter, the dictionary creation device 5 records, in a predetermined file serving as the system dictionary 13, the registration candidate character strings determined to be appropriate in S4 and the registration candidate character strings that were not subjected to the determination in S4 (that is, those for which the determination in S4 was unnecessary), together with their corresponding categories (S5).
[0044]
By completing the above processing flow, a system dictionary 13 is created in which only character strings whose importance in a category is equal to or more than a certain value (in other words, character strings estimated to be characteristic of the category) are recorded together with their corresponding categories. The text classification unit 15 classifies the classification target text 19 based on the plurality of character strings described in the system dictionary 13 (hereinafter, “characteristic character strings”) and the plurality of user desired character strings described in the user dictionaries 11A, 11B, ....
[0045]
FIG. 6 shows a processing flow when the text classification unit 15 classifies the classification target text 19.
[0046]
The classification target text 19 is input to the text classification device 1 by a user using a specific method (S10). Here, the “specific method” means, for example, creating a new classification target text in the text classification device 1 in response to a user request, or inputting it via a portable recording medium such as a floppy (registered trademark) disk or via a wired or wireless communication system.
[0047]
When the classification target text 19 is input, the text classification unit 15 extracts each character string described in the classification target text 19 (S11).
[0048]
Then, the text classification unit 15 accesses the user dictionary storage unit 9, refers to the user dictionary 11A of the user who input the classification target text 19 (hereinafter, “user A”), and compares and collates each character string extracted in S11 (hereinafter, “extracted character string”) with the one or more user desired character strings described in the user dictionary 11A (hereinafter, “user dictionary comparison and collation”) (S12). Which of the user dictionaries 11A, 11B, ... belongs to user A can be determined by a predetermined method (for example, a user identification code is associated with each of the user dictionaries 11A, 11B, ..., and when user A enters his or her user identification code, the entered code is compared with the codes associated with the user dictionaries 11A, 11B, ... to identify the user dictionary 11A). In S12, the text classification unit 15 may also, for example, have a dictionary shared by a group of users, or create a group dictionary that the users of the group can use, and execute the user dictionary comparison and collation using at least one of the shared dictionary and the group dictionary.
[0049]
After S12, the text classification unit 15 accesses the system dictionary storage unit 7, refers to the system dictionary 13, and compares and collates each character string extracted in S11 with the one or more characteristic character strings described in the system dictionary 13 (hereinafter, “system dictionary comparison and collation”) (S13).
[0050]
Of the result of the user dictionary comparison and collation in S12 and the result of the system dictionary comparison and collation in S13, the text classification unit 15 preferentially uses the former to obtain candidates for the category to be assigned to the classification target text 19. (The result of the user dictionary comparison and collation does not necessarily have to take priority: the result of the system dictionary comparison and collation may be used preferentially, or both results may be used without distinction.)
[0051]
Specifically, for example, if the user dictionary comparison and collation in S12 shows that one of the extracted character strings matches a user desired character string in the user dictionary 11A, the text classification unit 15 refers to the user dictionary 11A, identifies the category associated with that user desired character string, and sets the category as a candidate to be assigned to the classification target text 19 (this processing may be performed after S12 and before S13).
[0052]
If, on the other hand, the user dictionary comparison and collation in S12 shows that none of the extracted character strings matches a user desired character string in the user dictionary 11A, and the system dictionary comparison and collation in S13 shows that one of the extracted character strings matches a characteristic character string in the system dictionary 13, the text classification unit 15 refers to the system dictionary 13, identifies the category associated with that characteristic character string, and sets the category as a candidate to be assigned to the classification target text 19.
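The candidate selection in S12 and S13, with the user dictionary taking priority, might be sketched as follows. This assumes the dictionaries are mappings from a character string to a set of categories and that S11 has already produced a list of extracted character strings; the function name `candidate_categories` and its return shape are hypothetical.

```python
def candidate_categories(extracted, user_dictionary, system_dictionary):
    """Return (categories, grounds, source) for a classification target text.

    grounds maps each matched character string to its categories, and
    source tells which dictionary produced the match ("user" or "system").
    """
    # The user dictionary is consulted first; the system dictionary is
    # used only when the user dictionary yields no match.
    for source, dictionary in (("user", user_dictionary),
                               ("system", system_dictionary)):
        grounds = {s: dictionary[s] for s in extracted if s in dictionary}
        if grounds:
            categories = set().union(*grounds.values())
            return categories, grounds, source
    return set(), {}, None  # no assignment candidate category was found


user_dictionary = {"inkjet printer": {"category B"}}
system_dictionary = {"for electrostatic photography": {"category G"}}
result = candidate_categories(
    ["toner", "for electrostatic photography"],
    user_dictionary,
    system_dictionary,
)
```

Reversing the order of the tuple, or merging the two `grounds` mappings, would give the alternative policies that the embodiment also allows.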
[0053]
When the category that is a candidate to be assigned (hereinafter, “assignment candidate category”) has been determined by the above processing, the text classification unit 15 executes a process for displaying, on a predetermined display screen (not shown), the assignment candidate category and a question asking whether the assignment candidate category may be assigned to the classification target text 19 (S14).
[0054]
Specifically, for example, when the text classification unit 15 determines one or more assignment candidate categories based only on the result of the user dictionary comparison and collation in S12, it extracts from the classification target text 19 a long character string (for example, a character string that includes the 50 characters before and after the ground character string) containing the character string that caused each assignment candidate category to become a candidate (hereinafter, “ground character string”). It then executes a process for displaying the extracted long character string, each assignment candidate category, the assignment suitability question, and a user interface for receiving an answer to the assignment suitability question.
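Extraction of the long character string can be sketched as a simple slice around the ground character string. The function name `long_character_string` is hypothetical, and using only the first occurrence of the ground character string is an assumption; the text does not say how multiple occurrences are handled.

```python
def long_character_string(text, ground, margin=50):
    """Return the ground string plus up to `margin` characters on each side.

    Returns None when the ground character string does not occur in text.
    Only the first occurrence is used (an assumption of this sketch).
    """
    start = text.find(ground)
    if start == -1:
        return None
    begin = max(0, start - margin)                      # clip at the start of the text
    end = min(len(text), start + len(ground) + margin)  # clip at the end of the text
    return text[begin:end]


sample = "A toner for electrostatic photography that fixes at low temperature."
context = long_character_string(sample, "for electrostatic photography")
```

Because the sample is shorter than the margin on both sides, the whole sentence is returned here; on a long specification the result is a window of at most 50 characters on each side of the ground character string.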
[0055]
For example, when the text classification unit 15 determines one or more assignment candidate categories (for example, “category G”) based on the result of the system dictionary comparison and collation in S13, then, as illustrated in FIG. 7, it extracts from the classification target text 19 the long character string 70 containing the ground character string (“for electrostatic photography” in the example of FIG. 7) for the assignment candidate category “category G”. Further, the text classification unit 15 accesses the learning text group storage unit 3 and identifies, in the learning text group corresponding to the assignment candidate category “category G”, the learning texts 21G₁ and 21G₃ that contain the ground character string “for electrostatic photography”. It then extracts from the learning texts 21G₁ and 21G₃ the long character strings 50 and 60 that contain the ground character string “for electrostatic photography”. Finally, the text classification unit 15 generates the assignment appropriateness response reception screen 100 based on the extracted long character strings 50, 60, and 70 and executes a process for displaying it (such processing may, of course, also be performed when one or more assignment candidate categories are determined based only on the result of the user dictionary comparison and collation in S12 and a learning text belonging to the assignment candidate category exists).
[0056]
On the assignment appropriateness response reception screen 100, the long character string 70 extracted from the classification target text 19 and the long character strings 50 and 60 extracted from the learning texts 21G₁ and 21G₃ are displayed side by side on the left and right. In addition, the category “category G” corresponding to the learning texts 21G₁ and 21G₃ is also displayed. The screen further displays the assignment suitability question 110, an “OK” button 111 operated to give an affirmative answer to the assignment suitability question 110, a “change” button 113 operated to give a negative answer, and an “update user dictionary” button 115 operated when the user wants to edit the user dictionary 11A.
[0057]
With this screen configuration, user A can compare the cause 70 for which “category G” became a candidate for the classification target text 19 with the causes 50 and 60 for which “category G” was assigned to the existing learning texts 21G₁ and 21G₃, and can therefore easily determine whether it is appropriate to assign “category G” to the classification target text 19.
[0058]
User A operates the “OK” button 111 upon viewing the assignment appropriateness response reception screen 100 and judging that it is appropriate to assign “category G” to the classification target text 19. When the “OK” button 111 is operated (Y in S15 of FIG. 6), the text classification unit 15 assigns “category G” to the classification target text 19 (S17) and thereby finishes classifying the classification target text 19. At this time, the text classification unit 15 may also feed back to the user dictionary 11A or the system dictionary 13 the fact that “category G” was assigned to a classification target text 19 containing the character string “for electrostatic photography”.
[0059]
User A operates the “change” button 113 upon viewing the assignment appropriateness response reception screen 100 and judging that it is not appropriate to assign “category G” to the classification target text 19. When the “change” button 113 is operated (N in S15 of FIG. 6), the text classification unit 15 executes a predetermined operation, for example a process of searching for another assignment candidate category (S16). At this time, the text classification unit 15 may feed back to the user dictionary 11A or the system dictionary 13 the fact that “category G” was not assigned even though the classification target text 19 contains the character string “for electrostatic photography”.
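The question-and-answer loop of S14 to S17 can be sketched as below. This assumes that “searching for another assignment candidate category” in S16 simply means moving on to the next candidate in a list, which is only one possible reading; the callback `ask` stands in for the “OK” and “change” buttons and is hypothetical.

```python
def confirm_and_assign(candidates, ask):
    """Offer each assignment candidate category until one is accepted.

    candidates: assignment candidate categories in the order to be offered.
    ask(category): returns True for "OK" (Y in S15), False for "change" (N in S15).
    Returns the assigned category, or None if every candidate was declined.
    """
    for category in candidates:
        if ask(category):
            return category  # S17: the category is assigned to the text
        # S16 (assumed behaviour): continue with another candidate
    return None


# Example: a user who declines "category G" and accepts "category H".
assigned = confirm_and_assign(
    ["category G", "category H"],
    lambda category: category == "category H",
)
```

No category is ever assigned without an affirmative answer, which matches the behaviour required of the text classification unit 15.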
[0060]
User A operates the “update user dictionary” button 115 if, while viewing the assignment appropriateness response reception screen 100, he or she wants to edit the user dictionary 11A. In this case, the user dictionary updating unit 17 (see FIG. 1) is started in response to the operation, and thereafter only the user dictionary 11A is edited and updated according to the requests of user A (the user dictionaries 11B, ... of other users cannot be edited). That is, user A feeds the result of classifying the classification target text 19 back into his or her own user dictionary 11A.
[0061]
In the flow shown in FIG. 6, if the processing in S12 and S13 shows that none of the character strings included in the classification target text 19 is described in either the user dictionary 11A or the system dictionary 13, no assignment candidate category is obtained for the classification target text 19. In this case, the text classification unit 15 executes a process for displaying a message that the classification target text 19 could not be classified. Thereafter, when user A requests the text classification device 1 to let user A classify the classification target text 19 himself or herself, the following processing is performed.
[0062]
That is, as shown in FIG. 8, the classification target text 19' to which no category has been assigned is displayed, and in response to an operation by user A, a character string desired by the user in the classification target text 19' (“inkjet printer” in the example of FIG. 8) is identified (for example, the character string highlighted by user A with the mouse is identified). After that, when user A inputs to the text classification device 1 that “category B” is to be assigned to the classification target text 19', the user dictionary update unit 17 registers, as a new record in the user dictionary 11A of user A, the character string “inkjet printer” desired by user A and “category B” as its corresponding category. Note that the number of combinations of character strings and categories that can be registered in the user dictionary 11A is not limited to one, and the number of categories associated with one character string is not limited to one.
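Registration of a new record by the user dictionary update unit 17 might look like the following sketch, assuming the user dictionaries are held as a mapping from a user identifier to a mapping from a character string to a set of categories. The function name `register_entry` and the identifier `"user A"` are hypothetical.

```python
def register_entry(user_dictionaries, user_id, selected_string, category):
    """Add (selected_string, category) to the requesting user's own dictionary.

    Only the dictionary keyed by user_id is touched, so the dictionaries of
    other users are never modified by this call.
    """
    dictionary = user_dictionaries.setdefault(user_id, {})
    dictionary.setdefault(selected_string, set()).add(category)
    return dictionary


user_dictionaries = {"user A": {}, "user B": {}}
register_entry(user_dictionaries, "user A", "inkjet printer", "category B")
```

Calling `register_entry` again with the same character string and a different category adds a second category to the existing record, reflecting that one character string may be associated with more than one category.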
[0063]
The above is the description of the present embodiment. The above-described embodiment can be used to classify not only texts but also other types of documents, including documents containing images, and can be used to classify documents for various purposes. Specifically, for example, if in the above embodiment the learning texts are patent specifications to which an international patent classification has already been assigned, the categories are international patent classifications, and the classification target text is a patent specification to which no international patent classification has been assigned, an international patent classification is automatically assigned to that patent specification.
[0064]
As described above, according to the above-described embodiment, a category is assigned to the classification target text 19 using the system dictionary 13 and the user dictionaries 11A, 11B, .... Unlike mere statistical data, the system dictionary 13 and the user dictionaries 11A, 11B, ... describe each character string and the one or more categories to which the character string maps. Therefore, if the classification target text 19 includes a character string that is the same as one in the system dictionary 13 or the user dictionaries 11A, 11B, ..., an appropriate category is automatically assigned to the classification target text 19 with a higher probability than before.
[0065]
Further, according to the above-described embodiment, the assignment appropriateness response reception screen 100 illustrated in FIG. 7 is generated and displayed. By comparison with the causes 50 and 60 for which a certain category was assigned to existing learning texts, the user can easily determine whether it is appropriate to assign that category to the classification target text 19. As a result, the certainty that the category assigned to the classification target text 19 is appropriate can be increased, and the system is easy to set up on the user side.
[0066]
Further, according to the above-described embodiment, the user can appropriately edit his / her own user dictionary. This is also effective as a support environment for customizing the user dictionary.
[0067]
The preferred embodiment of the present invention has been described above, but this is an exemplification for describing the present invention, and is not intended to limit the scope of the present invention only to this embodiment. The present invention can be implemented in other various forms.
[0068]
For example, the content of the learning text group storage unit 3 may be updated periodically or irregularly, and the dictionary creation device 5 may execute the processing flow shown in FIG. 5 periodically or irregularly so that the system dictionary 13 is also updated periodically or irregularly.
[0069]
Further, for example, the dictionary creation device 5 may determine a predetermined area from among a plurality of areas in the learning text (for example, in the case of a patent specification, the area between the character string “name of invention” and the character string “claim range”), extract character strings only from the determined area, and use the extracted character strings as characteristic character strings.
[0070]
In addition, for example, the user may prepare one or more sample texts himself or herself (hereinafter, a sample text prepared by the user is referred to as a “user text”), and the user texts (for example, texts referred to when updating the user dictionary, such as a text document created by the user or a Web page downloaded from a desired site) may be stored in the learning text group storage unit 3 for each category. In that case, when the classification target text 19 is classified, a character string containing the word that is the basis for determining an assignment candidate category (hereinafter, “ground word”) is extracted not only from the learning texts prepared in advance but also from the user texts prepared by the user, and that character string and the character string containing the ground word in the classification target text 19 are displayed side by side on the assignment appropriateness response reception screen 100. This makes it possible to determine the category assigned to the classification target text 19 more easily and accurately. Both or only one of the learning texts and the user texts may be used, or the user may set in advance a priority indicating which texts are to be used preferentially, and one of the learning texts and the user texts may be selected automatically based on that priority. In the latter case, for example, if the ground word is not found in the texts used first (for example, the user texts), it may be searched for automatically in the other texts (for example, the learning texts).
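The priority-based fallback between user texts and learning texts could be sketched as follows. The function name `find_ground_contexts`, the source labels `"user"` and `"learning"`, and the 50-character margin carried over from the earlier example are assumptions for this illustration.

```python
def find_ground_contexts(ground, text_sources, priority, margin=50):
    """Search text sources in priority order for the ground word.

    text_sources: maps a source label to a list of texts for one category.
    priority: source labels in the order the user wants them consulted.
    Returns (label, contexts) for the first source containing the ground
    word, or (None, []) when no source contains it.
    """
    for label in priority:
        contexts = []
        for text in text_sources.get(label, []):
            start = text.find(ground)
            if start != -1:
                begin = max(0, start - margin)
                contexts.append(text[begin:start + len(ground) + margin])
        if contexts:
            return label, contexts  # later sources are not consulted
    return None, []


sources = {
    "user": ["notes about ink cartridges"],
    "learning": ["a toner for electrostatic photography use"],
}
label, contexts = find_ground_contexts(
    "for electrostatic photography", sources, ["user", "learning"]
)
```

Here the user texts are consulted first, and the learning texts are searched only because the ground word is absent from every user text.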
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a state in a learning text group storage unit 3.
FIG. 3 is a diagram showing a configuration of a system dictionary 13.
FIG. 4 is a diagram showing a configuration of user dictionaries 11A, 11B, ....
FIG. 5 is a diagram showing a processing flow when the dictionary creation device 5 creates a system dictionary 13.
FIG. 6 is a diagram showing a processing flow when the text classification unit 15 classifies the classification target text 19.
FIG. 7 is a diagram for explaining the contents of processing performed when an assignment suitability question or the like is displayed when one or more assignment candidate categories are determined based on the result of the system dictionary comparison / collation in S13.
FIG. 8 is a view for explaining a case where a user dictionary is updated when an assignment candidate category is not selected for a classification target text 19.
[Explanation of symbols]
1 Text classification device
3 Learning text group storage unit
5 Dictionary creation device
7 System dictionary storage unit
9 User dictionary storage unit
11A, 11B, ... User dictionary
13 System dictionary
15 Text classification unit
17 User dictionary update unit
19 Classification target text

Claims

A system for creating an electronic dictionary for document classification,
comprising: association degree determination means for receiving a plurality of sample documents belonging to each of a plurality of categories of the document and determining, for each of a large number of character strings included in the plurality of sample documents, the degree of association between the character string and the category; and
dictionary creating means for selecting, based on the result of the determination, each specific character string whose degree of association is equal to or higher than a predetermined degree from among the large number of character strings, and creating an electronic dictionary that contains each of the selected specific character strings and the one or more categories corresponding to it.

The dictionary creation system according to claim 1, wherein the association degree determination means determines, for each of the large number of character strings, the degree of association of the character string based on:
(1) the total number of the character string included in the plurality of sample documents belonging to the same category; and
(2) the total number of existing sample documents or the total number of the character string included in the existing sample documents.

A system for classifying documents,
Dictionary acquisition means for acquiring an electronic dictionary in which character strings corresponding to each of a plurality of categories of the document are described;
A document acquisition unit for acquiring a document to be classified;
A comparison and collation unit that compares and collates each character string included in the acquired document with each character string described in the electronic dictionary; and
A document classification unit that, when the comparison and collation finds character strings that match each other, identifies the category corresponding to the matched character string by referring to the electronic dictionary, and classifies the acquired document by assigning to it an assignment target category, which is the identified category.

The document classification system according to claim 3, further comprising:
ground character string extracting means for extracting, from the acquired document, the matched character string that became the ground for assigning the assignment target category, or a first long character string including that character string;
notifying means for notifying a predetermined user of the extracted matched character string or first long character string, and of the assignment target category;
a question output unit that outputs to the user a question as to whether the notified assignment target category may be assigned to the document; and
answer receiving means for receiving from the user an answer to the output question,
wherein the document classifying means assigns the assignment target category to the acquired document only when the answer receiving means receives an affirmative answer.

The document classification system according to claim 4, wherein the ground character string extracting means extracts a second long character string including the matched character string from a specific sample document that includes the matched character string, among a plurality of sample documents belonging to each of the plurality of categories, and
the notifying means notifies the user of the first long character string, the assignment target category, the second long character string, and the category to which the specific sample document belongs.

A method for creating an electronic dictionary for document classification, comprising:
receiving a plurality of sample documents belonging to each of a plurality of document categories and determining, for each of the many character strings contained in the sample documents, the degree of association between the character string and the category; and
selecting, based on the result of the determination, a plurality of specific character strings whose degree of association is equal to or greater than a predetermined degree from among the many character strings, and creating an electronic dictionary in which each selected specific character string and the one or more categories corresponding to it are recorded.
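
The two steps of this method can be sketched in Python as follows. The claim does not fix how the degree of association is measured, so the sketch makes assumptions: the degree of association is taken to be the share of the sample documents containing a string that belong to a given category, whitespace-separated words stand in for the character strings, and the predetermined degree is a hypothetical threshold of 0.8. The names `build_dictionary` and `threshold` are illustrative only.

```python
from collections import defaultdict


def build_dictionary(samples, threshold=0.8):
    """Build {character string: set of categories} from (category, text) samples.

    Assumption: degree of association = share of the sample documents
    containing the string that belong to the category.
    """
    # Count, per string, how many documents of each category contain it.
    doc_counts = defaultdict(lambda: defaultdict(int))
    for category, text in samples:
        for token in set(text.split()):  # count each string once per document
            doc_counts[token][category] += 1

    dictionary = {}
    for token, per_category in doc_counts.items():
        total = sum(per_category.values())
        # Keep only the categories whose association meets the threshold.
        selected = {c for c, n in per_category.items() if n / total >= threshold}
        if selected:
            dictionary[token] = selected
    return dictionary


samples = [
    ("sports", "team wins final match"),
    ("sports", "team signs new player"),
    ("finance", "bank raises interest rate"),
    ("finance", "new bank opens"),
]
system_dictionary = build_dictionary(samples)
```

With these samples, "team" is recorded for "sports" and "bank" for "finance", while "new" appears equally in both categories and is therefore left out of the dictionary.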

A method for classifying documents, comprising:
acquiring an electronic dictionary in which, for each of a plurality of document categories, two or more character strings characteristic of that category are recorded;
acquiring a document to be classified;
comparing and collating each character string contained in the acquired document with each characteristic character string recorded in the electronic dictionary; and
when the comparison and collation finds character strings that match each other, identifying the category corresponding to the matched character string by referring to the electronic dictionary, and classifying the acquired document by assigning the identified category to it as an assignment target category.
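
A minimal Python sketch of these steps is shown below. It assumes the electronic dictionary is held as a mapping from character string to a set of categories and that the comparison and collation is a plain substring test; the claim itself does not prescribe either choice, and the function name `classify` is hypothetical.

```python
def classify(document, dictionary):
    """Return {category: [matched strings]} for dictionary strings found in the document."""
    assigned = {}
    for string, categories in dictionary.items():
        if string in document:  # comparison and collation by substring match
            # Look up the categories recorded for the matched string.
            for category in categories:
                assigned.setdefault(category, []).append(string)
    return assigned


dictionary = {
    "interest rate": {"finance"},
    "final match": {"sports"},
    "transfer": {"sports", "finance"},
}
result = classify(
    "The bank announced a new interest rate after the transfer.",
    dictionary,
)
```

Keeping the matched strings alongside each category makes it possible to show the user why a category was proposed.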

A computer program for creating an electronic dictionary for document classification, the program causing a computer to execute:
receiving a plurality of sample documents belonging to each of a plurality of document categories and determining, for each of the many character strings contained in the sample documents, the degree of association between the character string and the category; and
selecting, based on the result of the determination, a plurality of specific character strings whose degree of association is equal to or greater than a predetermined degree from among the many character strings, and creating an electronic dictionary in which each selected specific character string and the one or more categories corresponding to it are recorded.

A computer program for classifying documents, the program causing a computer to execute:
acquiring an electronic dictionary in which, for each of a plurality of document categories, two or more character strings characteristic of that category are recorded;
acquiring an input document to be classified;
comparing and collating each character string contained in the acquired document with each characteristic character string recorded in the electronic dictionary; and
when the comparison and collation finds character strings that match each other, identifying the category corresponding to the matched character string by referring to the electronic dictionary, and classifying the acquired document by assigning the identified category to it as an assignment target category.

An electronic dictionary for document classification,
in which, for each of a plurality of document categories, one or more character strings considered or estimated to be characteristic of that category are recorded.