JP3609252B2

JP3609252B2 - Automatic character string classification apparatus and method

Info

Publication number: JP3609252B2
Application number: JP07392098A
Authority: JP
Inventors: さより下畑
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-03-23
Filing date: 1998-03-23
Publication date: 2005-01-12
Anticipated expiration: 2018-03-23
Also published as: JPH11272702A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストから抽出された任意の文字列を、特定の分野で使われる専門表現と分野に関係なく使われる一般表現に分類する方法および装置に関する。
【０００２】
【従来の技術】
情報処理装置で使用できるようにデータ化された文書を、自動的に翻訳する機械翻訳システムや、文書をサーチして該当するキーワードを使用した文書を検索するといったシステムでは、文書中に存在する一定の意味を持った文字列を抽出する処理が必要である。
文献１「情報処理学会研究報告Ｖｏｌ．９３，Ｎｏ．６１（９３−ＮＬ−９６−１）」
文献１に開示された手法では、テキスト文書に含まれるすべての文字を先頭とするすべての長さの文字列（テキストの長さをｌとした場合、テキスト中に出現する文字ｉ（１≦ｉ≦Ｌ）を先頭とする長さｎ（１≦ｎ≦Ｌ−ｉ）の文字列。以下、ｎ−ｇｒａｍ文字列と呼ぶ）を抽出し、その出現回数をカウントすることで、処理対象となるテキスト文書から文字列を抽出している。この手法は、形態素解析を行なったり辞書を使用したりする必要がなく、統計処理だけで文字列を抽出できるという特徴がある。しかし、テキストに出現する文字列を文字数と出現回数により網羅的に抽出するため、言語として意味をなさない文字列（以下、断片的文字列と呼ぶ）が混在するという問題がある。
【０００３】
ところで、単語や句のようにひとまとまりとして認識すべき連続文字列（以下、「表現」と呼ぶ）は、テキスト中での出現頻度が高く、またその前後には様々な語が出現するという特徴がある。
文献２情報処理学会研究報告Ｖｏｌ．９５，Ｎｏ．１１０（９５−ＮＬ−１１０−１１）
文献２に開示された技術では、上記特徴を利用して、任意の文字列の直前の文字の分散値と、直後の文字の分散値を計算することにより、妥当な連続文字列を抽出する手法について述べたものである。この手法では、文献１の手法で抽出された文字列から分散値の低い文字列を断片的文字列として除去することにより、意味のある文字列、すなわち「表現」だけを抽出するものである。
【０００４】
【発明が解決しようとする課題】
通常、「表現」には様々なレベル（例えば専門用語や固有名詞などの専門表現、一般用語や慣用句などの一般表現）のものがある。
しかしながら、従来の文字列抽出方法では、文書中から抽出した文字列に含まれる断片的文字列は排除され、意味のある文字列「表現」が抽出されるものの、様々なレベルの表現の文字列が混在した状態で抽出されてしまう。
従って、抽出結果を実際に利用する際には、さらにこれを用途に合わせて分類しなければならないという問題があった。
【０００５】
このような点から、本発明はテキスト文書から抽出された任意の文字列を、特定の分野で使われる専門表現と、分野に関係なく使われる一般表現に、適切に分類することができる文字列自動分類装置を提供することを目的とする。
【０００６】
【課題を解決するための手段】
係る課題を解決するため、本発明は、自然言語で記述された複数の文書を保存する文書格納手段と、複数の文書のうち、任意の文書から文字列を抽出する文字列抽出手段と、文字列抽出手段により抽出した文字列の、抽出した文書内での重要度を文書内重要度として算出する文書内重要度計算手段と、文字列抽出手段により抽出した文字列の、複数の文書全体での重要度を文書間重要度として算出する文書間重要度計算手段と、文書内重要度と文書間重要度に基き抽出した文字列の重要度を文字列重要度として算出する文字列重要度計算手段と、文字列重要度計算手段で得られた文字列重要度に基き抽出した文字列を分類する文字列分類手段とを備え、文字列分類手段は、抽出したすべての文字列に対し、各々の文字列に付与されている文字列重要度と予め定めた閾値を比較し、この抽出したすべての文字列を特定分野に関係なく用いる一般表現または特定分野で用いる専門表現に分類する第１の分類部と、第１の分類部で専門表現に分類された文字列を分割し、この分割した文字列の各構成要素が第１の分類結果に存在する場合は、この分割した文字列の構成要素の組合せにより分割前の文字列の分類を該専門表現から一般表現を伴って用いられる専門表現である一般専門表現または一般表現に分類し直す第２の分類部とを有する文字列自動分類装置を提供する。
【０００９】
【発明の実施の形態】
以下、本発明の実施形態に係る文字列自動分類装置について、図面を用いて詳細に説明する。
（Ａ）第１の実施形態
図１は、本発明の文字列自動分類装置の第１の実施形態を示すブロック構成図であり、ワークステーションやパソコン等の情報処理装置上で実現されるものである。
【００１０】
図１において、文字列自動分類装置は、入出力装置１と、処理装置２と、記憶装置３を有する。入出力装置１は、テキストの入力、抽出結果の表示などを行なう機能を有する。処理装置２は、連続文字列を抽出し、その結果を分類するための各種処理を実行する機能を有する。記憶装置３は、入力されたテキストや各段階の処理結果を保存する機能を有する。
【００１１】
更に、入出力装置１は、入力部１１と出力部１２を有する。入力部１１はデータとなるテキスト文書を入力する機能を有し、例えばキーボード等から構成される。出力部１２は抽出した文字列等の表示を行う機能を有し、例えばディスプレイやプリンタ等で構成される。
【００１２】
処理装置２は、文字列抽出部２１と重要度計算部２２と文字列分類部２３を有する。文字列抽出部２１は、後述する文書ファイルを読み込み、その文書ファイルに含まれる任意のｎ−ｇｒａｍ文字列を抽出する。この抽出方法は、例えば、文献１に示した文字列抽出方法を用いる。また、抽出結果に対して、文献２のような断片的文字列を除去する処理を行ってもよい。
なお、従来技術で記載した方法にかかわらず、単語や句、節の様な文字列が抽出できる方法ならよい。
【００１３】
重要度計算部２２は、文字列抽出部２１で抽出したｎ−ｇｒａｍ文字列の文書内および文書間の重要度を計算し、この２つの重要度から文字列に重み付けをした最終的な文字列の重要度（以下、文字列重要度と呼ぶ）を求めるものである。文字列分類部２３は、重要度計算部２２で抽出した文字列毎に付与された重要度に基いて、特定分野で利用される専門表現、または、分野に関係なく通常の文書中に多く利用される一般表現に分類するものである。重要度計算部２２および文字列分類部２３については、詳細を後述する。
【００１４】
記憶装置３は、文書ファイル３１と文書内重要度テーブル３２と文書間重要度テーブル３３とキーワードテーブル３４とバッファ３５を有する。
文書ファイル３１は、入力部１２から入力されたデータとなるテキスト文書を文書ファイルとして保存するものである。本実施の形態では、文書ファイル３１は複数個存在し、文書ファイル３１の記載分野（内容）は限定されず、文書ファイル毎に異なる分野のものであってもよい。
【００１５】
図２に文書内重要度テーブル３２の例を示す。文書内重要度テーブル３２は、文字列抽出部２１によって文書ファイル３１から生成されたｎ−ｇｒａｍ文字列を格納する文字列格納部と、文字列の文書内の重要度を格納する文書内重要度格納部と、その文字列の文書内での重要度と、文書内での文字列の重要度に文書間での文字列の重要度を加味した重みつき重要度を格納する文字列重要度格納部を有する。
【００１６】
なお、文書内重要度格納テーブル３２は、複数の文書ファイル３１毎に対応している。
図３に文書間重要度テーブル３３の例を示す。文書間重要度テーブル３３は、文字列抽出部２１によって文書ファイル３１から生成されたｎ−ｇｒａｍ文字列を格納する文字列格納部と、複数の文書ファイル３１の内の１つの文書ファイル３１において抽出されたｎ−ｇｒａｍ文字列が、その他の文書ファイル３１に出現する数を格納する出現文書数格納部と、複数の文書ファイル３１における文字列の重要度を格納する文書間重要度格納部を有する。本実施形態では、１つのテーブルで構成している。
【００１７】
なお、文書間重要度格納テーブル３３は、複数の文書ファイル３１毎に生成された複数の文書内重要度格納テーブル３２に格納されている文字列から作成されるものである。
【００１８】
キーワードテーブル３４は、文字列分類部２３によって文書ファイル３１内で重要と判断され抽出された文字列（キーワード）を格納する。
図４にキーワードテーブル３４の例を示す。キーワードテーブル３４は、抽出された文字列を格納する文字列格納部と、文書内重要度テーブル３２の文字列重要度格納部に格納されている文字列分類部２３によって文書ファイル３１内で重要と判断され抽出された文字列（キーワード）を格納する重要度格納部を有する。なお、キーワードテーブル３４は、複数の文書ファイル３１毎に対応している。
【００１９】
バッファ３５は、各処理の過程で得られる値や途中結果など格納する。
【００２０】
ここで、重要度計算部２２について詳細に説明する。
図５は、重要度計算部２２の機能を示す構成図である。
重要度計算部２２は、さらに、文書内重要度計算部２２１と文書間重要度計算部２２２と文字列重要度計算部２２３を有するものである。それぞれ、文書内重要度テーブル３２と文書間重要度テーブル３３と関連して処理を行う。
【００２１】
文書内重要度計算部２２１は、記憶装置２に格納された複数の文書ファイル３１から１つの文書ファイル３１を読み出し、その文書ファイル３１から抽出したｎ−ｇｒａｍ文字列のその文書ファイル３１における文書内重要度を計算する。文書内重要度には、例えば、ある文字列が文書ファイル３１に出現する出現頻度を用いる。計算した文書内重要度は文書内重要度テーブル３２の文書内重要度格納部に格納する。この文書内重要度の計算処理をすべての文書ファイル３１に対して繰り返し行う。
【００２２】
出現頻度は、文献１および２の方法でｎ−ｇｒａｍ文字列を抽出する際に同時に求めることができる。
【００２３】
文書間重要度計算部２２２は、抽出したｎ−ｇｒａｍ文字列の、複数ある文書ファイル３１全体における重要度を求めるものである。
まず、文書ファイル３１に対応する文書内重要度テーブル３２から、１レコードづつ読み込みｎ−ｇｒａｍ文字列が、幾つの文書ファイル３１に出現する文字列であるかを計数する。この計数は、複数個の文書ファイル３１に対応する文書内重要度テーブル３２それぞれに対し行い、ある文字列が出現する文書ファイル３１の累計を出現文書数とする。出現文書数が多い時はその文字列が特定の文書ファイル３１にかかわらず多く出現する文字列であることがわかり、逆に出現文書数が少ないときは、特定の文書ファイルにしか出現しない文字列であることがわかる。
【００２４】
求めた出現文書数は、文書間重要度テーブル３３の出現文書数格納部に格納する。
さらに、各文字列の文書間重要度を計算する。ここでは、ある文字列_ｋを含む文書ファイル３１が少ないほど文書間の重要度が大きな値をとるようにする。この値は、例えば、ｉｎｖｅｒｓｅｄｏｃｕｍｅｎｔｆｒｅｑｕｅｎｃｙを用いる。ｉｎｖｅｒｓｅｄｏｃｕｍｅｎｔｆｒｅｑｕｅｎｃｙとは、ある文字列を含む文書の数の逆数に、全体の文書数を掛けたものである。文字列_ｋのｉｎｖｅｒｓｅｄｏｃｕｍｅｎｔｆｒｅｑｕｅｎｃｙをｉｄｆ_ｋとし、文書ファイル数をＮ、抽出した文字列を含む文書ファイル３１の数（出現文書数）をｎ_ｋとすると、次の式で求められる。
【００２５】
（式１）ｉｄｆ_ｋ＝ｌｏｇ（Ｎ／ｎ_ｋ）ここで、ｉｄｆ_ｋはｎ_ｋ＝１のとき最大値をとり、ｎ_ｋ＝Ｎのとき最小値（＝０）をとり、文字列_ｋを含む文書ファイル３１の数によって変化する。文書間重要度は、文字列ｋを含む文書ファイル３１が多いほど小さな値をとる。逆に、文字列ｋが少ない文書ファイル３１にしか含まれない場合は大きな値をとる。
【００２６】
この文書間重要度の計算処理を文書間重要度テーブル３３の文字列格納部に格納されているすべてのｎ−ｇｒａｍ文字列に対して繰り返し行う。
求めた文書間重要度は、文書間重要度テーブル３３の文書間重要度格納部に格納する。
【００２７】
文字列重要度計算部２２３は、文書内重要度テーブル３２から文書内重要度を、文書間重要度テーブル３３から文書間重要度をそれぞれ読み出し、文字列の文書内での重要度を文書間の重要度によって重み付けされた、文字列の重要度の計算を行う。ここでは、特定の文書ファイル３１での出現頻度は高いが、それ以外の文書ファイル３１ではほとんど出現しない文字列、すなわち特定分野に関連する文字列の重要度が高くなるように設定し、逆に特定の文書ファイル３１に関係なく、多くの文書ファイル３１に出現する、すなわち特定分野に関係なく出現する文字列の値が低くなるように設定する。算出方法は、例えば、文書内重要度と文書間重要度の積を用いる。
【００２８】
ここで、文書ファイル_ｉ３１（１≦_ｉ≦Ｎ）から抽出した文字列_ｋの文字列重要度をＷ_ｉｋとする。文字列_ｋの文書ファイル_ｉ３１における出現頻度を表す文書内重要度をｔｆ_ｉｋとし、文字列_ｋの文書間における重要度を表す文書間重要度をｉｄｆ_ｋとすると、次の式２で求められる。（式２）Ｗ_ｉｋ＝ｔｆ_ｉｋ×ｉｄｆ_ｋ文字列重要度は、文字列_ｋが少ない文書ファイルにしか出現しないためｉｄｆ_ｋの値が高く、かつ、文字列_ｋが抽出された文書ファイル内での出現頻度が高いため、ｔｆ_ｉｋの値が高い場合に、高い値を得る。
【００２９】
文字列重要度が高い値を得た場合に、文字列_ｋがその文書ファイル内において重要なキーワードであると判断する。求めた、文字列重要度は、文書内重要度テーブル３３の文字列重要度格納部に格納する。この文字列の重要度の計算処理を、文書内重要度テーブル３２のすべての文字列に対して繰り返して行う。
【００３０】
次に、文字列分類部２３について詳細に説明する。
文字列分類部２３は、文書内重要度テーブル３２から１レコードづつ読み込み、文字列重要度格納部に格納されている文字列の重要度に基き、文字列を専門表現と一般表現の２つに分類し、専門表現のみを抽出する。
【００３１】
分類方法は、例えば、予め閾値を定めておき、その閾値と比較することで行う。ここで、閾値をＴとし、文字列の重要度が閾値Ｔより大きい場合にその文字列は重要度が高いと判断する。閾値と比較が終了し、重要だと判断された文字列は、文字列とその文字列の重要度とともに、キーワードテーブル３４に格納する。この分類処理を、すべての文書内重要度テーブル３２に対して繰り返しておこない、文書ファイル３１内で重要である文字列として分類する。
【００３２】
図６は、本発明の文字列自動抽出装置の動作を示すフローチャートである。
ここで、入力部１１からデータとなるテキスト文書を文書ファイルとして、記憶部３の文書ファイル３１に入力し、複数の文書ファイル３１が格納されているものとする。また、文書ファイル３１の総数は予めわかっているものとする。
【００３３】
まず、文字列抽出部２１は、文書ファイル３１から複数存在する文書ファイル中から１つの文書ファイルを読み込み（ステップ１）、読み込んだ文書ファイルからｎ−ｇｒａｍ文字列を抽出し、文書内重要度テーブル３２の文字列格納部に格納する。（ステップ２）。
【００３４】
抽出したｎ−ｇｒａｍ文字列がその文書ファイル中に出現する頻度を求め、文書内重要度テーブル３２の文書内重要度格納部に格納する。（ステップ３）。
ここで、文字列抽出が未処理である文書ファイル３１が存在するかを判断する。未処理の文書ファイル３１が存在する場合はステップ１の処理に戻り、最後の文書ファイル３１であるときには次のステップ４の処理に進む（ステップ４）。
【００３５】
次に文字列毎に求めた文書内重要度を用いて、文字列の文書間重要度を求める。
ステップ３までの処理過程で生成したすべての文書内重要度テーブル３２を参照し、各文字列の文書間重要度を計算して文書間重要度テーブル３３に格納する（ステップ５）。
【００３６】
次に、文書内重要度と文書間重要度を用いて、抽出したｎ−ｇｒａｍ文字列に重み付けをした文字列の重要度を計算し、文書内重要度テーブル３３の文字列重要度格納部に格納する。（ステップ６）。
最後に、重み付けされた文字列の重要度と予め設定した閾値とを比較して文字列の分類を行う（ステップ７）。
【００３７】
ここで、文書間重要度の処理を行うステップ５について図７のフローチャートを用いて詳細に説明する。
まず、文書内重要度テーブル３２から１レコード読み込む（ステップ５１）。読み込んだ文字列が文書間重要度テーブル３３の文字列格納部にすでに格納されているかを判断し、格納されている場合はステップ５４に進む。
【００３８】
格納されていない場合は、ステップ５３に進み、文字列重要度テーブル３３の文字列格納部に文字列を格納した後、ステップ５４に処理を進める（ステップ５２、５３）。ステップ５４において、その文字列の出現文書数を１増加する（ステップ５４）。対象となっている文書内重要度テーブル３２に未処理のレコードが在るかを判断し、まだ未処理のレコードがあればステップ５１に戻り、無ければステップ５６に進む（ステップ５５）。
【００３９】
次に、ステップ５５までの処理をすべての文書内重要度テーブルに実行したかを判断する。まだ未処理の文書内重要度テーブル３２がある場合は、ステップ５７に進み、未処理の文書内重要度テーブル３２が無い場合はステップ５８に処理に進める（ステップ５６）。未処理の文書内重要度テーブル３２がある場合は次の文書内重要度テーブル３２に処理を移し、すべての文書内重要度テーブル３２に対してステップ５１からステップ５５の処理を行う（ステップ５７）。
【００４０】
次に、文書間重要度テーブル３３から１レコード読み込む（ステップ５８）。出現文書数と文書ファイル総数を用いて文書間での文字列の重要度を計算し、（ステップ５９）求めた文書間重要度を文書間重要度テーブル３３の文書間重要度格納部に格納する（ステップ５１１）。次に、文書間重要度テーブル３３に存在するすべてのレコードに処理を実行したかを判断する。まだ未処理のレコードがある場合は、ステップ５８に進み、未処理のレコードが無い場合は処理を終了する（ステップ５１２）。
【００４１】
次に、文字列重要度の処理を行うステップ６について、図８のフローチャートを用いて詳細に説明する。
【００４２】
まず、文書内重要度テーブル３２から１レコード読み込む（ステップ６１）。読み込んだ文字列に該当する文書間重要度を文書間重要度テーブル３３から参照し、重み付けした文字列の重要度を計算し（ステップ６２）、求めた文字列重要度を現在処理の対象となっている文書内重要度格納テーブル３２の文字列重要度格納部に格納する（ステップ６３）。
【００４３】
ここで、対象となっている文書内重要度テーブル３２に未処理のレコードが在るかを判断し、まだ未処理のレコードがあればステップ６１に戻り処理を続け、無ければステップ６５に進む（ステップ６４）。次にステップ６４までの処理をすべての文書内重要度テーブル３２に実行し、重み付けした文字列重要度を算出したかを判断する。
【００４４】
まだ未処理の文書内重要度テーブル３２がある場合はステップ６６に進み、未処理の文書内重要度テーブル３２が無い場合は、図６のステップ７に進む（ステップ６５）。未処理の文書内重要度テーブル３２がある場合は次の文書内重要度テーブル３２に処理を移しすべての文書内重要度テーブル３２に対してステップ６１からステップ６５の処理を行う（ステップ６６）。
【００４５】
次に、重み付けされた文字列の分類処理を行うステップ７について、図９のフローチャートを用いて詳細に説明する。
【００４６】
まず、文書内重要度テーブルから１レコード読み込む（ステップ７１）。読み込んだ文字列重要度と予め設定した閾値との比較を行い、文字列重要度が閾値より大きければステップ７３に進み、閾値より小さければステップ７４に進む（ステップ７２）。閾値より大きいと判断された文字列重要度と、その文字列重要度に対応する文字列を、対象となっている文書内重要度テーブル３２に対応する、キーワードテーブル３４の重要度格納部と文字列格納部にそれぞれ格納する。
【００４７】
ここで、対象となっている文書内重要度テーブル３２に未処理のレコードが在るかを判断し、まだ未処理のレコードがあればステップ７１に戻り処理を続け、無ければステップ７５に進む（ステップ７４）。次にステップ７４までの処理をすべての文書内重要度テーブル３２に実行し、文字列の分類を行ったかを判断する。
【００４８】
まだ未処理の文書内重要度テーブル３２がある場合はステップ７６に進み、未処理の文書内重要度テーブル３２が無い場合は処理を終了する。未処理の文書内重要度テーブル３２がある場合は次の文書内重要度テーブル３２に処理を移しすべての文書内重要度テーブル３２に対してステップ７１からステップ７５の処理を行う（ステップ７６）。
【００４９】
次に、実際の事例と図６から図９のフローチャートを用いて、本発明の処理過程を具体的に説明する。
【００５０】
記憶装置３にはＮ個の文書ファイル３１が格納されているものとする。
図１０は文書内重要度テーブル３２の例である。
【００５１】
まず、文書ファイル_ｉ３１（１≧_ｉ≧Ｎ）の内容を読み込み、ｎ−ｇｒａｍ文字列を抽出し、抽出した文字列の文書ファイル_ｉ３１中における出現頻度を求め、それぞれ対応する文書内重要度テーブル３２に格納する（図６のステップ１、２、３）。
【００５２】
図１０に抽出した文字列と出現頻度（文書内重要度）を格納した文書内重要度テーブル３２を示す。図６のステップ１、２、３の処理をＮ個の文書ファイル_ｉ３１に対して各々行う。この処理の終了後は、文書ファイル_ｉ３１（１≦_ｉ≦Ｎ）に対応して、文書内重要度テーブルｉ３２（１≦ｉ≦Ｎ）が作成される。
【００５３】
次に、各文字列の文書間の重要度を計算し、これを文書間重要度テーブル３３に格納する。図１１に文書間重要度テーブルの例を示す。ここで、図１０の文書内重要度テーブル_ｉ３２を処理対象とし、読み込んだ文字列_ｋ＝“での“が、その他の文書内重要度テーブル３２にが存在するかを判断する。存在する場合は他の文書での出現数としてカウントして合計値を文書間重要度テーブル_ｉ３３の出現文書数格納部に格納する。ここでは、他文書での出現数ｎ_ｋ＝４３であったとする。（図７ステップ５１〜５７）。
【００５４】
次に文書間重要度を求める。文書ファイル３１の数Ｎ＝５０、文字列_ｋ＝“での“、出現文書数ｎ_ｋ＝４３を式（１）に従って、文書ファイル_ｉ３１における文字列_ｋの文書間重要度ｉｄｆ_ｋ＝ｌｏｇ（５０／４３）＝０．１５を求め、文書間重要度格納部に０．１５を格納する（図７ステップ５８〜５１１）。図１１に図１０に示した文書内重要度テーブル３２に文書間重要度計算を実行した後の文書間重要度テーブル３３の内容を示す。
【００５５】
次に、ｎ−ｇｒａｍ文字列に重み付けした文字列重要度を求める。図１０の文書内重要度テーブル_ｉ３１から文字列_ｋ＝“での”の文書内重要度ｔｆ_ｉｋ＝１１を読み込む。また、図１１の文書間重要度テーブル３３から文字列_ｋ＝“での”の文書間重要度ｉｄｆ_ｋ＝０．１５を読み込む。式（２）に従って、文書ファイル_ｉ３１における文字列_ｋの重要度Ｗ_ｉｋ＝１１×０．１５＝１．６５を求め、得られた値を文書内重要度テーブルｉ３１の文字列重要度格納部に格納する（図８ステップ６１から６３）。
【００５６】
この処理を、文書内重要度テーブル_ｉ３３のすべての文字列に対して繰り返して行う（ステップ６５、６６）。図１２に、図１０の文書内重要度テーブル_ｉ３２に対し、図１１の文書間重要度テーブル３３を用いて文字列に重み付けを行った後の内容を示す。図示しないが、文書内重要度テーブル_ｉ３２以外の文書内重要度テーブル３２が存在すれば次のテーブルに移り、同様の処理を繰り返す。
【００５７】
最後に、ｎ−ｇｒａｍ文字列の分類処理を行う。
図１２の文書内重要度テーブル_ｉ３２から文字列_ｋ＝“での”と文字列_ｋ＝“での”に対応する重み付き文字列重要度Ｗ_ｉｋ＝１．６５を読み込み、予め定めた閾値Ｔ＝１０との比較を行う。文字列の重要度が閾値Ｔより大きければキーワードとして登録するが、文字列の重要度１．６５は閾値１０よりも小さいため、キーワードテーブル_ｉ３４には格納されない（図９ステップ７１から７４）。
【００５８】
この処理を文書内重要度テーブル_ｉ３２中のすべてのレコードに対して繰り返して行う（図９ステップ７５、７６）。
【００５９】
続いて同様の処理を行うと、文字列_ｋ＝“で”および“の”の文字列重要度Ｗ_ｉｋは０であり、閾値１０より小さいためキーワードとして登録されない。これに対して、文字列_ｋ＝“ネットワーク”の文字列重要度Ｗ_ｉｋは３９．３３であり、閾値１０より大きいので、文字列“ネットワーク”と文字列重要度をキーワードテーブル_ｉ３４に格納する。
【００６０】
図１３に、図１２の文書内重要度テーブル_ｉ３２に対し、閾値Ｔ＝１０として分類処理を行った後の内容を示す。
以上の処理を複数の文書ファイル（１〜Ｎ）に対して行う、この結果、すべての文書ファイル３１に対して、対応するキーワード３４が作成される。
【００６１】
＜第１の実施形態の効果＞
本発明の第１の実施形態によれば、テキストから抽出したｎ−ｇｒａｍ文字列を専門表現と一般表現に分類することができる。文書内での重要度のみで判断するのではなく、文書間の重要度を加味することにより、各文書における専門表現と一般表現を相対的に分類することができる。つまり、文書内での出現頻度が少ない文字列であっても専門性が高いと判断されれば（特定の文書にしか出現しなければ）専門表現としての値が高くなり、キーワードとして登録することができる。
【００６２】
また、用意された文書ファイルの内容に応じて、適切な分類が行うことができる。例えば、第１の実施形態において“ネットワーク”という文字列は、少ない文書ファイルにしか出現しないため文書中での専門性が高いと判断でき専門表現として分類できる。しかし、文書ファイルがすべてネットワーク関連の論文等であった場合は文字列“ネットワーク”の重要度は低くなり抽出されなくなる。この特徴は、キーワード検索装置で利用するキーワードを抽出する際等に有効である。
【００６３】
（Ｂ）第２の実施形態
図１４は、本発明の文字列自動分類装置の第２の実施形態を示すブロック構成図である。第２の実施形態において第１の実施形態と同様の機能を備えるブロックには同一の番号を付与し、第２の実施形態において第１の実施形態と異なるブロックについてのみ詳細に説明する。
【００６４】
処理装置２は第１の実施形態での文字列分類部２３に代わり、文字列複数分類部２４を備える。
【００６５】
文字列複数分類部２４は抽出したｎ−ｇｒａｍ文字列を、重要度計算部２２で文字列毎に付与された重要度に基いて、専門表現、一般表現、または一般表現と専門表現の組合わせの３種類に分類するものである。
【００６６】
まず、文書内重要度テーブル３２から１レコードづつ読み込み、文字列重要度格納部に格納されている文字列の重要度に基き文字列を分類する。
【００６７】
分類方法は、例えば、予め閾値を定めておき、その閾値と比較することで行う。ここで、閾値をＴとした場合、文字列の重要度が閾値Ｔより大きい場合にその文字列は重要度が高いと判断して、専門表現を表す分類コード（ＤＣ＝ｄｏｍａｉｎｄｅｐｅｎｄｅｎｔｃｏｌｌｏｃａｔｉｏｎ、以下、専門表現ＤＣ）を付与し、閾値Ｔより小さい場合は、一般表現を表す分類コード（ＧＣ＝ｇｅｎｅｒａｌｃｏｌｌｏｃａｔｉｏｎ、以下、一般表現ＧＣ）を付与する。
【００６８】
この処理ですべての文字列に専門表現ＤＣまたは一般表現ＧＣのいずれかの分類コードを付与した後、この分類コードに基き、更に文字の並び方を考慮して分類コードを再付与する。ここで、専門表現ＤＣと一般表現ＧＣの他に、この２つの表現を組合わせた表現である場合、組合わせを表す分類コード（ＣＧＤ＝ｃｏｍｂｉｎａｔｉｏｎｏｆｇｅｎｅｒａｌａｎｄｄｏｍａｉｎｄｅｐｅｎｄｅｎｔｃｏｌｌｏｃａｔｉｏｎ、以下、一般・専門表現ＣＧＤ）を付与する。
【００６９】
分類コードの付与処理が終了した後、文字列とその文字列の分類コードを文字列分類テーブル３６に格納する。この分類コードの付与処理を、すべての文書内重要度テーブル３２に対して繰り返しおこない、文書ファイル３１内の文字列を分類する。
【００７０】
なお、一般・専門表現ＣＧＤは、２つの表現の組合わせ順に関係なく、専門表現ＤＣ・一般表現ＧＣの順、または一般表現ＧＣ・専門表現ＤＣの順でも一般・専門表現ＣＧＤを構成するものとする。また、専門表現ＤＣと一般表現・専門表現ＣＧＤの組み合せと、一般表現ＧＣと一般表現・専門表現ＣＧＤの組み合せの場合も、一般・専門表現ＣＧＤを構成する。この場合も、２つの表現の順序は問わない。
【００７１】
記憶装置３は第１の実施形態でのキーワードテーブル３４に代わり、文字列分類テーブル３６を備える。
【００７２】
文字列分類テーブル３６は、文字列複数分類部２４によって文書ファイル３１内で重要と判断され抽出された文字列（キーワード）と、文字列ごとに付与した分類コードを格納する。
【００７３】
図１６に文字列分類テーブル３６の例を示す。文字列分類テーブル３６は、抽出された文字列を格納する文字列格納部と、文字列複数分類部２４によって文字列に付与された分類コードを格納する分類コード格納部を有する。なお、文字列分類テーブル３６は、複数の文書ファイル毎に対応している。
【００７４】
次に、第２の実施形態の特徴である文字列複数分類処理についてフローチャートを用いて説明する。本発明における文字列複数分類処理以外の処理は、第１の実施形態と同様である。
【００７５】
図１５は、文字列複数分類部の動作を示すフローチャートである。
図１５のステップ７２０等に記載されている記号“＋”は文字列の要素の組み合せを表すものである。例えば、“ＧＣ＋ＤＣ”は一般表現の要素である文字列と専門表現の要素である文字列のと組み合せであることを表す。また、分類コードの並び順は関係がない。例えば、“ＧＣ＋ＣＧＤ”と記載した場合は、“ＣＧＤ＋ＧＣ”の並び順も含むものとする。
【００７６】
まず、文書内重要度テーブル３２から１レコード読み込む（ステップ７１１）。読み込んだ文字列重要度と予め設定した閾値との比較を行い、文字列重要度が閾値より大きければステップ７１３に進み、閾値より小さければステップ７１４に進む（ステップ７１２）。閾値より大きいと判断された場合は、その文字列に専門表現ＤＣを付与する（ステップ７１３）。
【００７７】
閾値より小さいと判断された場合は、その文字列に一般表現ＧＣを付与し、ステップ７１５に進む（ステップ７１４）。分類コードが付与された文字列とその分類コードを、文字列を読み込んだ文書内重要度テーブル３２に対応する文字列分類テーブル３６の文字列格納部と分類コード格納部にそれぞれ格納する（ステップ７１５）。
【００７８】
ここで、対象となっている文書内重要度テーブル３２に未処理のレコードが在るかを判断し、まだ未処理のレコードがあればステップ７１１に戻り、ステップ７１１からステップ７１５の処理を繰り返し、文書内重要度テーブル３２のすべての文字列に対して専門表現ＤＣまたは一般表現ＧＣのいずれかの分類コードを付与する。未処理のレコードが無ければステップ７１７に進む（ステップ７１６）。ステップ７１７ではフラグを０にセットする（ステップ７１７）。
【００７９】
次に、文字列分類テーブル３６から１レコード読み込む（ステップ７１８）。読み込んだレコードの文字列の要素が、「一般表現ＧＣ」または「一般・専門ＣＧＤ」であるかを判断する。２つの表現のいずれかに該当する場合はステップ７２５に進み、該当しない場合はステップ７２０に進む（ステップ７１９）。
【００８０】
次に、文字列の要素が「一般表現ＧＣと一般表現ＧＣの組合わせ」であるかを判断する。この組合わせに該当する場合はステップ７２１へ進み、該当しない場合はステップ７２２へ進む（ステップ７２０）。この組合わせに該当する場合は、文字列に一般表現ＧＣを付与し（ステップ７２１）、文字列分類テーブル３６の分類コード格納部に格納されている分類コードを一般表現ＧＣに置き換え（ステップ７２９）、フラグを１にして（ステップ７２４）処理をステップ７２５に進める。
【００８１】
更に、文字列の要素が「一般表現ＧＣと専門表現ＤＣの組合わせ」または「一般表現ＧＣと一般・専門表現ＣＧＤの組合わせ」もしくは「専門表現ＤＣと一般・専門表現ＣＧＤの組合わせ」であるか判断する。この３つの表現のいずれかに該当する場合はステップ７２３に進み、該当しない場合はステップ７２５に進む（ステップ７２２）。
【００８２】
この３つの表現に該当す場合は、文字列に一般・専門表現ＣＧＤを付与し（ステップ７２３）、文字列分類テーブル３６の分類コード格納部に格納されている分類コードを一般・専門表現ＣＧＤに置換へ、（ステップ７２９）、フラグを１にして（ステップ７２４）処理をステップ７２５に進める。
【００８３】
ここで、対象となっている文書内重要度テーブル３２に未処理のレコードが在るかを判断し、まだ未処理のレコードがあればステップ７１８に戻り、ステップ７１８からステップ７２５の処理を繰り返し、文書内重要度テーブル３２のすべての文字列に対して専門表現ＤＣまたは一般表現ＧＣもしくは一般・専門表現ＣＧＤのいずれかの分類コードを付与する。未処理のレコードが無ければステップ７２６に進む（ステップ７２５）。ステップ７２６ではフラグを０にセットする（ステップ７２６）。
【００８４】
次にステップ７２６までの処理をすべての文書内重要度テーブル３２に実行し、文字列の分類を行ったかを判断する。まだ未処理の文書内重要度テーブル３２がある場合はステップ７２８に進み、未処理の文書内重要度テーブル３２が無い場合は、処理を終了する（ステップ７２７）。
【００８５】
未処理の文書内重要度テーブル３２がある場合は次の文書内重要度テーブル３２に処理を移しすべての文書内重要度テーブル３２に対してステップ７１１からステップ７２７の処理を行う（ステップ７６）。
【００８６】
ここで、文字列が２つの要素の組み合せであるかの判断方法について説明する。
まず、ステップ７１１からステップ７１７までの処理を実行し、文字列にＤＣまたはＧＣのいずれかの分類コードが付与され文字列分類テーブル３６に格納されているものとする。
【００８７】
文字列を２つの要素に分割するには、文字列の区切り位置を１つづつずらして、各々の文字列が文字列分類テーブル３６中に存在し、かつその文字列の分類コードが判断条件に合致したときに分類コードを付与する。図１９に、文字列“ネットワークの構築“を２つの要素に分割する方法を示す。
【００８８】
ここで、文字列“ネットワーク”には専門表現ＤＣが、文字列“ネットワークの”には一般・専門表現ＣＧＤが、文字列“構築”には専門表現ＤＣが付与されているものとし、２つの要素の組み合せとする判断条件は、「一般表現ＧＣと専門表現ＤＣの組み合せ」または「一般表現ＧＣと一般・専門表現ＣＧＤの組み合せ」であることとする。
【００８９】
図１９において、番号部は文字列を分割した回数を示し、文字列Ａ部および文字列Ｂ部は２つに分割した文字列の各要素を示し、照合結果部は文字列Ａおよび文字列Ｂの両方の要素が文字列分類テーブル３６に存在するかを照合した結果を示す。
【００９０】
まず、“ネットワークの構築“の区切り位置を１文字づつずらしていくと、番号１では、文字列Ａが“ネ”で文字列Ｂが“ットワークの構築”となる。この２つの要素は文字列分類テーブル３６に存在しないので、２つの要素の組み合せでは無いと判断される。
【００９１】
次に、番号６では、文字列Ａが“ネットワーク”で文字列Ｂが“の構築”である。文字列Ａの“ネットワーク”は、文字列分類テーブル３６に存在するが、文字列Ｂの“の構築”は文字列分類テーブル３６に存在しないため、２つの要素の組み合せとはならない。
【００９２】
番号７では、文字列Ａが“ネットワークの”で文字列Ｂが“構築”である。文字列Ａと文字列Ｂの両方が文字列分類テーブル３６に存在し、かつ、各々の分類コードが一般・専門表現ＣＧＤと一般表現ＧＣである。したがって、文字列“ネットワークの構築“は、「一般表現ＧＣと一般・専門表現ＣＧＤの組み合せ」であると判断される。
【００９３】
次に、実際の事例と図１５のフローチャートを用いて、第２の実施形態における文字列複数分類処理の過程を具体的に説明する。
図６の文字列自動抽出装置の動作を示すフローチャートのステップ１からステップ５までの処理を実行し、図１２の文書内重要度テーブル３２の内容が得られているものとする。
【００９４】
また、記憶装置３にはＮ個の文書ファイル３１が格納されているものとする。図１２の文書内重要度テーブル_ｉ３１から文字列_ｋ＝“での”と文字列_ｋ＝“での”に対応する文字列重要度Ｗ_ｉｋ＝１．６５を読み込み、予め定めた閾値Ｔ＝１０との比較を行う。文字列の重要度１．６５は閾値１０よりも小さいため、一般表現ＧＣを付与し、文字列分類テーブル_ｉ３６に文字列と分類コードの一般表現ＧＣを格納する。（図１５ステップ７１１から７１５）。
【００９５】
この処理を文書内重要度テーブル_ｉ３２中のすべてのレコードに対して繰り返して行う（図１５ステップ７１６）。続いて同様の処理を行うと、文字列_ｋ＝“で”および“の”の文字列重要度Ｗ_ｉｋは０であり、閾値１０より小さいため、一般表現ＧＣを付与し、文字列分類テーブル_ｉ３６に文字列と分類コードの一般表現ＧＣを格納する（図１５ステップ７１４、７１５）。
【００９６】
これに対して、文字列_ｋ＝“ネットワーク”の文字列重要度Ｗ_ｉｋは３９．３３であり、閾値１０より大きいので、専門表現ＤＣを付与し、文字列分類テーブル_ｉ３６に文字列と分類コードの専門表現ＤＣを格納する（ステップ７１３、７１５）。図１７に、図１２の文書内重要度テーブル_ｉ３２に対し、閾値Ｔ＝１０として文字列複数分類処理を行った後の内容を示す。
【００９７】
次に、フラグを０にセットし、文字列分類テーブル_ｉ３６から１レコード読み込む。１レコードめの文字列“での”の分類コードは、一般表現ＧＣであり、最後のレコードではないので処理を次のレコードに移す（図１５ステップ７１８、７１９、７２５）。続いて読み込んだ文字列_ｋ＝“で”および“の”についても分類コードは、一般表現ＧＣとなる。
【００９８】
次に、文字列_ｋ＝“ネットワーク”を読み込む。ここで、文字列_ｋ＝“ネットワーク”は、図１５のステップ７１９、７２０、７２２に示す条件のいずれにも該当しない。また、最後のレコードでは無いので処理を次のレコードに移す（ステップ７１８、７１９、７２２、７２５）。
【００９９】
次に、文字列_ｋ＝“ネットワークの”を読み込む。文字列_ｋ＝“ネットワークの”は、専門表現ＤＣの“ネットワーク”と一般表現ＧＣ“の”の２つの要素で構成された文字列であるためステップ７２２の条件に該当し、分類コードとして一般・専門表現ＣＧＤを付与し、文字列分類テーブルｉ３６の分類コード格納部にすでに格納されている分類コードＤＣを一般・専門表現ＣＧＤに置き換える（ステップ７１８、７２２、７２３、７２９）。フラグが１にセットされ、最後のレコードではないので、次のレコードに処理を移す（ステップ７２４、７２５）。
【０１００】
次に、文字列_ｋ＝“ネットワークの構築”を読み込む。文字列_ｋ＝“ネットワークの構築”は、“ネットワークの”と“構築”の２つの要素に分割される。ここで、“ネットワークの”は先の処理で一般・専門表現ＣＧＤに分類コードが置き換えられているので、ステップ７２２の条件「一般・専門表現ＣＧＤと専門表現ＤＣの組み合せ」に該当し、分類コードとして一般・専門表現ＣＧＤを付与し、文字列分類テーブルｉ３６の分類コード格納部にすでに格納されている分類コードＤＣを一般・専門表現ＣＧＤに置き換える。
【０１０１】
文字列分類テーブル_ｉ３６の最後のレコードまで以上の処理を繰り返し、最後のレコードまで処理を行った後に、フラグが０であるかを判断する。この時、フラグは１となっているので処理をステップ７１７に進め、フラグを０にセットする。ステップ７１８から７２６の処理をステップ７２６の判断でフラグが０になるまで繰り返す。
【０１０２】
フラグが０の場合、文字列分類テーブル３６に格納されているすべての文字列に対して文字列複数分類処理が終了したことになる。ステップ７２７で、他に文字列分類テーブル３６があるかを判断し、すべての文字列分類テーブル_ｉ３６に対して処理を実行する。図１８に、図１７の文字列分類テーブル_ｉ３６に対し、文字列複数分類処理を行った後の内容を示す。
【０１０３】
以上の処理を複数の文書ファイル３１（１〜Ｎ）に対して行う、この結果、すべての文書ファイル３１に対して、対応する文字列分類テーブル３６が作成される。
【０１０４】
＜第２の実施形態の効果＞
本発明の第２の実施形態によれば、第１の実施形態で得られる効果の他に、テキストから抽出したｎ−ｇｒａｍ文字列を専門表現、一般表現、専門表現と一般表現の組合わせの３つの表現に分類することができる。
【０１０５】
専門表現と一般表現を組合わせた表現に分類することで、専門用語辞書を作成する際に不要な語句を除去することが可能である。
【０１０６】
例えば、図１８の文字列分類テーブル３６から専門用語辞書を作成する場合に、一般・専門表現ＣＧＤが付与されている“ネットワークの”のような辞書に登録する必要の無い文字列を除き、専門表現ＤＣが付与されている文字列のみで専門用語辞書を作成できる。
【０１０７】
また、一般・専門表現ＣＧＤが付与される文字列は、専門用語に伴って用いられる単語を含んでおり、専門的な言い回しを表すものである。従って、一般・専門表現ＣＧＤが付与される文字列を抽出した専門表現辞書（専門的な言い回しを格納した辞書）を作成することができる。
【０１０８】
抽出した文字列が専門表現と一般表現を組合わせた表現かどうかを判断する際の文字列の分割は、１度文字列を分割した結果を利用して文字列の再分割を行う方法である。これにより、複数の表現で構成される文字列を１度に分割するよりも効率よく表現の組み合せであるかの判断が行える。
【０１０９】
この効果は特に、文字列を２つに分割するのを繰り返し行うときに生じる。例えば、図１８の文字列分類テーブル３６の文字列“ネットワークの構築”を分割するとき１度に“ネットワーク／の／構築”（／は単語の分割区切りを表す）の３つに分割して一般・専門表現であると判断するよりも効率がよい。
【０１１０】
（Ｃ）他の実施の形態
（ｃ−１）第１および第２の実施形態においては、テキストから抽出した文字列を様々なレベルの表現に分類することを特徴とするものであり、文書中からキーワードを検索する際に必要となる、キーワードの抽出や、機械翻訳などのシステムで用いる専門用語辞書の自動抽出などに適用することができる。
【０１１１】
（ｃ―２）第１および第２の実施の形態において、分類後の文字列をキーワードテーブルおよび文字列分類テーブルに格納したが、抽出した文字列の出力を行う際は、各テーブルの形式を変更して出力してもよいし、文字列の重要度の大きさや分類コードおよび文字列の類似度に基いて並び替えるなど各種変形が可能である。また、出力はキーワードテーブルや文字列分類テーブルに格納されている最終的な分類結果のみに限定せず、キーワードの分類における各処理過程の結果も任意に出力してもよい。
【０１１２】
（ｃ−３）本発明の処理にかかわらず、特定の文字列を専門表現または一般表現等に分類を固定したい場合は、その文字列の固定する分類を記憶装置に登録しておき、分類処理の前に抽出した文字列が分類を固定する文字列に該当するかを判断し、該当する場合は、登録された分類を付与する構成にしてもよい。
【０１１３】
【発明の効果】
以上のように、本発明によれば、抽出した文字列を自動的に分類する文字列自動分類装置に関し、文字列の文書内での重要度と複数の文書ファイル全体での重要度を考慮して、その文字列の重要度を決定する構成にしたことで、抽出した文字列を文書内での重要度のみで判断することなく、専門表現や一般表現に分類することができる。
【０１１４】
また、専門表現と一般表現のほかに、この２つ要素の組み合わせで構成されている文字列であることを判断することで、３つに分類することができる。
文字列が組み合わせで構成されているとの情報を得ることで、専門用語辞書を作成する際に不要な語句を除去することが可能となり、また、専門表現辞書（専門的な言い回しを格納した辞書）を作成することが可能となる。
【図面の簡単な説明】
【図１】本発明の文字列自動抽出装置の第１の実施の形態を示すブロック図である。
【図２】文書内重要度テーブルを示す図である。
【図３】文書間重要度テーブルを示す図である。
【図４】キーワードテーブルを示す図である。
【図５】重要度計算部の機能を示すブロック図である。
【図６】本発明の文字列自動抽出装置の動作を示すフローチャートである。
【図７】文書間重要度処理の動作を示すフローチャートである。
【図８】文字列重要度処理の動作を示すフローチャートである。
【図９】第１の実施の形態の文字列分類処理の動作を示すフローチャートである。
【図１０】文書内重要度テーブルに格納された途中結果例を示す図である。
【図１１】文書間重要度テーブルに格納された例を示す図である。
【図１２】文書内重要度テーブルに格納された例を示す図である。
【図１３】キーワードテーブルに格納された例を示す図である。
【図１４】本発明の文字列自動抽出装置の第２の実施の形態を示すブロック図である。
【図１５】第２の実施の形態の文字列分類処理の動作を示すフローチャートである。
【図１６】文字列分類テーブルを示す図である。
【図１７】文字列分類テーブルの途中結果例を示す図である。
【図１８】文字列分類テーブルの例を示す図である。
【図１９】文字列が２つの要素で構成されているかの判断方法を示す図である。
【符号の説明】
１・・入力装置
１１・・入力部
１２・・出力部
２・・処理装置
２１・・文字列抽出部
２２・・重要度計算部
２３・・文字列分類部
２４・・文字列複数分類部
３・・記憶装置
３１・・文書ファイル
３２・・文書内重要度テーブル
３３・・文書間重要度テーブル
３４・・キーワードテーブル
３５・・バッファ
３６・・文字列分類テーブル。
[0001]
[Technical field of the invention]
The present invention relates to a method and apparatus for classifying an arbitrary character string extracted from text as either a specialized expression used in a specific field or a general expression used regardless of field.
[0002]
[Prior art]
Machine translation systems that automatically translate documents converted into data usable by an information processing device, and retrieval systems that search for documents using a given keyword, both require processing that extracts character strings carrying a definite meaning from a document.
Reference 1: "Information Processing Society of Japan Research Report, Vol. 93, No. 61 (93-NL-96-1)"
The technique disclosed in Reference 1 extracts character strings of every length starting at every character in a text document (where the text length is L, the character strings of length n (1 ≤ n ≤ L − i) starting at the i-th character of the text (1 ≤ i ≤ L); hereinafter "n-gram character strings") and counts how often each one appears, thereby extracting character strings from the text document to be processed. This technique is characterized in that it requires neither morphological analysis nor a dictionary and extracts character strings by statistical processing alone. However, because the character strings appearing in the text are extracted exhaustively on the basis of character count and number of appearances, character strings that make no sense as language (hereinafter "fragmentary character strings") are mixed into the result.
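The exhaustive extraction described above can be sketched as follows. This is a minimal illustration, not the procedure of Reference 1 itself: the function name `count_ngrams` and the optional `max_len` cap are assumptions added here to keep the example small.

```python
from collections import Counter
from typing import Optional


def count_ngrams(text: str, max_len: Optional[int] = None) -> Counter:
    """Count every substring of `text`, optionally capped at `max_len` characters."""
    counts: Counter = Counter()
    length = len(text)
    for start in range(length):
        # Longest substring that can begin at this position.
        longest = length - start
        if max_len is not None:
            longest = min(longest, max_len)
        for n in range(1, longest + 1):
            counts[text[start:start + n]] += 1
    return counts
```

Because every start position and every length is enumerated, the number of extracted strings grows quadratically with the text length, which is one reason fragmentary strings dominate the raw output.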
[0003]
A continuous character string that should be recognized as a single unit, such as a word or phrase (hereinafter an "expression"), characteristically appears frequently in text and is preceded and followed by a wide variety of words.
Reference 2: "Information Processing Society of Japan Research Report, Vol. 95, No. 110 (95-NL-110-11)"
The technique disclosed in Reference 2 uses this characteristic to extract valid continuous character strings by calculating the variance value of the characters immediately preceding an arbitrary character string and the variance value of the characters immediately following it. In this technique, character strings with a low variance value are removed as fragmentary character strings from the character strings extracted by the technique of Reference 1, so that only meaningful character strings, that is, "expressions", are extracted.
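The idea of filtering by the diversity of adjacent characters can be illustrated with a short sketch. The text does not define the "variance value" precisely, so the code below substitutes the Shannon entropy of the neighbouring-character distribution as a stand-in; that substitution, the function names, and the `min_entropy` cutoff are all assumptions made for illustration.

```python
import math
from collections import Counter


def neighbor_entropy(text: str, target: str, side: str) -> float:
    """Entropy (bits) of the characters just before or just after each occurrence of `target`."""
    neighbors: Counter = Counter()
    start = text.find(target)
    while start != -1:
        pos = start - 1 if side == "before" else start + len(target)
        if 0 <= pos < len(text):
            neighbors[text[pos]] += 1
        start = text.find(target, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def is_fragment(text: str, target: str, min_entropy: float = 1.0) -> bool:
    """Treat `target` as fragmentary when either side shows little neighbour diversity."""
    before = neighbor_entropy(text, target, "before")
    after = neighbor_entropy(text, target, "after")
    return min(before, after) < min_entropy
```

A string that is always preceded by the same character is probably a piece of a longer expression, and the low score on that side flags it for removal.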
[0004]
[Problems to be solved by the invention]
In general, "expressions" exist at various levels (for example, specialized expressions such as technical terms and proper nouns, and general expressions such as general terms and idioms).
However, although conventional character string extraction methods exclude fragmentary character strings from the character strings extracted from a document and extract meaningful character strings, that is, "expressions", the extracted result is a mixture of expressions of various levels.
Consequently, when the extraction result is actually used, it must be further classified according to the intended application, which is a problem.
[0005]
In view of the above, an object of the present invention is to provide an automatic character string classification device that can appropriately classify an arbitrary character string extracted from a text document as either a specialized expression used in a specific field or a general expression used regardless of field.
[0006]
[Means for Solving the Problems]
In order to solve this problem, the present invention provides an automatic character string classification device comprising: document storage means for storing a plurality of documents written in natural language; character string extraction means for extracting character strings from an arbitrary document among the plurality of documents; intra-document importance calculation means for calculating, as an intra-document importance, the importance of each extracted character string within the document from which it was extracted; inter-document importance calculation means for calculating, as an inter-document importance, the importance of each extracted character string across the plurality of documents as a whole; character string importance calculation means for calculating, as a character string importance, the importance of each extracted character string on the basis of the intra-document importance and the inter-document importance; and character string classification means for classifying the extracted character strings on the basis of the character string importance obtained by the character string importance calculation means. The character string classification means has a first classification unit that, for every extracted character string, compares the character string importance assigned to it with a predetermined threshold and classifies the character string as either a general expression used regardless of any specific field or a specialized expression used in a specific field, and a second classification unit that splits each character string classified as a specialized expression by the first classification unit and, when every component of the split character string exists in the first classification result, reclassifies the pre-split character string, according to the combination of its components, from a specialized expression to either a general-specialized expression (a specialized expression used together with a general expression) or a general expression.
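The two classification units can be sketched roughly as below. The labels GC (general), DC (specialized) and CGD (a specialized expression used together with a general expression) and the combination rules (GC+GC gives GC; GC+DC, GC+CGD and DC+CGD give CGD) follow the second embodiment of the description. The function names, the dictionary representation, and the repeat-until-stable loop structure are illustrative assumptions rather than the claimed implementation.

```python
from typing import Dict

# Unordered component-label pairs that turn a DC string into CGD.
CGD_PAIRS = {
    frozenset(["GC", "DC"]),
    frozenset(["GC", "CGD"]),
    frozenset(["DC", "CGD"]),
}


def first_pass(importance: Dict[str, float], threshold: float) -> Dict[str, str]:
    """First classification unit: compare each importance with the threshold."""
    return {s: ("DC" if w > threshold else "GC") for s, w in importance.items()}


def second_pass(labels: Dict[str, str]) -> Dict[str, str]:
    """Second classification unit: re-label DC strings from their two components."""
    labels = dict(labels)
    changed = True
    while changed:  # repeat until a full sweep changes nothing
        changed = False
        for s in list(labels):
            if labels[s] != "DC":
                continue
            for cut in range(1, len(s)):  # slide the split point one character at a time
                a, b = s[:cut], s[cut:]
                if a not in labels or b not in labels:
                    continue
                pair = frozenset([labels[a], labels[b]])
                if pair == frozenset(["GC"]):
                    labels[s] = "GC"
                elif pair in CGD_PAIRS:
                    labels[s] = "CGD"
                else:
                    continue
                changed = True
                break
    return labels
```

Splitting into only two parts and reusing labels already revised in earlier sweeps lets a three-part string be resolved without ever enumerating three-way splits.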
[0009]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an automatic character string classification device according to an embodiment of the present invention will be described in detail with reference to the drawings.
(A) First embodiment
FIG. 1 is a block diagram showing a first embodiment of the automatic character string classification apparatus of the present invention, which is realized on an information processing apparatus such as a workstation or a personal computer.
[0010]
In FIG. 1, the automatic character string classification device includes an input/output device 1, a processing device 2, and a storage device 3. The input/output device 1 has functions for inputting text, displaying extraction results, and the like. The processing device 2 has a function of extracting continuous character strings and executing the various processes for classifying the result. The storage device 3 has a function of storing the input text and the processing result of each stage.
[0011]
Further, the input/output device 1 includes an input unit 11 and an output unit 12. The input unit 11 has a function of inputting the text documents that serve as data, and consists of, for example, a keyboard. The output unit 12 has a function of displaying the extracted character strings and the like, and consists of, for example, a display or a printer.
[0012]
The processing device 2 includes a character string extraction unit 21, an importance calculation unit 22, and a character string classification unit 23. The character string extraction unit 21 reads a document file, described later, and extracts the n-gram character strings contained in that document file. For this extraction, the character string extraction method of Reference 1 is used, for example. The extraction result may additionally be processed to remove fragmentary character strings, as in Reference 2.
Regardless of the methods described in the prior art, any method that can extract character strings such as words, phrases, or clauses may be used.
[0013]
The importance calculation unit 22 calculates the intra-document and inter-document importance of each n-gram character string extracted by the character string extraction unit 21 and, from these two values, obtains a final weighted importance for the character string (hereinafter "character string importance"). The character string classification unit 23 classifies each character string, on the basis of the importance assigned to it by the importance calculation unit 22, as either a specialized expression used in a specific field or a general expression used widely in ordinary documents regardless of field. Details of the importance calculation unit 22 and the character string classification unit 23 are described later.
[0014]
The storage device 3 includes a document file 31, an intra-document importance table 32, an inter-document importance table 33, a keyword table 34, and a buffer 35.
The document file 31 stores, as a document file, a text document serving as data input from the input unit 11. In the present embodiment there are a plurality of document files 31; the field (content) of the document files 31 is not limited, and each document file may belong to a different field.
[0015]
FIG. 2 shows an example of the intra-document importance table 32. The intra-document importance table 32 has a character string storage unit that stores the n-gram character strings generated from a document file 31 by the character string extraction unit 21, an intra-document importance storage unit that stores the importance of each character string within the document, and a character string importance storage unit that stores the weighted importance obtained by factoring the inter-document importance of the character string into its intra-document importance.
[0016]
One intra-document importance table 32 is provided for each of the plurality of document files 31.
FIG. 3 shows an example of the inter-document importance table 33. The inter-document importance table 33 has a character string storage unit that stores the n-gram character strings generated from the document files 31 by the character string extraction unit 21, an appearing-document-count storage unit that stores the number of other document files 31 in which an n-gram character string extracted from one of the document files 31 appears, and an inter-document importance storage unit that stores the importance of each character string across the plurality of document files 31. In this embodiment, it consists of a single table.
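One way to picture the two tables is as keyed records: one table per document file for the per-document values, and a single shared table for the cross-document values. The class and field names below are hypothetical and chosen only to mirror the storage units named in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IntraDocRecord:
    """One row of a per-document table (table 32)."""
    intra_importance: float = 0.0   # e.g. frequency of the string in this document
    string_importance: float = 0.0  # weighted importance, filled in later


@dataclass
class InterDocRecord:
    """One row of the single shared table (table 33)."""
    document_count: int = 0         # number of documents containing the string
    inter_importance: float = 0.0   # importance across all documents


@dataclass
class Storage:
    """Keyed by character string; one intra-document table per document file."""
    intra_tables: List[Dict[str, IntraDocRecord]] = field(default_factory=list)
    inter_table: Dict[str, InterDocRecord] = field(default_factory=dict)
```

Keying both tables by the character string makes the later lookup of a string's cross-document value from its per-document row a direct dictionary access.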
[0017]
The inter-document importance table 33 is created from the character strings stored in the plurality of intra-document importance tables 32 generated for the respective document files 31.
[0018]
The keyword table 34 stores character strings (keywords) that are determined to be important in the document file 31 by the character string classification unit 23 and extracted.
FIG. 4 shows an example of the keyword table 34. The keyword table 34 has a character string storage unit that stores the extracted character strings, and an importance storage unit that stores the importance, taken from the character string importance storage unit of the intra-document importance table 32, of each character string (keyword) that the character string classification unit 23 judged to be important in the document file 31 and extracted. One keyword table 34 is provided for each of the plurality of document files 31.
[0019]
The buffer 35 stores values obtained in the course of each process, intermediate results, and the like.
[0020]
Here, the importance calculation unit 22 will be described in detail.
FIG. 5 is a configuration diagram showing the functions of the importance calculation unit 22.
The importance calculation unit 22 further includes an intra-document importance calculation unit 221, an inter-document importance calculation unit 222, and a character string importance calculation unit 223. Each of these performs its processing in conjunction with the intra-document importance table 32 and the inter-document importance table 33.
[0021]
The in-document importance calculation unit 221 reads one document file 31 from the plurality of document files 31 stored in the storage device 3 and calculates, for each n-gram character string extracted from that document file 31, its importance within the document file 31. As the in-document importance, for example, the appearance frequency of the character string in the document file 31 is used. The calculated in-document importance is stored in the in-document importance storage unit of the in-document importance table 32. This in-document importance calculation process is repeated for all document files 31.
[0022]
The appearance frequency can be obtained at the same time when the n-gram character string is extracted by the methods of Documents 1 and 2.
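As a non-limiting illustration of this step, the following Python sketch extracts every character n-gram from a text and counts its occurrences in the same pass, so that the appearance frequency is available as soon as extraction finishes. The function name `count_ngrams` and the optional length cap `max_len` are assumptions introduced here for illustration and are not taken from the cited documents.

```python
from collections import Counter
from typing import Optional


def count_ngrams(text: str, max_len: Optional[int] = None) -> Counter:
    """Count every character n-gram of `text` (hypothetical helper).

    Each start position yields substrings of length 1 up to the end of
    the text, or up to `max_len` characters when a cap is given.
    """
    limit = len(text) if max_len is None else max_len
    counts: Counter = Counter()
    for start in range(len(text)):
        longest = min(limit, len(text) - start)
        for n in range(1, longest + 1):
            counts[text[start:start + n]] += 1
    return counts


# One in-document importance table: character string -> appearance frequency.
in_document_table = count_ngrams("abcab", max_len=2)
```

Without a cap the number of substrings grows quadratically with the text length, which is why the sketch exposes `max_len`.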
[0023]
The inter-document importance calculation unit 222 calculates the importance of the extracted n-gram character string in the plurality of document files 31 as a whole.
First, one record is read from the in-document importance table 32 corresponding to a document file 31, and the number of document files 31 in which the n-gram character string appears is counted. This count is performed over the in-document importance tables 32 corresponding to all of the document files 31, and the total number of document files 31 in which a given character string appears is defined as its number of appearing documents. A large number of appearing documents indicates that the character string appears frequently regardless of the specific document file 31; conversely, a small number indicates that the character string appears only in specific document files 31.
[0024]
The obtained number of appearing documents is stored in the appearing document number storage section of the inter-document importance table 33.
Furthermore, the inter-document importance of each character string is calculated. Here, the fewer the document files 31 that contain character string k, the greater its inter-document importance. As this value, for example, the inverse document frequency is used. The inverse document frequency is obtained from the product of the total number of documents and the reciprocal of the number of documents containing a given character string. Letting idf_k be the inverse document frequency of character string k, N the total number of document files 31, and n_k the number of document files 31 containing character string k (the number of appearing documents), idf_k is obtained by the following formula.
[0025]
(Formula 1) $idf_k = \log(N / n_k)$
Here, idf_k takes its maximum value when n_k = 1 and its minimum value (= 0) when n_k = N, and its value varies with the number of document files 31 containing character string k. The inter-document importance takes a smaller value as the number of document files 31 containing character string k increases. Conversely, when only a small number of document files 31 contain character string k, the value is large.
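As a non-limiting illustration, Formula 1 can be sketched in Python as follows. The function name is an assumption made for illustration. The base of the logarithm is not stated in this description; the natural logarithm is assumed here because it reproduces the value log(50/43) = 0.15 given in the worked example later in this description.

```python
import math


def inverse_document_frequency(total_docs: int, docs_with_string: int) -> float:
    """Formula 1: idf_k = log(N / n_k), natural logarithm assumed."""
    if not 1 <= docs_with_string <= total_docs:
        raise ValueError("n_k must satisfy 1 <= n_k <= N")
    return math.log(total_docs / docs_with_string)


# Boundary behaviour described in the text:
# n_k = N gives the minimum (0); n_k = 1 gives the maximum, log(N).
idf_min = inverse_document_frequency(50, 50)
idf_max = inverse_document_frequency(50, 1)
```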
[0026]
This inter-document importance calculation process is repeated for all n-gram character strings stored in the character string storage section of the inter-document importance table 33.
The obtained inter-document importance is stored in the inter-document importance storage section of the inter-document importance table 33.
[0027]
The character string importance calculation unit 223 reads the in-document importance from the in-document importance table 32 and the inter-document importance from the inter-document importance table 33, and calculates the character string importance by weighting the in-document importance of the character string with its inter-document importance. The calculation is set so that a character string that appears frequently in a specific document file 31 but hardly appears in other document files 31, that is, a character string related to a specific field, has a high importance, and so that a character string that appears in many document files 31 regardless of the specific document file 31, that is, regardless of the specific field, has a low importance. As a calculation method, for example, the product of the in-document importance and the inter-document importance is used.
[0028]
Letting W_ik be the character string importance of character string k extracted from document file i 31 (1 ≤ i ≤ N), tf_ik the in-document importance representing the appearance frequency of character string k in document file i 31, and idf_k the inter-document importance of character string k, W_ik is obtained by the following Formula 2.
(Formula 2) $W_{ik} = tf_{ik} \times idf_k$
The character string importance takes a high value when idf_k is high because character string k appears in only a few document files, and tf_ik is high because character string k appears frequently in the document file from which it was extracted.
[0029]
If a high importance value is obtained, character string k is an important keyword in the document file. The obtained character string importance is stored in the character string importance storage unit of the in-document importance table 32. This character string importance calculation is repeated for all character strings in the in-document importance table 32.
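As a non-limiting illustration, the weighting of Formula 2 over one in-document importance table can be sketched in Python as follows. Representing each table as a dictionary keyed by character string, and the name `weight_table`, are assumptions made for illustration only.

```python
from typing import Dict


def weight_table(in_doc_importance: Dict[str, float],
                 inter_doc_importance: Dict[str, float]) -> Dict[str, float]:
    """Formula 2 for every string of one document: W_ik = tf_ik * idf_k."""
    return {
        string: tf * inter_doc_importance[string]
        for string, tf in in_doc_importance.items()
    }


# Hypothetical values: "b" occurs in every document (idf = 0), so its
# weighted importance vanishes however often it appears in this document.
weighted = weight_table({"a": 2, "b": 5}, {"a": 0.5, "b": 0.0})
```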
[0030]
Next, the character string classification unit 23 will be described in detail.
The character string classification unit 23 reads one record at a time from the in-document importance table 32, classifies each character string as either a specialized expression or a general expression based on the character string importance stored in the character string importance storage unit, and extracts only the specialized expressions.
[0031]
Classification is performed, for example, by setting a threshold value in advance and comparing the character string importance with it. Letting the threshold be T, a character string whose importance is greater than T is determined to be highly important. A character string determined to be important by this comparison is stored in the keyword table 34 together with its character string importance. This classification process is repeated for all in-document importance tables 32, so that the important character strings in each document file 31 are extracted.
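As a non-limiting illustration, the threshold comparison can be sketched in Python as follows. The name `extract_keywords` and the dictionary representation of the keyword table are assumptions made for illustration. The description does not say how a value exactly equal to T is treated; this sketch assumes it is not registered as a keyword.

```python
from typing import Dict


def extract_keywords(string_importance: Dict[str, float],
                     threshold: float) -> Dict[str, float]:
    """Keep only strings whose importance is strictly greater than T."""
    return {
        string: importance
        for string, importance in string_importance.items()
        if importance > threshold
    }


# Importance values taken from the worked example in this description.
keyword_table = extract_keywords({"in": 1.65, "network": 39.33}, threshold=10)
```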
[0032]
FIG. 6 is a flowchart showing the operation of the automatic character string extraction apparatus of the present invention.
Here, it is assumed that text documents have been input as data from the input unit 11 and stored as a plurality of document files 31 in the storage device 3, and that the total number of document files 31 is known in advance.
[0033]
First, the character string extraction unit 21 reads one document file from the plurality of document files 31 (step 1), extracts n-gram character strings from the read document file, and stores them in the character string storage unit of the in-document importance table 32 (step 2).
[0034]
The frequency with which each extracted n-gram character string appears in the document file is obtained and stored in the in-document importance storage unit of the in-document importance table 32 (step 3).
It is then determined whether any document file 31 has not yet undergone character string extraction. If an unprocessed document file 31 remains, the process returns to step 1; if the last document file 31 has been processed, the process proceeds to step 5 (step 4).
[0035]
Next, using the in-document importance obtained for each character string, the inter-document importance of the character string is obtained.
With reference to all the intra-document importance tables 32 generated in the process up to step 3, the inter-document importance of each character string is calculated and stored in the inter-document importance table 33 (step 5).
[0036]
Next, using the in-document importance and the inter-document importance, the weighted character string importance of each extracted n-gram character string is calculated and stored in the character string importance storage unit of the in-document importance table 32 (step 6).
Finally, the character strings are classified by comparing the importance of the weighted character strings with a preset threshold value (step 7).
[0037]
Here, step 5 for performing the inter-document importance processing will be described in detail with reference to the flowchart of FIG.
First, one record is read from the in-document importance table 32 (step 51). It is determined whether or not the read character string is already stored in the character string storage unit of the inter-document importance degree table 33. If it is stored, the process proceeds to step 54.
[0038]
If it is not stored, the process proceeds to step 53, where the character string is stored in the character string storage unit of the inter-document importance table 33, and then proceeds to step 54 (steps 52, 53). In step 54, the number of appearing documents of the character string is incremented by 1 (step 54). It is then determined whether an unprocessed record remains in the in-document importance table 32. If so, the process returns to step 51; if not, the process proceeds to step 56 (step 55).
[0039]
Next, it is determined whether the processing up to step 55 has been executed for all in-document importance tables 32. If an unprocessed in-document importance table 32 remains, the process proceeds to step 57; otherwise, it proceeds to step 58 (step 56). In step 57, the process moves to the next in-document importance table 32, so that the processing from step 51 to step 55 is performed for all in-document importance tables 32 (step 57).
[0040]
Next, one record is read from the inter-document importance table 33 (step 58). The inter-document importance of the character string is calculated using the number of appearing documents and the total number of document files (step 59), and the obtained inter-document importance is stored in the inter-document importance storage unit of the inter-document importance table 33 (step 511). Next, it is determined whether all records in the inter-document importance table 33 have been processed. If an unprocessed record remains, the process returns to step 58; otherwise, the process ends (step 512).
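As a non-limiting illustration, the flow of FIG. 7 can be sketched in Python as follows. Each in-document importance table is assumed to be a dictionary mapping a character string to its appearance frequency, and the single inter-document table is returned as a dictionary mapping each string to a pair (number of appearing documents, inter-document importance). These representations, the function name, and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def build_inter_document_table(
    in_doc_tables: List[Dict[str, int]],
) -> Dict[str, Tuple[int, float]]:
    """Count appearing documents per string, then apply Formula 1."""
    appearing_docs: Dict[str, int] = {}
    for table in in_doc_tables:            # steps 56-57: every table
        for string in table:               # steps 51 and 55: every record
            # steps 52-54: register the string if new, then add 1
            appearing_docs[string] = appearing_docs.get(string, 0) + 1

    total_docs = len(in_doc_tables)
    # steps 58-512: one idf value per record of the inter-document table
    return {
        string: (n_k, math.log(total_docs / n_k))
        for string, n_k in appearing_docs.items()
    }


tables = [{"a": 3, "b": 1}, {"a": 1}, {"a": 2, "c": 4}]
inter_document_table = build_inter_document_table(tables)
```

In this hypothetical input, "a" occurs in all three documents and therefore receives an inter-document importance of 0.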
[0041]
Next, step 6 for performing the character string importance process will be described in detail with reference to the flowchart of FIG.
[0042]
First, one record is read from the in-document importance table 32 (step 61). The inter-document importance corresponding to the read character string is looked up in the inter-document importance table 33, the weighted character string importance is calculated (step 62), and the obtained character string importance is stored in the character string importance storage unit of the in-document importance table 32 currently being processed (step 63).
[0043]
It is then determined whether an unprocessed record remains in the in-document importance table 32. If so, the process returns to step 61; if not, the process proceeds to step 65 (step 64). Next, it is determined whether the processing up to step 64 has been executed for all in-document importance tables 32, that is, whether the weighted character string importance has been calculated for all of them.
[0044]
If there is still an unprocessed in-document importance table 32, the process proceeds to step 66. If there is no unprocessed in-document importance table 32, the process proceeds to step 7 in FIG. 6 (step 65). If there is an unprocessed in-document importance level table 32, the process moves to the next in-document importance level table 32, and the processes from step 61 to step 65 are performed for all the in-document importance level tables 32 (step 66).
[0045]
Next, step 7 for performing classification processing of weighted character strings will be described in detail with reference to the flowchart of FIG.
[0046]
First, one record is read from the in-document importance table 32 (step 71). The read character string importance is compared with a preset threshold value. If the character string importance is larger than the threshold value, the process proceeds to step 73; if it is smaller, the process proceeds to step 74 (step 72). A character string importance determined to be larger than the threshold value and its corresponding character string are stored in the importance storage unit and the character string storage unit, respectively, of the keyword table 34 corresponding to the in-document importance table 32 (step 73).
[0047]
It is then determined whether an unprocessed record remains in the in-document importance table 32. If so, the process returns to step 71; if not, the process proceeds to step 75 (step 74). Next, it is determined whether the processing up to step 74 has been executed for all in-document importance tables 32, that is, whether the character strings of all tables have been classified.
[0048]
If an unprocessed in-document importance table 32 remains, the process proceeds to step 76; otherwise, the process ends (step 75). In step 76, the process moves to the next in-document importance table 32, so that the processing from step 71 to step 75 is performed for all in-document importance tables 32 (step 76).
[0049]
Next, the processing steps of the present invention will be specifically described with reference to actual cases and the flowcharts of FIGS.
[0050]
It is assumed that N document files 31 are stored in the storage device 3.
FIG. 10 shows an example of the in-document importance table 32.
[0051]
First, n-gram character strings are extracted from the content of document file i 31 (1 ≤ i ≤ N), and the appearance frequency of each extracted character string in document file i 31 is obtained and stored in the corresponding in-document importance table 32 (steps 1, 2, and 3 in FIG. 6).
[0052]
FIG. 10 shows an in-document importance table 32 that stores extracted character strings and their appearance frequencies (in-document importance). The processing of steps 1, 2, and 3 in FIG. 6 is performed for each document file i 31. When this processing is finished, an in-document importance table i 32 (1 ≤ i ≤ N) has been created for each document file i 31 (1 ≤ i ≤ N).
[0053]
Next, the inter-document importance of each character string is calculated and stored in the inter-document importance table 33. FIG. 11 shows an example of the inter-document importance table 33. Here, taking the in-document importance table i 32 of FIG. 10 as the processing target, it is determined whether the read character string k = “in” is present in the other in-document importance tables 32. Each table in which it is present is counted as an appearance in another document, and the total is stored in the appearing document number storage unit of the inter-document importance table 33. Here, the number of appearing documents is n_k = 43 (steps 51 to 57 in FIG. 7).
[0054]
Next, the inter-document importance is obtained. With the number of document files 31 N = 50, character string k = “in”, and number of appearing documents n_k = 43, Formula (1) gives the inter-document importance of character string k in document file i 31 as idf_k = log(50/43) = 0.15, and 0.15 is stored in the inter-document importance storage unit (steps 58 to 511 in FIG. 7). FIG. 11 shows the contents of the inter-document importance table 33 after the inter-document importance has been calculated for the in-document importance table 32 shown in FIG. 10.
[0055]
Next, the weighted character string importance of each n-gram character string is obtained. From the in-document importance table i 32 of FIG. 10, the in-document importance tf_ik = 11 of character string k = “in” is read. Further, from the inter-document importance table 33 of FIG. 11, the inter-document importance idf_k = 0.15 of character string k = “in” is read. According to Formula (2), the importance of character string k in document file i 31 is W_ik = 11 × 0.15 = 1.65, and the obtained value is stored in the character string importance storage unit of the in-document importance table i 32 (steps 61 to 63 in FIG. 8).
[0056]
This processing is repeated for all character strings in the in-document importance table i 32 (steps 65 and 66). FIG. 12 shows the contents of the in-document importance table i 32 of FIG. 10 after the character strings have been weighted using the inter-document importance table 33 of FIG. 11. Although not shown, if there are in-document importance tables 32 other than the in-document importance table i 32, the process moves to the next table and the same processing is repeated.
[0057]
Finally, an n-gram character string classification process is performed.
From the in-document importance table i 32 of FIG. 12, the character string k = “in” and its weighted character string importance W_ik = 1.65 are read and compared with a predetermined threshold value T = 10. A character string whose importance is larger than the threshold value T is registered as a keyword. Here, since the character string importance 1.65 is smaller than the threshold value 10, the character string is not stored in keyword table i 34 (steps 71 to 74 in FIG. 9).
[0058]
This processing is repeated for all records in the in-document importance table i 32 (steps 75 and 76 in FIG. 9).
[0059]
When the same processing is continued, the character string importance W_ik for the character strings k = “de” and “no” is 0, which is smaller than the threshold value 10, so they are not registered as keywords. In contrast, the character string importance W_ik of the character string k = “network” is 39.33, which is larger than the threshold value 10, so the character string “network” and its character string importance are stored in keyword table i 34.
[0060]
FIG. 13 shows the contents after the classification process with the threshold T = 10 has been performed on the in-document importance table i 32 of FIG. 12.
The above processing is performed on the plurality of document files (1 to N). As a result, a corresponding keyword table 34 is created for each of the document files 31.
[0061]
<Effect of the first embodiment>
According to the first embodiment of the present invention, n-gram character strings extracted from text can be classified into specialized expressions and general expressions. Because the inter-document importance is taken into account rather than judging by the in-document importance alone, the specialized expressions and general expressions in each document can be classified relative to the other documents. In other words, even a character string with a low appearance frequency in a document can be registered as a keyword if it is judged to be highly specialized (that is, if it appears only in specific documents), because its importance as a specialized expression is then high.
[0062]
Also, appropriate classification can be performed according to the contents of the prepared document file. For example, in the first embodiment, the character string “network” appears only in a small number of document files, so it can be determined that the expertise in the document is high, and can be classified as a specialized expression. However, if all the document files are network-related papers or the like, the importance of the character string “network” becomes low and cannot be extracted. This feature is effective when extracting keywords to be used in the keyword search device.
[0063]
(B) Second embodiment
FIG. 14 is a block diagram showing a second embodiment of the automatic character string classification device of the present invention. In the second embodiment, blocks having the same functions as those in the first embodiment are given the same numbers, and only blocks different from the first embodiment in the second embodiment will be described in detail.
[0064]
The processing device 2 includes a character string multiple classification unit 24 instead of the character string classification unit 23 in the first embodiment.
[0065]
The character string multiple classification unit 24 classifies the extracted n-gram character strings, based on the importance assigned to each character string by the importance calculation unit 22, into three types: specialized expression, general expression, or a combination of general expression and specialized expression.
[0066]
First, one record is read from the in-document importance table 32, and character strings are classified based on the importance of the character string stored in the character string importance storage unit.
[0067]
Classification is performed, for example, by setting a threshold value in advance and comparing the character string importance with it. Letting the threshold be T, when the importance of a character string is greater than T, the character string is determined to be highly important and is assigned a classification code representing a specialized expression (DC = domain dependent collocation; hereinafter referred to as specialized expression DC); when it is smaller than T, the character string is assigned a classification code representing a general expression (GC = general collocation; hereinafter referred to as general expression GC).
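As a non-limiting illustration, this first assignment of classification codes can be sketched in Python as follows. The function name and the dictionary representation are assumptions made for illustration. The description does not specify the code for an importance exactly equal to T; this sketch assumes general expression GC in that case.

```python
from typing import Dict


def assign_initial_codes(string_importance: Dict[str, float],
                         threshold: float) -> Dict[str, str]:
    """Assign "DC" above the threshold T and "GC" otherwise."""
    return {
        string: "DC" if importance > threshold else "GC"
        for string, importance in string_importance.items()
    }


# Importance values taken from the worked example of the first embodiment.
codes = assign_initial_codes({"in": 1.65, "network": 39.33}, threshold=10)
```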
[0068]
In this process, a classification code of either specialized expression DC or general expression GC is first assigned to all character strings, and then classification codes are reassigned based on those codes, taking the arrangement of the characters into account. Here, in addition to specialized expression DC and general expression GC, an expression that is a combination of these two is assigned a classification code representing a general/specialized expression (CGD = combination of general and domain dependent collocation; hereinafter referred to as general/specialized expression CGD).
[0069]
After the classification code assignment process is completed, each character string and its classification code are stored in the character string classification table 36. This classification code assignment process is repeated for all in-document importance tables 32 to classify the character strings in each document file 31.
[0070]
Note that a general/specialized expression CGD is formed regardless of the order in which the two expressions are combined, whether specialized expression DC followed by general expression GC or general expression GC followed by specialized expression DC. Further, a combination of specialized expression DC and general/specialized expression CGD and a combination of general expression GC and general/specialized expression CGD also constitute a general/specialized expression CGD. In these cases as well, the order of the two expressions does not matter.
[0071]
The storage device 3 includes a character string classification table 36 instead of the keyword table 34 in the first embodiment.
[0072]
The character string classification table 36 stores character strings (keywords) that are determined to be important in the document file 31 by the character string multiple classification unit 24 and extracted, and a classification code assigned to each character string.
[0073]
FIG. 16 shows an example of the character string classification table 36. The character string classification table 36 includes a character string storage unit that stores the extracted character strings, and a classification code storage unit that stores the classification codes assigned to the character strings by the character string multiple classification unit 24. The character string classification table 36 corresponds to each of a plurality of document files.
[0074]
Next, the character string multiple classification process, which is a feature of the second embodiment, will be described using a flowchart. Processes other than the character string multiple classification process in the present invention are the same as those in the first embodiment.
[0075]
FIG. 15 is a flowchart showing the operation of the character string plural classification unit.
The symbol “+” described in step 720 and the like in FIG. 15 represents a combination of character string elements. For example, “GC + DC” represents a combination of a character string that is an element of general expression and a character string that is an element of specialized expression. The order of classification codes is not related. For example, when “GC + CGD” is described, the order of arrangement of “CGD + GC” is also included.
[0076]
First, one record is read from the in-document importance table 32 (step 711). The read character string importance level is compared with a preset threshold value. If the character string importance level is larger than the threshold value, the process proceeds to step 713, and if it is smaller than the threshold value, the process proceeds to step 714 (step 712). If it is determined that the value is larger than the threshold value, the specialized expression DC is assigned to the character string (step 713).
[0077]
If the character string importance is determined to be smaller than the threshold value, general expression GC is assigned to the character string, and the process proceeds to step 715 (step 714). The character string and its assigned classification code are stored in the character string storage unit and the classification code storage unit, respectively, of the character string classification table 36 corresponding to the in-document importance table 32 from which the character string was read (step 715).
[0078]
It is then determined whether an unprocessed record remains in the in-document importance table 32. If so, the process returns to step 711, and the processing from step 711 to step 715 is repeated, so that a classification code of either specialized expression DC or general expression GC is assigned to every character string in the in-document importance table 32. If no unprocessed record remains, the process proceeds to step 717 (step 716). In step 717, the flag is set to 0 (step 717).
[0079]
Next, one record is read from the character string classification table 36 (step 718). It is determined whether the character string element of the read record is “general expression GC” or “general / specialized CGD”. If it corresponds to one of the two expressions, the process proceeds to step 725, and if not, the process proceeds to step 720 (step 719).
[0080]
Next, it is determined whether the element of the character string is “combination of general expression GC and general expression GC”. If this combination is applicable, the process proceeds to step 721, and if not, the process proceeds to step 722 (step 720). If this combination is applicable, the general expression GC is assigned to the character string (step 721), and the classification code stored in the classification code storage unit of the character string classification table 36 is replaced with the general expression GC (step 729). The flag is set to 1 (step 724), and the process proceeds to step 725.
[0081]
Furthermore, it is determined whether the elements of the character string are a “combination of general expression GC and specialized expression DC”, a “combination of general expression GC and general/specialized expression CGD”, or a “combination of specialized expression DC and general/specialized expression CGD”. If one of these three combinations applies, the process proceeds to step 723; if not, the process proceeds to step 725 (step 722).
[0082]
When one of these three combinations applies, general/specialized expression CGD is assigned to the character string (step 723), the classification code stored in the classification code storage unit of the character string classification table 36 is replaced with general/specialized expression CGD (step 729), the flag is set to 1 (step 724), and the process proceeds to step 725.
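As a non-limiting illustration, the combination rules judged in steps 720 and 722 can be sketched in Python as an order-insensitive lookup. The names `COMBINATION_RULES` and `combine_codes` are assumptions made for illustration. Returning `None` for combinations the description does not list (for example, two specialized expressions DC) is also an assumption; it stands for leaving the existing code unchanged.

```python
from typing import Optional

# Keys are unordered sets of the two element codes, so "GC + DC" and
# "DC + GC" share one entry. frozenset(["GC", "GC"]) collapses to {"GC"}.
COMBINATION_RULES = {
    frozenset(["GC"]): "GC",           # GC + GC   (step 720)
    frozenset(["GC", "DC"]): "CGD",    # GC + DC   (step 722)
    frozenset(["GC", "CGD"]): "CGD",   # GC + CGD  (step 722)
    frozenset(["DC", "CGD"]): "CGD",   # DC + CGD  (step 722)
}


def combine_codes(code_a: str, code_b: str) -> Optional[str]:
    """Return the reassigned code for two element codes, or None."""
    return COMBINATION_RULES.get(frozenset([code_a, code_b]))
```

Using unordered sets as keys mirrors the statement that the order of the two expressions does not matter.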
[0083]
It is then determined whether an unprocessed record remains in the in-document importance table 32. If so, the process returns to step 718, and the processing from step 718 to step 725 is repeated, so that a classification code of specialized expression DC, general expression GC, or general/specialized expression CGD is assigned to every character string in the in-document importance table 32. If no unprocessed record remains, the process proceeds to step 726 (step 725). In step 726, the flag is set to 0 (step 726).
[0084]
Next, the processing up to step 726 is executed for all the in-document importance degree tables 32, and it is determined whether the character strings have been classified. If there is still an unprocessed in-document importance table 32, the process proceeds to step 728. If there is no unprocessed in-document importance table 32, the process is terminated (step 727).
[0085]
If an unprocessed in-document importance table 32 remains, the process moves to the next in-document importance table 32, so that the processing from step 711 to step 727 is performed for all in-document importance tables 32 (step 728).
[0086]
Here, a method for determining whether a character string is a combination of two elements will be described.
First, it is assumed that the processing from Step 711 to Step 717 is executed, and either a DC or GC classification code is assigned to the character string and stored in the character string classification table 36.
[0087]
To divide the character string into two elements, the delimiter position is shifted one character at a time. The judgment condition is that each of the two resulting character strings exists in the character string classification table 36 and that their classification codes form one of the applicable combinations; when the condition is met, a classification code is assigned. FIG. 19 shows a method of dividing the character string “network construction” into two elements.
[0088]
Here, specialized expression DC is assigned to the character string “network”, general/specialized expression CGD is assigned to the character string “network”, and specialized expression DC is assigned to the character string “construct”. The judgment condition for the combination of elements is a “combination of general expression GC and specialized expression DC” or a “combination of general expression GC and general/specialized expression CGD”.
[0089]
In FIG. 19, the number portion indicates the number of times the character string is divided, the character string A portion and the character string B portion indicate each element of the character string divided into two, and the collation result portion indicates the character string A and the character string B. The result of collating whether both of these elements exist in the character string classification table 36 is shown.
[0090]
First, if the delimiter position of “network construction” is shifted by one character, the character string A becomes “ne” and the character string B becomes “network construction” for the number 1. Since these two elements do not exist in the character string classification table 36, it is determined that they are not a combination of the two elements.
[0091]
Next, in the number 6, the character string A is “network” and the character string B is “construction”. The “network” of the character string A exists in the character string classification table 36, but the “construction” of the character string B does not exist in the character string classification table 36, so it is not a combination of the two elements.
[0092]
In the number 7, the character string A is “network” and the character string B is “construction”. Both the character string A and the character string B exist in the character string classification table 36, and the respective classification codes are the general / special expression CGD and the general expression GC. Therefore, it is determined that the character string “network construction” is “combination of general expression GC and general / special expression CGD”.
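The two-element determination walked through above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the apparatus: the function name `find_two_element_split`, the use of a plain dictionary in place of the character string classification table 36, and the English stand-in strings (where each Latin letter counts as one character, so the split numbers differ from FIG. 19) are all introduced here for illustration.

```python
from typing import Dict, Optional, Tuple

# Judgment conditions named in the text: GC + DC, or GC + CGD (in either order).
ALLOWED_COMBINATIONS = {frozenset(("GC", "DC")), frozenset(("GC", "CGD"))}


def find_two_element_split(
    text: str, table: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Shift the delimiter one character at a time and return the first
    split whose two parts both exist in the table with an allowed
    combination of classification codes; return None if no split matches."""
    for pos in range(1, len(text)):
        part_a, part_b = text[:pos], text[pos:]
        if part_a not in table or part_b not in table:
            continue  # collation fails: at least one element is not registered
        if frozenset((table[part_a], table[part_b])) in ALLOWED_COMBINATIONS:
            return part_a, part_b
    return None


# Hypothetical stand-in for the character string classification table 36.
table = {"network": "DC", "networkno": "CGD", "construction": "GC"}
print(find_two_element_split("networknoconstruction", table))
```

In this sketch the split "network" / "noconstruction" is skipped because the second part is not registered, and the split "networkno" / "construction" is accepted, mirroring numbers 6 and 7 of the walkthrough.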
[0093]
Next, the process of character string multiple classification processing in the second embodiment will be specifically described with reference to actual cases and the flowchart of FIG.
It is assumed that the contents of the in-document importance table 32 of FIG. 12 are obtained by executing the processing from step 1 to step 5 in the flowchart showing the operation of the automatic character string extracting apparatus of FIG.
[0094]
It is assumed that N document files 31 are stored in the storage device 3. From the in-document importance table i 32, the character string k = “in” and the corresponding character string importance W_ik = 1.65 are read and compared with the predetermined threshold T = 10. Since the importance 1.65 is smaller than the threshold 10, the general expression GC is assigned, and the character string and its classification code GC are stored in the character string classification table i 36 (FIG. 15, steps 711 to 715).
[0095]
This processing is repeated for all the records in the in-document importance table i 32 (step 716 in FIG. 15). In the same way, the character string importance W_ik of the character strings k = “de” and “no” is 0, which is smaller than the threshold 10, so the general expression GC is assigned and each character string is stored with the classification code GC in the character string classification table i 36 (steps 714 and 715 in FIG. 15).
[0096]
In contrast, the character string importance W_ik of the character string k = “network” is 39.33, which is larger than the threshold 10, so the specialized expression DC is assigned, and the character string and its classification code DC are stored in the character string classification table i 36 (steps 713 and 715). FIG. 17 shows the contents obtained after the character string multiple classification processing is performed on the in-document importance table i 32 with the threshold T = 10.
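The threshold comparison in this example can be sketched as follows. The function name `classify_by_threshold` and the dictionary layout are assumptions made for illustration; the importance values and the threshold T = 10 are the ones quoted in the example above. The text only covers the "smaller than" and "larger than" cases, so treating an importance exactly equal to the threshold as GC is also an assumption.

```python
from typing import Dict

THRESHOLD = 10.0  # threshold T used in the worked example


def classify_by_threshold(
    importance: Dict[str, float], threshold: float = THRESHOLD
) -> Dict[str, str]:
    """Assign DC when the importance exceeds the threshold, otherwise GC.

    Equality is treated as GC here; the source text does not specify it.
    """
    return {
        string: ("DC" if w > threshold else "GC")
        for string, w in importance.items()
    }


# Character string importance values W_ik quoted in the worked example.
importance_table = {"in": 1.65, "de": 0.0, "no": 0.0, "network": 39.33}
classification_table = classify_by_threshold(importance_table)
print(classification_table)
```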
[0097]
Next, the flag is set to 0 and one record is read from the character string classification table i 36. The classification code of the character string “in” in the first record is the general expression GC, and it is not the last record, so the processing moves to the next record (steps 718, 719, and 725 in FIG. 15). For the character strings k = “de” and “no”, which are read next, the classification code is also the general expression GC.
[0098]
Next, the character string k = “network” is read. This character string does not satisfy any of the conditions shown in steps 719, 720, and 722. Further, since it is not the last record, the processing moves to the next record (steps 718, 719, 722, and 725).
[0099]
Next, the character string k = “network no” (“network” followed by the particle “no”) is read. This character string is composed of two elements, “network” with the specialized expression DC and “no” with the general expression GC. It therefore satisfies the condition of step 722, the general/specialized expression CGD is assigned as the classification code, and the classification code DC already stored in the classification code storage section of the character string classification table i 36 is replaced with the general/specialized expression CGD (steps 718, 722, 723, and 729). The flag is set to 1, and since this is not the last record, the processing moves to the next record (steps 724 and 725).
[0100]
Next, the character string k = “network construction” is read. This character string is divided into the two elements “network no” and “construction”. Because the classification code of “network no” was replaced with the general/specialized expression CGD in the previous processing, the character string satisfies the condition “combination of general/specialized expression CGD and specialized expression DC” in step 722. The general/specialized expression CGD is assigned as the classification code, and the classification code DC already stored in the classification code storage section of the character string classification table i 36 is replaced with the general/specialized expression CGD.
[0101]
The above processing is repeated up to the last record of the character string classification table i 36, and after the last record has been processed, it is determined whether the flag is 0. At this time, since the flag is 1, the process proceeds to step 717 and the flag is set to 0. The processing of steps 718 to 726 is repeated until the flag is 0 in the determination of step 726.
[0102]
When the flag is 0, the character string multiple classification processing is complete for all the character strings stored in the character string classification table 36. In step 727, it is determined whether there is another character string classification table 36, and the processing is executed for all the character string classification tables i 36. FIG. 18 shows the contents of the character string classification table i 36 after the character string multiple classification processing has been performed.
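The flag-controlled repetition described above can be sketched as a loop that keeps passing over the table until one full pass changes nothing. This is a minimal sketch under assumptions, not the flowchart of the apparatus itself: the function name `reclassify`, the dictionary standing in for the character string classification table 36, the English stand-in strings, the sample code for "construction", and the restriction to the two judgment conditions (GC with DC, GC with CGD) are all choices made here for illustration.

```python
from typing import Dict

# Assumed judgment conditions: GC + DC, or GC + CGD (in either order).
COMBINATIONS = {frozenset(("GC", "DC")), frozenset(("GC", "CGD"))}


def reclassify(table: Dict[str, str]) -> Dict[str, str]:
    """Repeat passes over the table until a pass makes no replacement."""
    table = dict(table)  # work on a copy so the input is left untouched
    changed = True
    while changed:
        changed = False  # plays the role of resetting the flag to 0
        for text in list(table):
            if table[text] != "DC":
                continue  # only strings currently classified DC are re-examined
            for pos in range(1, len(text)):
                a, b = text[:pos], text[pos:]
                # Look up the current codes, so earlier replacements are reused.
                if (a in table and b in table
                        and frozenset((table[a], table[b])) in COMBINATIONS):
                    table[text] = "CGD"  # replace DC with CGD
                    changed = True       # plays the role of setting the flag to 1
                    break
    return table


# Hypothetical table contents after the threshold comparison.
first_result = {
    "in": "GC", "de": "GC", "no": "GC", "construction": "GC",
    "network": "DC", "networkno": "DC", "networknoconstruction": "DC",
}
print(reclassify(first_result))
```

Because the lookup always reads the current code of each element, a string reclassified earlier in a pass (here "networkno") is seen with its new code when a longer string containing it is examined.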
[0103]
The above processing is performed on a plurality of document files 31 (1 to N). As a result, corresponding character string classification tables 36 are created for all the document files 31.
[0104]
<Effects of Second Embodiment>
According to the second embodiment of the present invention, in addition to the effects obtained in the first embodiment, the n-gram character strings extracted from the text can be classified into three kinds of expression: specialized expressions, general expressions, and combinations of a specialized expression and a general expression.
[0105]
By classifying the expression into a combination of the technical expression and the general expression, it is possible to remove unnecessary phrases when creating the technical term dictionary.
[0106]
For example, when a technical term dictionary is created from the character string classification table 36, the dictionary can be created using only the character strings to which the specialized expression DC is assigned.
[0107]
Further, a character string to which the general/specialized expression CGD is assigned includes a word used together with a technical term, and therefore represents specialized wording. Accordingly, a specialized expression dictionary (a dictionary storing specialized phrases) can be created by extracting the character strings to which the general/specialized expression CGD is assigned.
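Selecting entries for the two dictionaries by classification code can be sketched as follows. The function name `build_dictionaries` and the sample table contents are hypothetical and serve only to illustrate the filtering step.

```python
from typing import Dict, List, Tuple


def build_dictionaries(table: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (technical term dictionary, specialized expression dictionary).

    Strings coded DC go to the technical term dictionary and strings coded
    CGD go to the specialized expression dictionary; GC strings are dropped.
    """
    technical_terms = sorted(s for s, code in table.items() if code == "DC")
    specialized_phrases = sorted(s for s, code in table.items() if code == "CGD")
    return technical_terms, specialized_phrases


# Hypothetical contents of a character string classification table.
sample_table = {"in": "GC", "no": "GC", "network": "DC", "networkno": "CGD"}
terms, phrases = build_dictionaries(sample_table)
print(terms, phrases)
```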
[0108]
When determining whether an extracted character string is a combination of a specialized expression and a general expression, the character string is divided by re-dividing it using the result of an earlier division. This makes it possible to determine combinations of expressions more efficiently than by dividing a character string composed of a plurality of expressions all at once.
[0109]
This effect is obtained in particular because the character string is repeatedly divided into two. For example, when the character string “network construction” in the character string classification table 36 of FIG. 18 is divided, this is more efficient than dividing it into three at once, as in “network/no/construction” (where / represents a word delimiter), and then judging it to be a general/specialized expression.
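One possible way to read this efficiency argument is to count candidate delimiter placements. The following sketch is an interpretation offered here, not a statement from the source: it assumes that a two-way division tries each of the positions between characters once, and that a three-way division done at once must try every pair of positions. The function names and the string length of 9 are hypothetical.

```python
from math import comb


def two_way_candidates(length: int) -> int:
    """Delimiter positions when a string is cut into two parts."""
    return max(length - 1, 0)


def three_way_candidates(length: int) -> int:
    """Ways to place two delimiters at once, cutting the string into three."""
    return comb(length - 1, 2) if length >= 3 else 0


n = 9  # hypothetical string length, chosen only for illustration
print(two_way_candidates(n), three_way_candidates(n))
```

Under these assumptions the number of two-way candidates grows linearly with the string length, while the number of three-way candidates grows quadratically.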
[0110]
(C) Other embodiments
(C-1) In the first and second embodiments, character strings extracted from text are classified into expressions of various levels. The invention can be applied to the keyword extraction that is needed when searching documents by keyword, and to the automatic extraction of technical term dictionaries used in systems such as machine translation.
[0111]
(C-2) In the first and second embodiments, the character strings after classification are stored in the keyword table and the character string classification table. However, when the extracted character strings are output, various modifications are possible, such as changing the format of each table or rearranging the entries based on character string importance, classification code, or character string similarity. Further, the output is not limited to the final classification result stored in the keyword table or the character string classification table; the result of each process in the classification may also be output as desired.
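As one illustration of such a rearrangement, the records could be ordered by classification code and then by descending importance. The record layout, the function name `sort_for_output`, and the display order of the codes are assumptions made for this sketch; similarity-based ordering is not covered.

```python
from typing import List, Tuple

# (character string, classification code, character string importance)
Record = Tuple[str, str, float]

CODE_ORDER = {"DC": 0, "CGD": 1, "GC": 2}  # assumed display order of the codes


def sort_for_output(records: List[Record]) -> List[Record]:
    """Order by classification code, then by importance (highest first),
    then alphabetically so that ties are resolved deterministically."""
    return sorted(records, key=lambda r: (CODE_ORDER[r[1]], -r[2], r[0]))


records = [("in", "GC", 1.65), ("network", "DC", 39.33), ("de", "GC", 0.0)]
print(sort_for_output(records))
```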
[0112]
(C-3) When it is desired to fix the classification of a specific character string as a specialized expression or a general expression regardless of the processing of the present invention, the fixed classification of that character string may be registered in the storage device. Before the classification processing is performed, it may then be determined whether an extracted character string is one whose classification is fixed, and if so, the registered classification may be assigned.
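The fixed-classification variation can be sketched as a lookup that is consulted before the threshold comparison. The function name `classify_with_fixed`, the `fixed` dictionary standing in for the registrations held in the storage device, and the sample values are all hypothetical.

```python
from typing import Dict


def classify_with_fixed(
    importance: Dict[str, float],
    fixed: Dict[str, str],
    threshold: float = 10.0,
) -> Dict[str, str]:
    """Use the registered classification when one exists; otherwise fall
    back to the threshold comparison (DC above the threshold, else GC)."""
    result = {}
    for string, w in importance.items():
        if string in fixed:
            result[string] = fixed[string]  # registered classification wins
        else:
            result[string] = "DC" if w > threshold else "GC"
    return result


# Hypothetical registration: always treat "server" as a specialized expression.
fixed_classes = {"server": "DC"}
scores = {"server": 2.0, "network": 39.33, "no": 0.0}
print(classify_with_fixed(scores, fixed_classes))
```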
[0113]
[Effects of the invention]
As described above, according to the present invention, in an automatic character string classification device that automatically classifies extracted character strings, the importance of a character string is determined by considering both its importance within a document and its importance across the plurality of document files as a whole. The extracted character strings can therefore be classified into specialized expressions and general expressions without relying only on the importance within a document.
[0114]
Further, in addition to specialized expressions and general expressions, character strings can be classified into a third category by determining that a character string is composed of a combination of these two elements.
By obtaining the information that a character string is composed of such a combination, unnecessary phrases can be removed when creating a technical term dictionary, and a specialized expression dictionary (a dictionary that stores specialized wording) can be created.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of an automatic character string extraction apparatus of the present invention.
FIG. 2 is a diagram illustrating an in-document importance table.
FIG. 3 is a diagram illustrating an inter-document importance table.
FIG. 4 is a diagram showing a keyword table.
FIG. 5 is a block diagram illustrating functions of an importance calculation unit.
FIG. 6 is a flowchart showing the operation of the automatic character string extraction apparatus of the present invention.
FIG. 7 is a flowchart showing an operation of inter-document importance processing.
FIG. 8 is a flowchart showing an operation of character string importance processing.
FIG. 9 is a flowchart illustrating an operation of character string classification processing according to the first embodiment.
FIG. 10 is a diagram illustrating an intermediate result example stored in an in-document importance table.
FIG. 11 is a diagram showing an example stored in an inter-document importance table.
FIG. 12 is a diagram showing an example stored in an in-document importance table.
FIG. 13 is a diagram showing an example stored in a keyword table.
FIG. 14 is a block diagram showing a second embodiment of the automatic character string extraction apparatus of the present invention.
FIG. 15 is a flowchart illustrating an operation of character string classification processing according to the second embodiment.
FIG. 16 is a diagram showing a character string classification table.
FIG. 17 is a diagram illustrating an example of an intermediate result of a character string classification table.
FIG. 18 is a diagram illustrating an example of a character string classification table.
FIG. 19 is a diagram illustrating a method of determining whether a character string is composed of two elements.
[Explanation of symbols]
1. Input device
11. Input section
12. Output section
2. Processing device
21. Character string extraction unit
22. Importance calculation unit
23. Character string classification unit
24. Character string multiple classification unit
3. Storage device
31. Document file
32. In-document importance table
33. Inter-document importance table
34. Keyword table
35. Buffer
36. Character string classification table

Claims

An automatic character string classification device comprising:
document storage means for storing a plurality of documents written in a natural language;
character string extraction means for extracting a character string from an arbitrary document among the plurality of documents;
in-document importance calculation means for calculating, as an in-document importance, the importance of the character string extracted by the character string extraction means within the document from which it was extracted;
inter-document importance calculation means for calculating, as an inter-document importance, the importance of the character string extracted by the character string extraction means across the plurality of documents as a whole;
character string importance calculation means for calculating, as a character string importance, the importance of the extracted character string based on the in-document importance and the inter-document importance; and
character string classification means for classifying the extracted character string based on the character string importance obtained by the character string importance calculation means,
wherein the character string classification means includes:
a first classification unit that compares, for all the extracted character strings, the character string importance assigned to each character string with a predetermined threshold, and classifies all the extracted character strings into general expressions, which are used regardless of a specific field, or specialized expressions, which are used in a specific field; and
a second classification unit that divides a character string classified as a specialized expression by the first classification unit and, when each component of the divided character string exists in the first classification result, reclassifies the character string before division, according to the combination of the components of the divided character string, from the specialized expression into a general/specialized expression, which is a specialized expression used together with a general expression, or into a general expression.

In the automatic character string classification device according to claim 1,
The in-document importance calculating unit determines the in-document importance based on the frequency with which the extracted character string appears in the extracted document.

In the automatic character string classification device according to claim 1,
The inter-document importance calculation unit determines the inter-document importance based on the number of documents in which the extracted character string appears in the whole of the plurality of documents.

In the automatic character string classification device according to claim 1,
The automatic character string classification device, wherein the character string importance calculation means calculates the character string importance weighted by the importance in the document and the importance between documents of the extracted character string.

In the automatic character string classification device according to claim 1,
When the first classification result has been replaced with a new classification, the second classification unit, when referring to the first classification result, refers to the classification result after the replacement with the new classification.

In the automatic character string classification device according to claim 1,
When the components of the divided character string are a combination of the general expression and the specialized expression, the second classification unit reclassifies the character string before division from the specialized expression into the general/specialized expression; when the components of the divided character string are a combination of the general/specialized expression and the general expression, or a combination of the general/specialized expression and the specialized expression, the second classification unit reclassifies the character string before division from the specialized expression into the general expression.

In the automatic character string classification device according to claim 1,
The automatic character string classification device, wherein the character string extraction means extracts character strings of all lengths starting from all characters included in a document.

An automatic character string classification method comprising:
document storage processing for storing a plurality of documents written in a natural language;
character string extraction processing for extracting a character string from an arbitrary document among the plurality of documents;
in-document importance calculation processing for calculating, as an in-document importance, the importance of the character string extracted by the character string extraction processing within the document from which it was extracted;
inter-document importance calculation processing for calculating, as an inter-document importance, the importance of the character string extracted by the character string extraction processing across the plurality of documents as a whole;
character string importance calculation processing for calculating, as a character string importance, the importance of the extracted character string based on the in-document importance and the inter-document importance; and
character string classification processing for classifying the extracted character string based on the character string importance obtained by the character string importance calculation processing,
wherein the character string classification processing includes:
first classification processing that compares, for all the extracted character strings, the character string importance assigned to each character string with a predetermined threshold, and classifies all the extracted character strings into general expressions, which are used regardless of a specific field, or specialized expressions, which are used in a specific field; and
second classification processing that divides a character string classified as a specialized expression in the first classification processing and, when each component of the divided character string exists in the first classification result, reclassifies the character string before division, according to the combination of the components of the divided character string, from the specialized expression into a general/specialized expression, which is a specialized expression used together with a general expression, or into a general expression.