JP4253483B2

JP4253483B2 - Different notation dictionary creation device, different notation dictionary creation method, and program for causing computer to execute the method

Info

Publication number: JP4253483B2
Application number: JP2002274708A
Authority: JP
Inventors: 裕一小島
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-09-20
Filing date: 2002-09-20
Publication date: 2009-04-15
Anticipated expiration: 2022-09-20
Also published as: JP2004110633A

Description

【０００１】
【発明の属する技術分野】
本願発明は、異表記辞書作成装置および異表記辞書作成方法およびその方法をコンピュータに実行させるためのプログラムに係り、基本辞書を自動的に拡充する異表記辞書作成装置および異表記辞書作成方法およびその方法をコンピュータに実行させるためのプログラムに関する。
【０００２】
【従来の技術】
文書を自動的に分類し、分類の結果を使ってさらに文書を処理（言語処理ともいう）する場合、異表記された語句に対する扱いが問題になる。異表記とは、同じ意味の語句を異なる表記のし方で表すことをいい、例えば日本語においては「コンピュータ」と「コンピューター」、「取り扱い」と「取扱」などが異表記にあたる。
【０００３】
また、欧州言語においては、一つの単語について単数形、複数形、現在形、過去形などの何通りかの異なる表記が存在する。欧州言語の異なる表記はそれぞれ異なる概念を含む。このため、欧州言語の表記の相違は日本語の異表記と異なるが、言語処理における検索や文書分類において単なる異表記として扱う場合が多い。
【０００４】
単純に文字の配置だけを比較して語句を分類すると、例えば文書にある「コンピュータ」とコンピューターとは、別々のグループ（群）に分類されてしまい、以降の処理によっては不具合が生じる。また、「取り扱い」の語句を検索した場合に「取扱」の語句についてヒットしないことが不具合になることもある。
【０００５】
上記した異表記の問題に対応するための一つの方法として、カタカナによって表記される単語の異表記に的を絞り、所定のルールを定めて異表記の展開を行うものがある。ルールにより異表記を生成する処理は、ルールサイズを小さくすることが可能であり、処理が効率的にできる（例えば特許文献１参照）。
【０００６】
他の従来の技術として、キーワード検索用の異表記辞書を作成するものがある。この技術では、元ととなる辞書にある語句（エントリ）を対象とし、語句と類似度が一定値以上のものを異表記候補として異表記辞書を作成している（例えば特許文献２参照）。
【０００７】
【特許文献１】
特開平６−４４２９５号公報（請求項１）
【特許文献２】
特開平７−７３１９７号公報（請求項１）
【０００８】
【発明が解決しようとする課題】
しかしながら、ルールにより異表記を生成する処理は、ルールそのものの設計と整合性維持に手間がかかることが多く、個別の語句のレベルで、辞書的に処理を進めてしまう方がデータ作成、維持の点で効率的な場合も多い。
【０００９】
また、一般的に元の辞書にエントリが完備していることは多くない。例えば、元の辞書に「インターフェース」は存在していても「インタフェイス」は存在していないなど、辞書の語彙のそのものが問題となることがあった。
【００１０】
本特許は、上記した点に鑑みてなされたものであり、データの作成、維持の負担を軽減し、しかも自動的に異表記を拡充可能で、より多くの語彙を持つ異表記辞書を作成することができる異表記辞書作成装置および異表記辞書作成方法およびその方法をコンピュータに実行させるためのプログラムを提供することを目的とする。
【００１１】
【課題を解決するための手段】
上記目的を達成するため、請求項１にかかる異表記辞書作成装置は、基本辞書に含まれる語句の関連語を複数の文書から取得する関連語取得手段と、前記関連語取得手段によって取得された関連語と前記語句との類似度を算出し、前記関連語が前記語句に類似しているか否か判定する類似度判定手段と、前記類似度判定手段によって前記関連語が前記語句に類似していると判定された場合、該関連語を前記基本辞書に追加する関連語追加手段と、前記関連語および前記語句を表記する表記文字を置き換えて示す置き換え表記文字を蓄積する置き換え表記文字蓄積手段と、を備え、前記関連語取得手段は、複数の前記文書において前記語句を検索することにより、前記語句を含む第一の文書群を取得する検索手段と、前記第一の文書群からキーワードを抽出するキーワード抽出手段と、を備え、前記検索手段は、さらに、前記キーワード抽出手段によって抽出されたキーワードを含み、かつ、前記語句を含まない第二の文書群を複数の前記文書から取得し、前記キーワード抽出手段は、さらに、前記第二の文書群からキーワードを抽出し、該キーワードを関連語とし、前記類似度判定手段は、前記関連語と前記語句との類似度を、前記語句の少なくとも一部の文字を前記置き換え表記文字によって置き換えた表記と前記関連語の表記との一致によって判定することを特徴とする。
【００１２】
この請求項１に記載の発明によれば、基本辞書に含まれる語句の関連語を抽出し、語句との類似度を算出する。そして、関連後が語句に類似していると判定された場合には、この関連後を基本辞書に追加する。このため、辞書にある語句に関する語句のうち、さらに語句と類似性を持つ語句（異表記）を辞書に追加することができるので、より多くの語彙を持つ異表記辞書を作成することができる。また、この処理を自動的にできるので、異表記辞書のデータの作成、維持の負担を軽減することが可能である。
【００１４】
この請求項１に記載の発明によれば、さらに、複数の文書を検索することにより文書の少なくとも一部を含む文書群から複数のキーワードを抽出し、抽出されたキーワード基づいて関連語を取得するので、少ない検索回数で効率的に関連語を取得し、ひいては異表記辞書の拡充ができる。
【００１８】
この請求項１に記載の発明によれば、さらに、関連語と語句との類似度を、語句の少なくとも一部の文字を置き換え表記文字によって置き換えた表記と前記語句との一致によって判定するので、語句に関するさらに多くの関連語の類似性を判断し、類似した関連語を異表記辞書に登録することができる。
【００２３】
請求項２に記載の発明にかかる異表記辞書作成方法は、異表記辞書作成装置における異表記辞書作成方法であって、関連語取得手段が、基本辞書に含まれる語句の関連語を複数の文書から取得する関連語取得ステップと、類似度判定手段が、前記関連語取得ステップにおいて取得された関連語と前記語句との類似度を算出し、前記関連語が前記語句に類似しているか否か判定する類似度判定ステップと、関連語追加手段が、前記類似度判定ステップにおいて前記関連語が前記語句に類似していると判定された場合、該関連語を前記基本辞書に追加する関連語追加ステップと、表記文字蓄積手段が、前記関連語および前記語句を表記する表記文字を置き換えて示す置き換え表記文字を蓄積する置き換え表記文字蓄積ステップと、を含み、前記関連語取得ステップは、検索手段が、複数の前記文書において前記語句を検索することにより、前記語句を含む第一の文書群を取得する第一の検索ステップと、キーワード抽出手段が、前記第一の文書群からキーワードを抽出する第一のキーワード抽出ステップと、前記検索手段が、前記第一のキーワード抽出ステップにおいて抽出されたキーワードを含み、かつ、前記語句を含まない第二の文書群を複数の前記文書から取得する第二の検索ステップと、前記キーワード抽出手段が、前記第二の文書群からキーワードを抽出し、該キーワードを関連語とする第二のキーワード抽出ステップと、を含み、前記類似度判定ステップにおいて、前記類似度判定手段が、前記関連語と前記語句との類似度が、前記語句の少なくとも一部の文字を前記置き換え表記文字によって置き換えた表記と前記関連語の表記との一致によって判定されることを特徴とする。
【００２４】
この請求項２に記載の発明によれば、基本辞書に含まれる語句の関連語を抽出し、語句との類似度を算出する。そして、関連後が語句に類似していると判定された場合には、この関連後を基本辞書に追加する。このため、辞書にある語句に関する語句のうち、さらに語句と類似性を持つ語句（異表記）を辞書に追加することができるので、より多くの語彙を持つ異表記辞書を作成することができる。また、この処理を自動的にできるので、異表記辞書のデータの作成、維持の負担を軽減することが可能である。
【００２６】
この請求項２に記載の発明によれば、さらに、複数の文書を検索することにより文書の少なくとも一部を含む文書群から複数のキーワードを抽出し、抽出されたキーワード基づいて関連語を取得するので、少ない検索回数で効率的に関連語を取得し、ひいては異表記辞書の拡充ができる。
【００２９】
請求項３に記載の発明にかかるプログラムは、コンピュータに、前記請求項２に記載の異表記辞書作成方法を実行させることを特徴とするものである。
【００３０】
この請求項３に記載の発明によれば、コンピュータに、前記請求項２に記載の異表記辞書作成方法を実行させるプログラムを提供することができる。
【００３１】
【発明の実施の形態】
（実施の形態１）
図１は、本発明の実施の形態１の異表記辞書作成装置の構成を説明するための機能ブロック図である。実施の形態１の異表記辞書作成装置は、基本辞書に含まれる語句（実施の形態１では単語とし、この単語を一定の表記方法で表記された単語とし、表記単語Ｗと記す）を複数の文書に対照し、対照された表記単語Ｗに基づいて複数の文書における表記単語Ｗの関連語を取得する関連語取得部１０３、キーワード抽出部１０５、検索部１０９と、取得された関連語と表記単語Ｗとの類似度を算出し、関連語が表記単語Ｗに類似しているか否か判定する異表記フィルタ１１５とを備えている。また、異表記フィルタ１１５は、関連語が表記単語Ｗに類似していると判定した場合、この関連語を異表記辞書１０１に追加する関連語追加手段として機能する。
【００３２】
実施の形態１の異表記辞書作成装置は、異表記辞書１０１を基本辞書として関連語を拡充するものである。また、表記単語Ｗを対照する複数の文書を蓄積した文書ＤＢ１１１を備えている。
【００３３】
さらに、実施の形態１の異表記辞書作成装置は、検索部１０９の検索の結果得られる文書群を格納しておく検索結果文書群格納部１０７、後述する同一視辞書１１３を備えている。
【００３４】
図１に示した構成において、検索部１０９は、表記単語Ｗを用いて文書ＤＢ１１１に蓄積された複数の文書を検索する。そして、文書の少なくとも一部を含む文書群を取得する。キーワード抽出部１０５は、文書群から複数のキーワードを抽出し、抽出された複数のキーワードから表記単語Ｗを含まず、かつ、抽出された複数のキーワードを含むことを条件にして文書ＤＢ１１１に蓄積された複数の文書を検索する。さらに、検索の結果得られた文書からこの文書を表すキーワードを抽出し、抽出されたキーワードを関連語とする。
【００３５】
すなわち、文書ＤＢ（データベース）１１１は、予め大量に文書を格納したＤＢである。格納された文書は、品詞や単語分割等の処理を施されている必要は無く、一般的なテキストファイルでもよい。表１に異表記辞書１０１にある表記とその標準化表記との例を示す。表１に示すように、異表記辞書１０１では、英単語については原形を標準化表記といい、活用形を対応する英単語の異表記とみなす。
【表１】

【００３６】
キーワード抽出部１０５は、検索結果文書群格納部１０７に格納されたテキスト・データを対象にしてテキストデータの特徴を示すキーワードを抽出する。キーワード候補の抽出は、例えば、周知の技術であるＴＤ（ｔｅｒｍｆｒｅｑｕｅｎｃｙ）・ＩＤＦ（ｉｎｖｅｒｓｅｄｏｃｕｍｅｎｔｆｒｅｑｕｅｎｃｙ）の値（重み）を用いる手法によって実現することができる。以下、キーワードの抽出の具体的な方法について説明する。
【００３７】
関連語取得部１０３は、まず、異表記辞書１０１にある単語から１つの表記単語Ｗを取り出し、取り出された表記単語Ｗを検索部１０９に送る。検索部１０９は、与えられた表記単語Ｗに基づいて文書ＤＢ１１１を検索する。そして、文書ＤＢ１１１に蓄積されている文書に対して表記単語Ｗを検索する。そして、検索の結果得られた表記単語Ｗを含む文書に含まれるテキスト・データを検索結果文書群格納部１０７に格納する。なお、本実施の形態は、検索の具体的な方法として単純な文字列一致を用いる。
【００３８】
ＴＤ・ＩＤＦを用いてキーワードを抽出する場合、次に、キーワード抽出部１０５は、文書群の中から抽出したキーワードをキーワード候補として検索部１０９に送る。検索部１０９は、文書ＤＢ１１１に蓄積されている文書をキーワード候補で検索し、ヒットした文書数（ヒット文書数）を得る。そして、ヒット文書数を用い、以下の式によってＴＤ・ＩＤＦの値を算出する。
Ａｎ×ｌｏｇ（文書ＤＢ内文書総数／ヒット文書数）
Ａｎ：文書ＤＢ１１１内のキーワード候補ｎの出現数 …式（１）
【００３９】
なお、キーワード候補を得るためには、対象となるテキスト・データを単語単位に分割する必要がある。実施の形態１では、簡単のため、表２に示す単語分割規則にしたがってテキスト・データを単語単位に分割するものとした。
【表２】

【００４０】
キーワード抽出部１０５は、ＴＦ・ＩＤＦの値を基づいてキーワード候補を順位付けし、上位５位までをキーワードとして関連語取得部１０３に出力する。また、キーワード抽出部１０５は、関連語取得部１０３からの問い合わせに対してキーワードを返す。
【００４１】
表３は、関連語取得部１０３からの問い合わせと、問い合わせに対して返されるキーワードとの一例を示すものである。関連語取得部１０３は、得られたキーワードを検査する。そして、キーワードに表記単語Ｗが存在した場合、これを除外する。
【表３】

【００４２】
次に、関連語取得部１０３は、検索部１０９に対してキーワード群を含み、かつ異表記辞書１０１にあった元々の単語表記を含まないことを条件にして検索を実施する。検索部１０９は、文書ＤＢ１１１に蓄積された文書をキーワード群を含み、かつ異表記辞書１０１にあった元々の単語表記を含まないことを条件にして検索する。この結果得られた文書群からキーワードを抽出し、検索に使用したキーワード群を除外した単語群が、表記単語Ｗに対する関連語となる。関連語取得部１０３は、関連語を関連語データＤとして格納する。
【００４３】
次に、異表記フィルタ１１５は、得られた関連語データＤと、表記単語Ｗとの類似度を計算し、類似度０．８以上の単語を異表記データとして異表記辞書１０１に登録する。実施の形態１では、異表記フィルタ１１５が、関連語データＤと表記単語Ｗとの類似度を、関連語を表記する文字と表記単語Ｗを表記する文字との一致によって判定する。より具体的には、表記文字の一致（類似度）は、以下の式によって算出される。なお、以下の式において、元表記とは異表記辞書１０１に元々あった単語の表記である。また、評価先表記とは、元表記との類似度を評価すべき単語の表記である。
類似度＝｛評価先表記に出現した元表記中の文字（同一視文字）の出現数＋元表記に出現した評価先表記中の文字（同一視文字）の出現数｝／｛評価先表記の文字長＋元表記の文字長｝ …式（２）
【００４４】
さらに、実施の形態１の異表記辞書作成装置は、関連語および表記単語Ｗを表記する表記文字を置き換えて示す置き換え表記文字を蓄積する置き換え表記文字蓄積手段である同一視辞書１１３をさらに備えている。異表記フィルタ１１５は、関連語と表記単語Ｗとの類似度を、表記単語Ｗの少なくとも一部の文字を同一視辞書１１３にある置き換え表記文字によって置き換えた表記と表記単語Ｗとの一致によって判定する。
【００４５】
表４は、同一視辞書１１３の一例を示している。表４に示した同一視辞書１１３は、アルファベットの大文字、小文字、全角、半角を同一視すること、漢字とその書き下しひらがなとを同一視することを示している。異表記フィルタ１１５は、同一視辞書１１３を参照し、元表記、評価先表記の表記の仕方を変えた（置き換えた）同一の単語についても類似度を算出する。このとき、文字の出現数は、置き換え後の文字列について文字数をカウントする。
【表４】

【００４６】
たとえば表記単語Ｗ「回覧」と関連語「回らん」の類似度は、以下のように計算される。すなわち、「回覧」のうち、漢字の書き下しひらがなとは同一視辞書１１３によって同一視されることから、「回らん」や「かいらん」は、「回覧」と同一視される。したがって、評価先表記に出現した元表記中の文字は回、ら、ん、の３文字である。また、元表記に出現した文字であって、評価先表記に出現した文字は回の一つである。そして、評価先の文字長が３文字、元表記の文字長が２文字であるから、「回覧」と「回らん」との類似度は、以下のように算出される。実施の形態１では、類似度が０．８以上であることから、「回らん」は「回覧」の異表記と判定される。
類似度＝｛３＋１｝／（３＋２）＝０．８
【００４７】
また、元表記「インターフェース」と評価先表記「インタフェイス」の類似度を計算する。この場合、評価先表記に出現した元表記中の文字は、「イ」、「ン」、「タ」、「フ」、「ェ」、「ス」、の６文字である。また、元表記に出現した文字であって、評価先表記に出現した文字も、「イ」、「ン」、「タ」、「フ」、「ェ」、「ス」、の６文字である。そして、評価先の文字長が７文字、元表記の文字長が８文字であるから、「インターフェース」と「インタフェイス」との類似度は、以下のように算出される。実施の形態１では、類似度が０．８以上であることから、「インタフェイス」は「インターフェース」の異表記と判定される。
類似度＝｛６＋６｝／１５＝０．８
【００４８】
同様に、元表記「ｃｏｌｏｒ」と評価先表記「ｃｏｌｏｕｒ」の類似度を計算すると、以下のようになり、「ｃｏｌｏｕｒ」は「ｃｏｌｏｒ」の異表記と判定される。
類似度＝｛５＋５｝／１１＝０．９１
【００４９】
表５は、算出された類似度の大きさによってそれぞれ「回覧」「インターフェース」「ｃｏｌｏｒ」の異表記と判定された単語を関連語として異表記辞書１０１に追加した例を示すものである。
【表５】

【００５０】
なお、本発明の異表記辞書作成装置は、文書ＤＢ１１１を、ネットワーク上におき、インターネット上の文書ＤＢ１１１に蓄積されている複数の文書に表記単語を対照することも可能である。この際、文書ＤＢ１１１の情報の取得は、例えばＷＷＷによって可能になる。なお、文書ＤＢ１１１をネットワーク上においた場合、ＤＢの総文書数を得ることが困難である。このため、式（１）における総文書数は、検索部１０９（ネットワーク上の検索エンジンを用いてもよい）ごとに固有な、総文書数推定値をあらかじめ与えておいてもよい。
【００５１】
図２は、以上述べた実施の形態１の異表記辞書作成装置で行われる異表記辞書作成方法を説明するためのフローチャートである。関連語取得部１０３は、表記単語Ｗを入力し（ステップＳ２０１）、表記単語で文書ＤＢ１１１に蓄積されている文書を検索する（ステップＳ２０２）。そして、検索の結果表記単語Ｗを含む文書群を取得する（ステップＳ２０３）。取得した文書群は、検索結果文書群格納部１０７に格納される。
【００５２】
次に、キーワード抽出部１０５は、文書群からキーワードを抽出し（ステップＳ２０４）、キーワードが表記単語Ｗを含むか否か判断する（ステップＳ２０５）。表記単語Ｗが含まれていた場合（ステップＳ２０５：Ｙｅｓ）、抽出されたキーワードから表記単語Ｗを除外し、キーワードを含み、また、表記単語Ｗを含まないという条件で文書群を検索する。そして、得られた文書群のキーワードを抽出し、抽出されたキーワードを関連語とする（ステップＳ２０６）。
【００５３】
関連語の関連語データＤは、異表記フィルタ１１５に送られる。異表記フィルタ１１５は、関連語と表記単語Ｗとの類似度を算出する（ステップＳ２０７）。そして、算出された類似度が一定の値以上の関連語を「類似」と判定し（ステップＳ２０８）、この関連語が表記単語Ｗの異表記であるとして異表記辞書１０１に追加する（ステップＳ２０９）。そして、類似度を判定すべき関連語の類似度判定がすべて終了したか否か判断し（ステップＳ２１０）、終了していない場合には（ステップＳ２１０：Ｎｏ）、次に処理すべき関連語の類似度を算出する。また、関連語の類似度の判定がすべて終了した場合（ステップＳ２１０：Ｙｅｓ）、処理を終了する。
【００５４】
以上述べた実施の形態１の異表記辞書作成装置は、異表記辞書の作成を、文書からの関連語の取得と、その中からの異表記の選別という２段階のプロセスによって行うことにより、異表記辞書１０１の拡充を高い信頼度で自動的に実行することが可能になる。また、自動処理にしたことにより、文書ＤＢ１１１を大規模にすることが容易になり、人手によって語彙を拡充する場合にくらべ、漏れのない拡充を実施することが可能になる。
【００５５】
また、実施の形態１の異表記辞書作成装置は、関連語取得を文書検索とキーワード抽出で行うことにより、検索の実行回数を、キーワードの数の範囲内に収めることが可能になり、比較的少ない検索回数で、効率的に関連語取得を行うことが可能になる。
【００５６】
また、実施の形態１の異表記辞書作成装置は、すでに関連語という範囲で選別が済んだ語群を対照に異表記を抽出するため、異表記の抽出を簡便な上に効果的に実施することが可能となる。また、同一視辞書１１３を持つことにより、単純な文字の一致だけでなく、大文字、小文字、半角、全角などの異文字種の対応や、ひらがな書き下しなどの対応を構成文字一致数という簡単な枠組みに取り込むことが可能になる。さらに、文書ＤＢ１１１は特に品詞等のタグ情報を付与されていることを必要としないため、文書ＤＢ１１１としてＷＷＷ上の文書群を用いることが可能であり、これにより、独自に大量の文書を用意することなく、異表記辞書の生成が可能となる。
【００５７】
（実施の形態２）
図３は、実施の形態２の異表記辞書作成装置を説明するための機能ブロック図である。なお、図３に示した異表記辞書作成装置は図１に示した異表記辞書作成装置と同様の構成を含んでいる。このため、実施の形態２の異表記辞書作成装置において実施の形態１の異表記辞書作成装置と同様の構成については同様の符号を付し、説明の一部を略すものとする。
【００５８】
実施の形態２の異表記辞書作成装置は、実施の形態１の異表記辞書作成装置と同様に、異表記辞書１０１を基本辞書とし、異表記辞書１０１を拡充する。このため、実施の形態２の異表記辞書作成装置は、複数の辞書３０５ａ〜３０５ｄ、異表記辞書１０１に含まれる語句である表記単語Ｗを複数の辞書３０５ａ〜３０５ｄに対照し、表記単語Ｗを辞書３０５ａ〜３０５ｄによって対訳する対訳手段である辞書問合せ部３０３、辞書問合せ部３０３による対訳によって得られた訳文から表記単語Ｗに関する関連語を取得する関連語取得部１０３、関連語取得部１０３によって取得された関連語と表記単語Ｗとの類似度を算出し、関連語が表記単語Ｗに類似しているか否か判定する異表記フィルタ１１５を備えている。また、異表記フィルタ１１５は、関連語が表記単語Ｗに類似していると判定した場合、この関連語を異表記辞書１０１に追加する関連語追加手段として機能する。
【００５９】
すなわち、辞書問合せ部３０３は、関連語取得部１０３より表記単語Ｗを受け取る。そして、表記単語Ｗを、複数の対訳辞書３０５ａ〜３０５ｄに対照する（問い合わせる）。そして、各辞書で得られる結果を返す。表６は、辞書問合せ部３０３の問い合わせによって得られる関連語を示す。表６に示す関連語は、関連語データＤとして異表記フィルタ１１５に渡される。実施の形態２では、表記単語Ｗの文字数と関連語データＤの文字数との一致によって両者の類似性を判断する。
【表６】

【００６０】
このため、異表記フィルタ１１５は、表記単語Ｗの文字数と関連語データＤの文字数とをチェックする。そして、例えば表記単語Ｗである「計算機」と文字数が一致する「計算者」「計算器」「計算表」「電算機」を「類似」と判定し、異表記として抽出し、異表記辞書１０１に登録する。抽出された異表記の候補にノイズが残ることがあるため、本実施例では異表記辞書１０１への登録前に、提示・修正部を設け、人手による修正を行う構成とした。
【００６１】
図４は、以上述べた実施の形態２の異表記辞書作成装置で行われる異表記辞書作成方法を説明するためのフローチャートである。関連語取得部１０３は、表記単語Ｗを入力し（ステップＳ４０１）、複数の辞書のうちのいずれかに対照して表記単語Ｗを対訳（問い合わせ）する（ステップＳ４０２）。そして、複数の辞書の全てに対して対訳の処理がなされたか否か判断し（ステップＳ４０３）、未だ対訳に用いられていない辞書があれば（ステップＳ４０３：Ｎｏ）、この辞書によって表記単語Ｗを対訳（問い合わせ）する。
【００６２】
また、複数の辞書のすべてにおいて表記単語Ｗをの対訳が終了した場合（ステップＳ４０３：Ｙｅｓ）、関連語取得部１０３が対訳の結果得られた単語を関連語として取得する（ステップＳ４０４）。異表記フィルタ１１５は、関連語と表記単語Ｗとの例えば文字数の一致によって関連語の表記単語Ｗに対する類似度を算出し（ステップＳ４０５）、算出された類似度によって関連語が表記単語Ｗに類似するものか否か判断する（ステップＳ４０６）。判断の結果、関連語が表記単語Ｗに類似する場合（ステップＳ４０６：Ｙｅｓ）、この関連語が表記単語Ｗの異表記であるとして異表記辞書１０１に追加する（ステップＳ４０７）。
【００６３】
また、異表記フィルタ１１５は、ステップＳ４０６において関連語が表記単語Ｗに類似していないと判断した場合（ステップＳ４０６：Ｎｏ）、異表記辞書１０１にこの関連語を追加せずに類似性を判断すべき関連語のすべてについて処理を終了したか否か判断する（ステップＳ４０８）。判断の結果、処理が終了した場合には（ステップＳ４０８：Ｙｅｓ）、処理を終了する。また、類似性を判断すべき関連語の処理が未だ終了していない場合（ステップＳ４０８：Ｎｏ）、次の関連語と表記単語Ｗとの類似性を判断する。
【００６４】
以上述べたように、実施の形態２の異表記辞書作成装置は、関連語取得の方法として複数の対訳辞書を用いることによって、より簡便な構成で関連語を得ることが可能になる。
【００６５】
なお、実施の形態１、実施の形態２の異表記辞書作成方法は、インストール可能な形式又は実行可能な形式のファイルでＣＤ−ＲＯＭ、フロッピー（Ｒ）ディスク（ＦＤ）、ＤＶＤ等のコンピュータで読み取り可能な記録媒体に記録されて提供される。また、実施の形態１、実施の形態２の異表記辞書作成方法をコンピュータに実行させるためのプログラムをインターネット等のネットワークに接続されたコンピュータ上に格納し、ネットワーク経由でダウンロードさせることにより提供するように構成しても良い。
【００６６】
【発明の効果】
以上説明したように、請求項１に記載の発明は、辞書にある語句に関する語句のうち、さらに語句と類似性を持つ語句（異表記）を自動的に辞書に追加することができるので、より多くの語彙を持つ異表記辞書を作成することができ、さらに異表記辞書のデータの作成、維持の負担を軽減する異表記辞書作成装置を提供することができるという効果を奏する。
【００６７】
本発明の実施の形態によれば、少ない検索回数で効率的に関連語を取得し、ひいては異表記辞書の拡充ができる異表記辞書作成装置を提供することができるという効果を奏する。
【００６８】
本発明の実施の形態によれば、また、関連語と語句との類似度を簡易に判定する異表記辞書作成装置を提供することができるという効果を奏する。
【００６９】
本発明の実施の形態によれば、また、語句に関するさらに多くの関連語の類似性を判断し、類似した関連語を異表記辞書に登録する異表記辞書作成装置を提供することができるという効果を奏する。
【００７０】
本発明の実施の形態によれば、また、文書を蓄積するＤＢを自身が持つ必要がなくなって構成を小型、簡易にする異表記辞書作成装置を提供することができるという効果を奏する。
【００７１】
本発明の実施の形態によれば、また、大規模な文書蓄積手段と文書蓄積手段に対する検索を用いない、簡便な方法により関連語を取得する異表記辞書作成装置を提供することができるという効果を奏する。
【００７２】
本発明の実施の形態によれば、また、辞書にある語句に関する語句のうち、さらに語句と類似性を持つ語句（異表記）を自動的に辞書に追加することができるので、より多くの語彙を持つ異表記辞書を作成することができ、さらに異表記辞書のデータの作成、維持の負担を軽減する異表記辞書作成方法を提供できるという効果を奏する。
【００７３】
本発明の実施の形態によれば、また、少ない検索回数で効率的に関連語を取得し、ひいては異表記辞書の拡充ができる異表記辞書作成方法を提供できるという効果を奏する。
【００７４】
本発明の実施の形態によれば、また、大規模な文書蓄積手段に対する検索を用いない、簡便な方法により関連語を取得する異表記辞書作成方法を提供できるという効果を奏する。
【００７５】
本発明の実施の形態によれば、また、コンピュータに、前記請求項７〜９のいずれか一つに記載の異表記辞書作成方法を実行させるプログラムを提供することができるという効果を奏する。
【図面の簡単な説明】
【図１】本発明の実施の形態１の異表記辞書作成装置の構成を説明するための機能ブロック図である。
【図２】実施の形態１の異表記辞書作成装置で行われる異表記辞書作成方法を説明するためのフローチャートである。
【図３】実施の形態２の異表記辞書作成装置を説明するための機能ブロック図である。
【図４】実施の形態２の異表記辞書作成装置で行われる異表記辞書作成方法を説明するためのフローチャートである。
【符号の説明】
１０１異表記辞書
１０３関連語取得部
１０５キーワード抽出部
１０７検索結果文書群格納部
１０９検索部
１１１文書ＤＢ
１１３同一視辞書
１１５異表記フィルタ
３０３辞書問合せ部
３０５ａ〜３０５ｄ対訳辞書
Ｄ関連語データ
Ｗ表記単語
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a different notation dictionary creating apparatus, a different notation dictionary creating method, and a program for causing a computer to execute the method, and in particular to an apparatus, method, and program of this kind that automatically expand a basic dictionary.
[0002]
[Prior art]
When documents are automatically classified and then processed further using the classification result (also referred to as language processing), the handling of phrases written in different notations becomes a problem. Different notation means writing a phrase of the same meaning in different ways. In Japanese, for example, 「コンピュータ」 and 「コンピューター」 (both “computer”), and 「取り扱い」 and 「取扱」 (both “handling”), are different notations of each other.
[0003]
In European languages, a single word has several different notations, such as singular, plural, present tense, and past tense forms. Each of these notations carries a different concept, so the variation differs in nature from different notation in Japanese. Nevertheless, in search and document classification within language processing, such forms are often treated simply as different notations.
[0004]
If phrases are classified simply by comparing their character sequences, then, for example, 「コンピュータ」 and 「コンピューター」 in a document are placed in separate groups, which can cause problems in subsequent processing. Likewise, it can be a problem that a search for 「取り扱い」 does not hit 「取扱」.
[0005]
One method for dealing with the above problem focuses on different notations of words written in katakana and expands them according to predetermined rules. Generating different notations by rule allows the rule set to be kept small and the processing to be efficient (see, for example, Patent Document 1).
[0006]
Another conventional technique creates a different notation dictionary for keyword search. This technique targets the phrases (entries) in an original dictionary and creates the different notation dictionary by taking, as different notation candidates, those whose similarity to a phrase is at or above a fixed value (see, for example, Patent Document 2).
[0007]
[Patent Document 1]
JP-A-6-44295 (Claim 1)
[Patent Document 2]
JP-A-7-73197 (Claim 1)
[0008]
[Problems to be solved by the invention]
However, generating different notations by rules often requires considerable effort to design the rules themselves and keep them consistent. In many cases it is more efficient, in terms of creating and maintaining data, to proceed dictionary-style at the level of individual phrases.
[0009]
In addition, the original dictionary is rarely complete in its entries. For example, 「インターフェース」 may exist in the original dictionary while 「インタフェイス」 (both “interface”) does not, so the vocabulary of the dictionary itself can be the problem.
[0010]
The present invention has been made in view of the above points. Its object is to provide a different notation dictionary creating apparatus, a different notation dictionary creating method, and a program for causing a computer to execute the method that reduce the burden of creating and maintaining data, expand different notations automatically, and can create a different notation dictionary with a larger vocabulary.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, the different notation dictionary creating apparatus according to claim 1 comprises: related word acquisition means for acquiring, from a plurality of documents, related words of a phrase included in a basic dictionary; similarity determination means for calculating the similarity between a related word acquired by the related word acquisition means and the phrase, and determining whether or not the related word is similar to the phrase; related word adding means for adding the related word to the basic dictionary when the similarity determination means determines that the related word is similar to the phrase; and replacement notation character storage means for storing replacement notation characters that stand in for the notation characters used to write the related word and the phrase. The related word acquisition means comprises search means for acquiring a first document group containing the phrase by searching the plurality of documents for the phrase, and keyword extraction means for extracting keywords from the first document group. The search means further acquires, from the plurality of documents, a second document group that contains the keywords extracted by the keyword extraction means and does not contain the phrase, and the keyword extraction means further extracts keywords from the second document group and takes them as related words. The similarity determination means determines the similarity between the related word and the phrase by a match between a notation obtained by replacing at least some characters of the phrase with the replacement notation characters and the notation of the related word.
[0012]
According to the invention of claim 1, related words of a phrase included in the basic dictionary are extracted and their similarity to the phrase is calculated. When a related word is determined to be similar to the phrase, it is added to the basic dictionary. Phrases that are related to a phrase in the dictionary and are also similar to it (different notations) can thus be added to the dictionary, so a different notation dictionary with a larger vocabulary can be created. Moreover, because this processing can be performed automatically, the burden of creating and maintaining the data of the different notation dictionary can be reduced.
[0014]
According to the invention of claim 1, furthermore, a plurality of keywords are extracted from a document group containing at least some of the documents found by searching the plurality of documents, and related words are acquired based on the extracted keywords. Related words can therefore be acquired efficiently with a small number of searches, and the different notation dictionary can be expanded accordingly.
[0018]
According to the invention of claim 1, furthermore, the similarity between the related word and the phrase is determined by a match between a notation obtained by replacing at least some characters of the phrase with the replacement notation characters and the notation of the related word. The similarity of a larger number of related words to the phrase can therefore be judged, and similar related words can be registered in the different notation dictionary.
[0023]
The different notation dictionary creating method according to the invention of claim 2 is a method performed in a different notation dictionary creating apparatus, and includes: a related word acquisition step in which related word acquisition means acquires, from a plurality of documents, related words of a phrase included in a basic dictionary; a similarity determination step in which similarity determination means calculates the similarity between a related word acquired in the related word acquisition step and the phrase, and determines whether or not the related word is similar to the phrase; a related word adding step in which related word adding means adds the related word to the basic dictionary when the related word is determined in the similarity determination step to be similar to the phrase; and a replacement notation character storage step in which notation character storage means stores replacement notation characters that stand in for the notation characters used to write the related word and the phrase. The related word acquisition step includes: a first search step in which search means acquires a first document group containing the phrase by searching the plurality of documents for the phrase; a first keyword extraction step in which keyword extraction means extracts keywords from the first document group; a second search step in which the search means acquires, from the plurality of documents, a second document group that contains the keywords extracted in the first keyword extraction step and does not contain the phrase; and a second keyword extraction step in which the keyword extraction means extracts keywords from the second document group and takes them as related words. In the similarity determination step, the similarity determination means determines the similarity between the related word and the phrase by a match between a notation obtained by replacing at least some characters of the phrase with the replacement notation characters and the notation of the related word.
[0024]
According to the invention of claim 2, related words of a phrase included in the basic dictionary are extracted and their similarity to the phrase is calculated. When a related word is determined to be similar to the phrase, it is added to the basic dictionary. Phrases that are related to a phrase in the dictionary and are also similar to it (different notations) can thus be added to the dictionary, so a different notation dictionary with a larger vocabulary can be created. Moreover, because this processing can be performed automatically, the burden of creating and maintaining the data of the different notation dictionary can be reduced.
[0026]
According to the invention of claim 2, furthermore, a plurality of keywords are extracted from a document group containing at least some of the documents found by searching the plurality of documents, and related words are acquired based on the extracted keywords. Related words can therefore be acquired efficiently with a small number of searches, and the different notation dictionary can be expanded accordingly.
[0029]
The program according to the invention of claim 3 causes a computer to execute the different notation dictionary creating method according to claim 2.
[0030]
According to the invention of claim 3, a program that causes a computer to execute the different notation dictionary creating method according to claim 2 can be provided.
[0031]
DETAILED DESCRIPTION OF THE INVENTION
(Embodiment 1)
FIG. 1 is a functional block diagram for explaining the configuration of the different notation dictionary creating apparatus according to the first embodiment of the present invention. The apparatus of the first embodiment includes a related word acquisition unit 103, a keyword extraction unit 105, and a search unit 109, which check a phrase included in the basic dictionary against a plurality of documents and acquire related words of that phrase from the documents. It also includes a different notation filter 115, which calculates the similarity between an acquired related word and the phrase and determines whether the related word is similar to it. In the first embodiment the phrase is a word written in one fixed notation and is referred to as the written word W. The different notation filter 115 also functions as related word adding means that adds a related word to the different notation dictionary 101 when it determines that the related word is similar to the written word W.
[0032]
The different notation dictionary creating apparatus of the first embodiment expands related words using the different notation dictionary 101 as the basic dictionary. It also includes a document DB 111 that stores the plurality of documents against which the written word W is checked.
[0033]
Further, the different notation dictionary creating apparatus of the first embodiment includes a search result document group storage unit 107 for storing a document group obtained as a result of the search by the search unit 109, and an identification dictionary 113 to be described later.
[0034]
In the configuration shown in FIG. 1, the search unit 109 searches the plurality of documents stored in the document DB 111 using the written word W and acquires a document group containing at least some of those documents. The keyword extraction unit 105 extracts a plurality of keywords from the document group. The documents stored in the document DB 111 are then searched on the condition that a document contains the extracted keywords and does not contain the written word W. Keywords representing the documents obtained by this search are extracted in turn, and these extracted keywords are taken as related words.
[0035]
That is, the document DB (database) 111 is a DB in which a large number of documents are stored in advance. The stored documents need not have undergone processing such as part-of-speech tagging or word segmentation; ordinary text files suffice. Table 1 shows examples of notations in the different notation dictionary 101 and their standardized notations. As shown in Table 1, for English words the different notation dictionary 101 treats the base form as the standardized notation and the inflected forms as different notations of the corresponding word.
[Table 1]

[0036]
The keyword extraction unit 105 extracts keywords that characterize the text data stored in the search result document group storage unit 107. Keyword candidates can be extracted, for example, by the well-known technique of weighting with TF-IDF (term frequency, inverse document frequency) values. A specific method for extracting keywords is described below.
[0037]
The related word acquisition unit 103 first takes one written word W from the words in the different notation dictionary 101 and sends it to the search unit 109. The search unit 109 searches the documents stored in the document DB 111 for the given written word W, and stores the text data of the documents that contain the written word W in the search result document group storage unit 107. In the present embodiment, simple character string matching is used as the specific search method.
[0038]
When keywords are extracted using TF-IDF, the keyword extraction unit 105 next sends the keywords extracted from the document group to the search unit 109 as keyword candidates. The search unit 109 searches the documents stored in the document DB 111 for each keyword candidate and obtains the number of documents hit (the number of hit documents). The TF-IDF value is then calculated from the number of hit documents by the following formula.
A_n × log(total number of documents in the document DB / number of hit documents) … Formula (1)
A_n: number of occurrences of keyword candidate n in the document DB 111
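As a rough illustration of Formula (1), the weight can be computed as below. This is a minimal sketch under stated assumptions: the function names `tfidf_weight` and `top_keywords`, the shape of the `candidates` mapping, and the use of the natural logarithm are all choices made here, since the text does not specify a log base or any data structures.

```python
import math


def tfidf_weight(occurrences: int, total_docs: int, hit_docs: int) -> float:
    """Formula (1): A_n * log(total documents / hit documents).

    The natural logarithm is an assumption; the text gives no base.
    """
    if total_docs <= 0 or hit_docs <= 0:
        raise ValueError("document counts must be positive")
    return occurrences * math.log(total_docs / hit_docs)


def top_keywords(candidates: dict, total_docs: int, k: int = 5) -> list:
    """Rank candidates by weight and keep the k best.

    `candidates` (hypothetical shape) maps a keyword candidate to a pair
    (occurrences in the DB, number of hit documents).
    """
    scored = {
        word: tfidf_weight(occ, total_docs, hits)
        for word, (occ, hits) in candidates.items()
    }
    # Sort by descending weight, then alphabetically so ties are stable.
    return sorted(scored, key=lambda w: (-scored[w], w))[:k]
```

A candidate that occurs in every document gets weight zero, because log(1) = 0, so very common words fall to the bottom of the ranking.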
[0039]
In order to obtain keyword candidates, it is necessary to divide the target text data into words. In the first embodiment, for simplicity, the text data is divided into word units according to the word division rules shown in Table 2.
[Table 2]

[0040]
The keyword extraction unit 105 ranks the keyword candidates by their TF-IDF values and outputs the top five to the related word acquisition unit 103 as keywords. The keyword extraction unit 105 also returns keywords in response to an inquiry from the related word acquisition unit 103.
[0041]
Table 3 shows an example of an inquiry from the related word acquisition unit 103 and the keywords returned in response. The related word acquisition unit 103 inspects the returned keywords and, if the written word W is among them, excludes it.
[Table 3]

[0042]
Next, the related word acquisition unit 103 has the search unit 109 perform a search on the condition that a document contains the keyword group and does not contain the original word notation from the different notation dictionary 101. The search unit 109 searches the documents stored in the document DB 111 under this condition. Keywords are extracted from the resulting document group, and the words that remain after excluding the keyword group used for the search become the related words for the written word W. The related word acquisition unit 103 stores the related words as related word data D.
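The two-stage acquisition described above (search for the word, extract keywords, search again with the keywords while excluding the word, extract keywords again) can be sketched as follows. This is only an illustration under several assumptions: the document DB is a plain list of strings, words are split on whitespace instead of by the rules of Table 2, "contains the keyword group" is read as containing every keyword, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for the Table 2 rules.
    return text.split()


def search(docs, include, exclude=()):
    """Simple string matching: keep documents containing every `include`
    term and none of the `exclude` terms."""
    return [
        d for d in docs
        if all(w in d for w in include) and not any(w in d for w in exclude)
    ]


def extract_keywords(group, docs, k=5):
    """Rank the words of `group` by A_n * log(N / hits) and keep the top k."""
    candidates = {t for d in group for t in tokenize(d)}
    occurrences = Counter(t for d in docs for t in tokenize(d))

    def weight(term):
        hits = sum(1 for d in docs if term in d)  # >= 1: term came from docs
        return occurrences[term] * math.log(len(docs) / hits)

    return sorted(candidates, key=lambda t: (-weight(t), t))[:k]


def related_words(word, docs, k=5):
    first_group = search(docs, include=[word])
    keywords = [kw for kw in extract_keywords(first_group, docs, k) if kw != word]
    if not keywords:
        return []
    second_group = search(docs, include=keywords, exclude=[word])
    # Drop the keywords that were used for the second search.
    return [kw for kw in extract_keywords(second_group, docs, k) if kw not in keywords]
```

With a toy corpus such as `["interface usb", "interface usb", "interfase usb", "garden flower"]`, `related_words("interface", docs)` returns `["interfase"]`: the variant spelling is reached through the shared keyword even though it never co-occurs with the original word.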
[0043]
Next, the different notation filter 115 calculates the similarity between each related word in the obtained related word data D and the written word W, and registers words with a similarity of 0.8 or more in the different notation dictionary 101 as different notation data. In the first embodiment, the different notation filter 115 determines this similarity by the match between the characters used to write the related word and those used to write the written word W. More specifically, the degree of character match (the similarity) is calculated by the following formula. In the formula, the original notation is the notation of a word originally in the different notation dictionary 101, and the evaluation destination notation is the notation of the word whose similarity to the original notation is to be evaluated.
Similarity = {number of characters of the original notation (including identified characters) that appear in the evaluation destination notation + number of characters of the evaluation destination notation (including identified characters) that appear in the original notation} / {character length of the evaluation destination notation + character length of the original notation} … Formula (2)
[0044]
Furthermore, the different notation dictionary creating apparatus of the first embodiment includes the identification dictionary 113, which is replacement notation character storage means that stores replacement notation characters standing in for the notation characters used to write related words and the written word W. The different notation filter 115 determines the similarity between a related word and the written word W by a match between the notation obtained by replacing at least some characters of the written word W with replacement notation characters from the identification dictionary 113 and the notation of the related word.
[0045]
Table 4 shows an example of the identification dictionary 113. The identification dictionary 113 shown in Table 4 specifies that uppercase and lowercase letters, and full-width and half-width letters, of the alphabet are treated as identical, and that a kanji and its hiragana spelling are treated as identical. The different notation filter 115 refers to the identification dictionary 113 and also calculates the similarity for versions of the same word in which the original notation or the evaluation destination notation has been rewritten (replaced). In this case, the number of character occurrences is counted on the character string after replacement.
[Table 4]
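The identification described for the identification dictionary 113 can be sketched as a text-rewriting step. The following is a minimal sketch, not the device's actual implementation: the function name `identify` and the mapping `KANJI_TO_HIRAGANA` are hypothetical, the single kanji entry is an illustrative assumption, and Unicode NFKC normalization is used here as a stand-in for the full-width/half-width entries of the table.

```python
import unicodedata

# Hypothetical stand-in for the kanji-to-hiragana entries of the
# identification dictionary; this single entry is an assumption.
KANJI_TO_HIRAGANA = {"覧": "らん"}


def identify(text: str) -> str:
    """Rewrite text so that identified character variants become equal."""
    # NFKC folds full-width Latin letters and half-width katakana into
    # their ordinary forms; casefold() merges uppercase and lowercase.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Replace each kanji listed in the mapping with its hiragana spelling.
    return "".join(KANJI_TO_HIRAGANA.get(ch, ch) for ch in folded)
```

Applying a rewrite of this kind before counting characters means that variants differing only in character type are counted as matches.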

[0046]
For example, the similarity between the written word W “回覧” (kairan, “circulation”) and the related word “回らん” is calculated as follows. The identification dictionary 113 identifies a kanji character with its hiragana spelling, so “回覧” is identified with “回らん”. The characters of the original notation that appear in the evaluation destination notation are therefore the three characters “回”, “ら”, and “ん”, while the only character of the evaluation destination notation that appears in the original notation is the single character “回”. Since the evaluation destination notation is 3 characters long and the original notation is 2 characters long, the similarity between “回覧” and “回らん” is calculated as follows. In the first embodiment, since the similarity is 0.8 or more, “回らん” is determined to be a different notation of “回覧”.
Similarity = (3 + 1) / (3 + 2) = 0.8
[0047]
Likewise, the similarity between the original notation “インターフェース” and the evaluation destination notation “インタフェイス” (both meaning “interface”) is calculated. In this case, the characters of the original notation that appear in the evaluation destination notation are the six characters “イ”, “ン”, “タ”, “フ”, “ェ”, and “ス”. The characters of the evaluation destination notation that appear in the original notation are the same six characters. Since the evaluation destination notation is 7 characters long and the original notation is 8 characters long, the similarity between the two is calculated as follows. In the first embodiment, since the similarity is 0.8 or more, “インタフェイス” is determined to be a different notation of “インターフェース”.
Similarity = (6 + 6) / (7 + 8) = 0.8
[0048]
Similarly, the similarity between the original notation “color” and the evaluation destination notation “colour” is calculated as follows, and “colour” is determined to be a different notation of “color”.
Similarity = (5 + 5) / 11 ≈ 0.91
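The calculation of Formula (2) can be sketched in code as follows. This is a minimal sketch under stated assumptions rather than the device's actual implementation: the names `overlap`, `similarity`, and `is_different_notation` are hypothetical, each character is paired at most once (a multiset overlap), and the identification rewrite is applied only to the original notation in the first term. These choices were made because they reproduce the three worked figures (0.8, 0.8, and about 0.91) when the examples are taken to be 回覧/回らん, インターフェース/インタフェイス, and color/colour; the text itself does not fix the counting rule this precisely.

```python
from collections import Counter

THRESHOLD = 0.8  # similarity required by the first embodiment


def overlap(a: str, b: str) -> int:
    """Count characters of a that can be paired one-to-one with characters of b."""
    return sum((Counter(a) & Counter(b)).values())


def similarity(original: str, evaluation: str, identify=lambda s: s) -> float:
    """Formula (2), with `identify` as an optional rewrite of the original notation."""
    # First term: original notation after identification vs. evaluation notation.
    first = overlap(identify(original), evaluation)
    # Second term: evaluation notation vs. the original notation as written.
    second = overlap(evaluation, original)
    return (first + second) / (len(evaluation) + len(original))


def is_different_notation(original: str, evaluation: str, identify=lambda s: s) -> bool:
    """True when the similarity reaches the threshold."""
    return similarity(original, evaluation, identify) >= THRESHOLD
```

For instance, `is_different_notation("回覧", "回らん", lambda s: s.replace("覧", "らん"))` evaluates (3 + 1) / (3 + 2) and compares the result with the threshold.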
[0049]
Table 5 shows an example in which words that are determined as different notations of “circulation”, “interface”, and “color” according to the calculated degree of similarity are added to the different notation dictionary 101 as related words.
[Table 5]

[0050]
Note that the different notation dictionary creation device of the present invention may place the document DB 111 on a network and compare the written word against a plurality of documents stored in the document DB 111 on the Internet. In this case, the information in the document DB 111 can be acquired via, for example, the WWW. When the document DB 111 is placed on a network, it is difficult to obtain the total number of documents in the DB. For this reason, as the total number of documents in Expression (1), an estimated total number of documents specific to each search unit 109 (which may be a search engine on the network) may be given in advance.
[0051]
FIG. 2 is a flowchart for explaining the different notation dictionary creation method performed by the different notation dictionary creation device of the first embodiment described above. The related word acquisition unit 103 receives the written word W (step S201) and searches the documents stored in the document DB 111 using the written word W (step S202). A document group including the written word W is then acquired as the search result (step S203). The acquired document group is stored in the search result document group storage unit 107.
[0052]
Next, the keyword extraction unit 105 extracts keywords from the document group (step S204) and determines whether the extracted keywords include the written word W (step S205). If the written word W is included (step S205: Yes), the written word W is excluded from the extracted keywords, and a search is performed for documents that include an extracted keyword and do not include the written word W. Keywords are then extracted from the document group thus obtained, and these keywords are used as related words (step S206).
[0053]
The related word data D of the related words is sent to the different notation filter 115. The different notation filter 115 calculates the similarity between a related word and the written word W (step S207). A related word whose calculated similarity is equal to or greater than a certain value is determined to be “similar” (step S208) and is added to the different notation dictionary 101 as a different notation of the written word W (step S209). It is then determined whether the similarity determination has been completed for all related words to be judged (step S210). If it has not been completed (step S210: No), the similarity of the next related word is calculated. When the similarity determination has been completed for all related words (step S210: Yes), the process ends.
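The flow of steps S201 to S210 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, documents are modelled as plain strings split on whitespace (which would not suit unsegmented Japanese text), and keyword extraction is replaced by a simple frequency count because Expression (1) is not reproduced here. The `similar` argument stands for whatever similarity test the different notation filter applies.

```python
from collections import Counter


def search(docs, include, exclude=None):
    """Return documents that contain `include` and do not contain `exclude`."""
    hits = []
    for doc in docs:
        tokens = doc.split()
        if include in tokens and (exclude is None or exclude not in tokens):
            hits.append(doc)
    return hits


def extract_keywords(docs, top_n=5):
    """Placeholder keyword extraction: the most frequent tokens (an assumption)."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return [tok for tok, _ in counts.most_common(top_n)]


def related_words(docs, word, top_n=5):
    first_group = search(docs, word)                       # S202-S203
    keywords = [k for k in extract_keywords(first_group, top_n)
                if k != word]                              # S204-S205
    related = []
    for keyword in keywords:                               # S206
        second_group = search(docs, keyword, exclude=word)
        for candidate in extract_keywords(second_group, top_n):
            if candidate not in related:
                related.append(candidate)
    return related


def expand(dictionary, docs, word, similar):
    """Add every related word judged similar to `word` (S207-S210)."""
    for candidate in related_words(docs, word):
        if similar(word, candidate):
            dictionary.setdefault(word, []).append(candidate)
    return dictionary
```

In this sketch one search is issued per extracted keyword, so the number of second-stage searches never exceeds the number of keywords.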
[0054]
The different notation dictionary creation device of the first embodiment described above creates the different notation dictionary through a two-stage process of acquiring related words from documents and then selecting different notations from among those related words, so the expansion of the different notation dictionary 101 can be executed automatically and with high reliability. Further, because the processing is automatic, the document DB 111 can easily be made large-scale, and the vocabulary can be expanded with fewer omissions than when it is expanded manually.
[0055]
In addition, because the different notation dictionary creation device of the first embodiment acquires related words through document search and keyword extraction, the number of searches executed can be kept within the number of keywords, and related words can be acquired efficiently with a small number of searches.
[0056]
Moreover, because the different notation dictionary creation device of the first embodiment extracts different notations by comparing against a word group that has already been narrowed down to related words, different notations can be extracted simply and effectively. In addition, by having the identification dictionary 113, not only simple character matches but also correspondences between different character types, such as uppercase and lowercase letters or half-width and full-width characters, and correspondences such as hiragana spellings of kanji, can be handled within the simple framework of counting matching constituent characters. Further, since the documents in the document DB 111 do not need to be given tag information such as parts of speech, a document group on the WWW can be used as the document DB 111, which makes it possible to generate a different notation dictionary without independently preparing a large number of documents.
[0057]
(Embodiment 2)
FIG. 3 is a functional block diagram for explaining the different notation dictionary creation device of the second embodiment. Note that the device shown in FIG. 3 includes some of the same components as the device shown in FIG. 1. For this reason, in the second embodiment, components similar to those of the first embodiment are given the same reference numerals, and part of their description is omitted.
[0058]
Like the different notation dictionary creation device of the first embodiment, the device of the second embodiment uses the different notation dictionary 101 as a basic dictionary and expands it. To this end, the device of the second embodiment includes: a plurality of dictionaries 305a to 305d; a dictionary query unit 303, which is a translation means that translates the written word W, a phrase included in the different notation dictionary 101, using the plurality of dictionaries 305a to 305d; the related word acquisition unit 103, which acquires related words of the written word W from the translations obtained by the dictionary query unit 303; and the different notation filter 115, which calculates the similarity between a related word and the written word W and determines whether the related word is similar to the written word W. The different notation filter 115 also functions as a related word adding unit that adds the related word to the different notation dictionary 101 when the related word is determined to be similar to the written word W.
[0059]
That is, the dictionary query unit 303 receives the written word W from the related word acquisition unit 103 and queries the plurality of bilingual dictionaries 305a to 305d with the written word W. The result obtained from each dictionary is then returned. Table 6 shows related words obtained by the query from the dictionary query unit 303. The related words shown in Table 6 are passed to the different notation filter 115 as related word data D. In the second embodiment, the similarity between the two is determined by whether the number of characters of the written word W matches the number of characters of the related word data D.
[Table 6]

[0060]
For this reason, the different notation filter 115 checks the number of characters of the written word W and the number of characters of the related word data D. Then, for example, “calculator”, “calculator”, “calculation table”, and “computer”, which have the same number of characters as the written word W “computer”, are determined to be “similar”, extracted as different notations, and registered in the different notation dictionary 101. Since noise may remain in the extracted different notation candidates, in the present embodiment a presentation/correction unit is provided ahead of registration in the different notation dictionary 101 so that the candidates can be corrected manually.
[0061]
FIG. 4 is a flowchart for explaining the different notation dictionary creation method performed by the different notation dictionary creation device of the second embodiment described above. The related word acquisition unit 103 receives the written word W (step S401), and the written word W is translated by (queried against) one of the plurality of dictionaries (step S402). It is then determined whether translation has been performed with all of the plurality of dictionaries (step S403). If there is a dictionary that has not yet been used (step S403: No), the written word W is translated by (queried against) that dictionary.
[0062]
When translation of the written word W has been completed with all of the plurality of dictionaries (step S403: Yes), the related word acquisition unit 103 acquires the words obtained as a result of the translation as related words (step S404). The different notation filter 115 calculates the similarity of a related word to the written word W based on, for example, whether the numbers of characters of the related word and the written word W match (step S405), and determines from the calculated similarity whether the related word is similar to the written word W (step S406). If the related word is similar to the written word W (step S406: Yes), the related word is added to the different notation dictionary 101 as a different notation of the written word W (step S407).
[0063]
If the different notation filter 115 determines in step S406 that the related word is not similar to the written word W (step S406: No), it does not add the related word to the different notation dictionary 101, and determines whether the processing has been completed for all related words whose similarity is to be determined (step S408). If the processing has been completed (step S408: Yes), the process ends. If it has not yet been completed (step S408: No), the similarity between the next related word and the written word W is determined.
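Steps S401 to S408 can be sketched as follows. This is a minimal sketch under assumptions: each bilingual dictionary is modelled as a hypothetical mapping from a written word to the list of words its query returns, and the names `lookup_all` and `same_length_candidates` are illustrative. How the dictionaries 305a to 305d actually produce their results is not modelled.

```python
def lookup_all(word, dictionaries):
    """Query every dictionary in turn and pool the results (S402-S404)."""
    related = []
    for dictionary in dictionaries:
        for candidate in dictionary.get(word, []):
            # Skip the written word itself and anything already collected.
            if candidate != word and candidate not in related:
                related.append(candidate)
    return related


def same_length_candidates(word, dictionaries):
    """Keep related words whose character count equals that of `word` (S405-S406)."""
    return [c for c in lookup_all(word, dictionaries) if len(c) == len(word)]
```

Because character count alone is a coarse test, the candidates returned by such a filter would still need the manual correction step that the second embodiment provides before registration.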
[0064]
As described above, the different notation dictionary creating apparatus of the second embodiment can obtain related words with a simpler configuration by using a plurality of bilingual dictionaries as a related word acquisition method.
[0065]
Note that the program for causing a computer to execute the different notation dictionary creation method of the first and second embodiments is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, floppy (R) disk (FD), or DVD. The program may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network.
[0066]
[Effects of the invention]
As described above, the invention according to claim 1 can automatically add to the dictionary those words, among the words related to a phrase in the dictionary, that are similar to the phrase (different notations). This has the effect of making it possible to create a different notation dictionary with a larger vocabulary, and of providing a different notation dictionary creation device that reduces the burden of creating and maintaining the data of the different notation dictionary.
[0067]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation device that can efficiently acquire related words with a small number of searches and thereby expand the different notation dictionary.
[0068]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation device that easily determines the similarity between a related word and a phrase.
[0069]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation device that determines the similarity of a larger number of related words for a phrase and registers similar related words in the different notation dictionary.
[0070]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation device that does not need to have its own DB for storing documents and therefore has a compact and simple configuration.
[0071]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation device that acquires related words by a simple method that uses neither large-scale document storage means nor a search of such document storage means.
[0072]
According to an embodiment of the present invention, those words, among the words related to a phrase in the dictionary, that are similar to the phrase (different notations) can be automatically added to the dictionary. This makes it possible to create a different notation dictionary with a larger vocabulary, and to provide a different notation dictionary creation method that reduces the burden of creating and maintaining the data of the different notation dictionary.
[0073]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation method that can efficiently acquire related words with a small number of searches and thereby expand the different notation dictionary.
[0074]
According to an embodiment of the present invention, it is possible to provide a different notation dictionary creation method that acquires related words by a simple method without searching large-scale document storage means.
[0075]
According to an embodiment of the present invention, it is possible to provide a program that causes a computer to execute the different notation dictionary creation method according to any one of claims 7 to 9.
[Brief description of the drawings]
FIG. 1 is a functional block diagram for explaining the configuration of the different notation dictionary creation device according to the first embodiment of the present invention.
FIG. 2 is a flowchart for explaining the different notation dictionary creation method performed by the different notation dictionary creation device of the first embodiment.
FIG. 3 is a functional block diagram for explaining the different notation dictionary creation device according to the second embodiment.
FIG. 4 is a flowchart for explaining the different notation dictionary creation method performed by the different notation dictionary creation device of the second embodiment.
[Explanation of symbols]
101 Different notation dictionary
103 Related word acquisition unit
105 Keyword extraction unit
107 Search result document group storage unit
109 Search unit
111 Document DB
113 Identification dictionary
115 Different notation filter
303 Dictionary query unit
305a-305d Bilingual dictionaries
D Related word data
W Written word

Claims

A different notation dictionary creation device comprising:
related word acquisition means for acquiring, from a plurality of documents, a related word of a phrase included in a basic dictionary;
similarity determination means for calculating the similarity between the related word acquired by the related word acquisition means and the phrase, and determining whether the related word is similar to the phrase;
related word adding means for adding the related word to the basic dictionary when the similarity determination means determines that the related word is similar to the phrase; and
replacement notation character storage means for storing replacement notation characters that replace the notation characters representing the related word and the phrase,
wherein the related word acquisition means includes:
search means for acquiring a first document group including the phrase by searching the plurality of documents for the phrase; and
keyword extraction means for extracting a keyword from the first document group,
the search means further acquires, from the plurality of documents, a second document group that includes the keyword extracted by the keyword extraction means and does not include the phrase,
the keyword extraction means further extracts a keyword from the second document group and uses that keyword as the related word, and
the similarity determination means determines the similarity between the related word and the phrase based on a match between the notation of the related word and a notation obtained by replacing at least some characters of the phrase with the replacement notation characters.

A different notation dictionary creation method in a different notation dictionary creation device, the method including:
a related word acquisition step in which related word acquisition means acquires, from a plurality of documents, a related word of a phrase included in a basic dictionary;
a similarity determination step in which similarity determination means calculates the similarity between the related word acquired in the related word acquisition step and the phrase, and determines whether the related word is similar to the phrase;
a related word adding step in which related word adding means adds the related word to the basic dictionary when it is determined in the similarity determination step that the related word is similar to the phrase; and
a replacement notation character storage step in which replacement notation character storage means stores replacement notation characters that replace the notation characters representing the related word and the phrase,
wherein the related word acquisition step includes:
a first search step in which search means acquires a first document group including the phrase by searching the plurality of documents for the phrase;
a first keyword extraction step in which keyword extraction means extracts a keyword from the first document group;
a second search step in which the search means acquires, from the plurality of documents, a second document group that includes the keyword extracted in the first keyword extraction step and does not include the phrase; and
a second keyword extraction step in which the keyword extraction means extracts a keyword from the second document group and uses that keyword as the related word, and
in the similarity determination step, the similarity determination means determines the similarity between the related word and the phrase based on a match between the notation of the related word and a notation obtained by replacing at least some characters of the phrase with the replacement notation characters.

A program for causing a computer to execute the method for creating a different notation dictionary according to claim 2.