JP3875357B2

JP3875357B2 - Word / collocation classification processing method, collocation extraction method, word / collocation classification processing device, speech recognition device, machine translation device, collocation extraction device, and word / collocation storage medium

Info

Publication number: JP3875357B2
Application number: JP16724397A
Authority: JP
Inventors: 明潮田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1996-08-02
Filing date: 1997-06-24
Publication date: 2007-01-31
Anticipated expiration: 2017-06-24
Also published as: JPH1097286A

Description

【０００１】
【発明の属する技術分野】
本発明は、単語・連語分類処理方法、連語抽出方法、単語・連語分類処理装置、音声認識装置、機械翻訳装置、連語抽出装置及び単語・連語記憶媒体に関し、特に、テキストデータの中から連語を自動的に抽出し、単語及び連語を自動的に分類する場合に好適なものである。
【０００２】
【従来の技術】
従来の単語分類処理装置には、例えば、「Brown, P., Della Pietra, V., deSouza, P., Lai, J., Mercer, R. (1992) “Class-Based n-gram Models of Natural Language”. Computational Linguistics, Vol. 18, No. 4, pp. 467-479」に記載されているように、テキストデータの中で使用されている単独の単語を統計的に処理することにより、単独の単語を自動的に分類するものがあり、この単独の単語の分類結果を用いて音声認識や機械翻訳を行っていた。
【０００３】
【発明が解決しようとする課題】
しかしながら、従来の単語分類処理装置は、単語と連語とをまとめて自動的に分類することができず、単語と連語あるいは連語と連語の対応関係や類似度を用いて、音声認識や機械翻訳を行うことができないため、音声認識や機械翻訳を正確に実行することができないという問題があった。
【０００４】
そこで、本発明の第１の目的は、単語と連語とをまとめて自動的に分類することが可能な単語・連語分類処理方法及び単語・連語分類処理装置を提供することである。
【０００５】
また、本発明の第２の目的は、大量のテキストデータから高速に連語を抽出することが可能な連語抽出装置を提供することである。
また、本発明の第３の目的は、単語と連語あるいは連語と連語の対応関係や類似度を用いることにより、正確な音声認識が可能な音声認識装置を提供することである。
【０００６】
また、本発明の第４の目的は、単語と連語あるいは連語と連語の対応関係や類似度を用いることにより、正確な機械翻訳が可能な機械翻訳装置を提供することである。
【０００７】
【課題を解決するための手段】
上述した第１の目的を達成するために、本発明によれば、テキストデータに含まれる単語と連語とを一緒に分類して、単語と連語とが混在するクラスを生成するようにしている。
【０００８】
このことにより、単語と単語とをまとめて分類するだけでなく、単語と連語あるいは連語と連語とをまとめて一緒に分類することができ、単語と連語あるいは連語と連語との対応関係や類似度を容易に判別することができる。
【０００９】
また、本発明の一態様によれば、単語を分類した単語クラスをテキストデータの単語の一次元列にマッピングして単語クラスの一次元列を生成し、テキストデータの単語クラスの一次元列において、隣接する単語クラス間の粘着度が全て所定値以上の単語クラス列を抽出してその単語クラス列にトークンを付与し、単語とトークンとを一緒に分類してから、トークンに対応する単語クラス列をその単語クラス列に属する連語で置換するようにしている。
【００１０】
このことにより、単語クラス列にトークンを付与してその単語クラス列を１つの単語とみなし、テキストデータに含まれる単語とトークンを付与された単語クラス列とを同等に取り扱って単語と連語との区別なく分類処理を行うことができる。また、単語を分類した単語クラスをテキストデータの単語の一次元列にマッピングして単語クラスの一次元列を生成し、隣接する単語クラス間の粘着度に基づいて連語を抽出することにより、テキストデータからの連語の抽出を高速に行うことができる。
【００１１】
また、上述した第２の目的を達成するために、本発明によれば、単語を分類した単語クラスをテキストデータの単語の一次元列にマッピングして単語クラスの一次元列を生成し、テキストデータの単語クラスの一次元列において、隣接する単語クラス間の粘着度が全て所定値以上の単語クラス列を抽出し、単語クラス列を構成する個々の単語クラスから、テキストデータに隣接して存在する個々の単語を別々に取り出して連語を抽出するようにしている。
【００１２】
このことにより、単語クラス列に基づいて連語を抽出することができ、テキストデータに存在する異なる単語の数よりも、それらの単語を分類した単語クラスの数のほうが少ないので、テキストデータの単語クラスの一次元列において、隣接する単語クラス間の粘着度が所定値以上の単語クラス列を抽出するほうが、テキストデータの単語の一次元列において、隣接する単語間の粘着度が所定値以上の単語列を抽出する場合に比べて、演算量及びメモリ容量を少なくすることができ、連語の抽出処理を高速に行うことができるとともに、メモリ資源を節約できる。なお、単語クラス列には、テキストデータの単語の一次元列に存在しない単語列が含まれている場合があるので、単語クラス列を構成する個々の単語クラスから、テキストデータに隣接して存在する個々の単語を別々に取り出して連語としている。
【００１３】
また、上述した第３の目的を達成するために、本発明によれば、所定のテキストデータに含まれる単語と連語とを、単語と連語とが混在するクラスに分類して格納している単語・連語辞書を参照することにより、発音音声を音声認識するようにしている。
【００１４】
このことにより、単語と連語あるいは連語と連語の対応関係や類似度を用いながら音声認識を行うことができ、正確な処理が可能になる。
また、上述した第４の目的を達成するために、本発明によれば、所定のテキストデータに含まれる単語と連語とを、単語と連語とが混在するクラスに分類して格納している単語・連語辞書に基づいて、用例文集に格納されている用例原文と入力された原文とを対応させるようにしている。
【００１５】
このことにより、用例文集に格納されている用例原文の単語が連語に置き換わった原文が入力された場合においても、入力された原文に用例原文を適用して機械翻訳を行うことができ、単語と連語あるいは連語と連語の対応関係や類似度を用いた正確な機械翻訳が可能になる。
【００１６】
【発明の実施の形態】
以下、本発明の一実施例に係わる単語・連語分類処理装置について図面を参照しながら説明する。この実施例は、所定のテキストデータに含まれる単語と連語とを、単語と連語とが混在するクラスに分類するものである。
【００１７】
図１は、本発明の一実施例に係わる単語・連語分類処理装置の機能的な構成を示すブロック図である。
図１において、単語分類手段１は、テキストデータの単語の一次元列から互いに異なる単語を抽出し、抽出された単語の集合を分割して単語クラスを生成する。
【００１８】
図２は、単語分類手段１の処理を説明するもので、テキストデータに含まれるＴ個の単語よりなる単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）から、テキストデータでの出現頻度順に並べたＶ個のボキャブラリーとしての単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝を生成し、このテキストデータのボキャブラリーとしての単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝のそれぞれに初期化クラスを割り当てる。ここで、単語の個数Ｔ個は、例えば、５０００万個であり、ボキャブラリーの個数Ｖ個は、例えば、７０００個である。
【００１９】
図２の例では、テキストデータでの出現頻度が高い、例えば、“ｔｈｅ”、“ａ”、“ｉｎ”、“ｏｆ”が、それぞれボキャブラリーとしての単語ｖ₁、ｖ₂、ｖ₃、ｖ₄に対応している。初期化クラスを割り当てられたＶ個のボキャブラリーとしての単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝は、クラスタリングによりＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝に分割される。ここで、単語クラスの個数Ｃ個は、例えば、５００個である。
【００２０】
また、図２では、例えば、“ｓｐｅａｋ”、“ｓａｙ”、“ｔｅｌｌ”、“ｔａｌｋ”・・・が単語クラスＣ₁に分類され、“ｈｅ”、“ｓｈｅ”、“ｉｔ”・・・が単語クラスＣ₅に分類され、“ｃａｒ”、“ｔｒａｃｋ”、“ｗａｇｏｎ”・・・が単語クラスＣ₃₂に分類され、“Ｔｏｙｏｔａ”、“Ｎｉｓｓａｎ”、“ＧＭ”・・・が単語クラスＣ₃₀₀に分類されている例を示している。
【００２１】
このＶ個のボキャブラリーとしての単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝よりなる単語の分類は、例えば、テキストデータに存在する２つの単語がおのおの属する２つの単語クラスをマージした場合、元のテキストデータの生成確率の減少が最も少なくなるものを同一の単語クラスに統合することにより行う。ここで、元のテキストデータのクラスバイモデルによる生成確率は、平均相互情報量ＡＭＩを用いて表現することができ、この平均相互情報量ＡＭＩは以下の式により表すことができる。
【００２２】
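【数１】
$$\mathrm{AMI} = \sum_{i}\sum_{j} \Pr(C_i, C_j)\,\log\frac{\Pr(C_i, C_j)}{\Pr(C_i)\,\Pr(C_j)} \qquad (1)$$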
【００２３】
ここで、
Ｐｒ（Ｃ_i）は、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）をその単語が属する単語クラスで置き換えた場合、そのテキストデータの単語クラスの一次元列でのクラスＣ_iの出現確率、
Ｐｒ（Ｃ_j）は、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）をその単語が属する単語クラスで置き換えた場合、そのテキストデータの単語クラスの一次元列でのクラスＣ_jの出現確率、
Ｐｒ（Ｃ_i、Ｃ_j）は、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）を、その単語が属する単語クラスで置き換えた場合、そのテキストデータの単語クラスの一次元列での単語クラスＣ_iの次に隣接して単語クラスＣ_jが出現する確率である。
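As a rough illustration of equation (1), the following Python sketch estimates the average mutual information AMI from a one-dimensional sequence of word classes. It is not taken from the patent. The function name `average_mutual_information` is hypothetical, probabilities are assumed to be relative frequencies of adjacent class pairs, Pr(C_i) and Pr(C_j) are assumed to be the left and right marginals of those pairs, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def average_mutual_information(class_seq):
    """Estimate AMI over adjacent class pairs of a class sequence (assumed estimator)."""
    if len(class_seq) < 2:
        return 0.0
    pairs = list(zip(class_seq, class_seq[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)    # counts of C_i as the first element
    right_counts = Counter(b for _, b in pairs)   # counts of C_j as the second element
    ami = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # Pr(a, b) / (Pr(a) Pr(b)) written with raw counts
        ratio = count * n / (left_counts[a] * right_counts[b])
        ami += p_ab * math.log(ratio)
    return ami
```

A sequence in which each class fully predicts the next one has a high AMI, while a sequence made of a single repeated class has an AMI of zero.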
【００２４】
図３は、図１の単語分類手段１の機能的な構成の一例を示すブロック図である。
図３において、初期化クラス設定部１０は、テキストデータの単語の一次元列｛ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T｝から互いに異なる単語を抽出し、所定の出現頻度を有する単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝のそれぞれに固有の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝を割り当てる。
【００２５】
仮マージ部１１は、単語クラスの集合｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M｝から２つの単語クラス｛Ｃ_i、Ｃ_j｝を取り出して仮マージする。
平均相互情報量算出部１２は、テキストデータの仮マージされた単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M-1｝についての平均相互情報量ＡＭＩを（１）式により算出する。この場合、Ｍ個の単語クラスの集合｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M｝から２つの単語クラス｛Ｃ_i、Ｃ_j｝を取り出す方法は、Ｍ（Ｍ−１）／２通り存在するので、Ｍ（Ｍ−１）／２回の平均相互情報量ＡＭＩの計算を行う必要がある。
【００２６】
本マージ部１３は、仮マージにより計算されたＭ（Ｍ−１）／２個の平均相互情報量ＡＭＩに基づいて、平均相互情報量ＡＭＩを最大とする２つの単語クラス｛Ｃ_i、Ｃ_j｝を単語クラスの集合｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M｝から取り出して本マージする。このことにより、本マージされた単語クラス｛Ｃ_i、Ｃ_j｝のいずれかに属する単語は、同一の単語クラスに分類される。
【００２７】
図１の単語クラス列生成手段２は、テキストデータの単語列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）を構成する個々の単語を、単語が属する単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝で置換することにより、テキストデータの単語クラス列を生成する。
【００２８】
図４は、テキストデータの単語クラスの一次元列の一例を示す図である。
図４において、単語分類手段１によりＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝が生成されているものとし、例えば、単語クラスＣ₁には、ボキャブラリーｖ₁、ｖ₃₇、・・・が属しており、単語クラスＣ₂には、ボキャブラリーｖ₃、ｖ₁₅、・・・が属しており、単語クラスＣ₃には、ボキャブラリーｖ₂、ｖ₄、・・・が属しており、単語クラスＣ₄には、ボキャブラリーｖ₇、ｖ₉、・・・が属しており、単語クラスＣ₅には、ボキャブラリーｖ₆、ｖ₈、ｖ₂₆、ｖ_V、・・・が属しており、単語クラスＣ₆には、ボキャブラリーｖ₆、ｖ₂₃、・・・が属しており、単語クラスＣ₇には、ボキャブラリーｖ₅、ｖ₁₀、・・・が属しているものとする。
【００２９】
また、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）において、例えば、単語ｗ₁が示すボキャブラリーとしての単語がｖ₁₅、単語ｗ₂が示すボキャブラリーとしての単語がｖ₂、単語ｗ₃が示すボキャブラリーとしての単語がｖ₂₃、単語ｗ₄が示すボキャブラリーとしての単語がｖ₄、単語ｗ₅が示すボキャブラリーとしての単語がｖ₅、単語ｗ₆が示すボキャブラリーとしての単語がｖ₁₅、単語ｗ₇が示すボキャブラリーとしての単語がｖ₅、単語ｗ₈が示すボキャブラリーとしての単語がｖ₂₆、単語ｗ₉が示すボキャブラリーとしての単語がｖ₃₇、単語ｗ₁₀が示すボキャブラリーとしての単語がｖ₂、・・・、単語ｗ_Tが示すボキャブラリーとしての単語がｖ₈であるとする。
【００３０】
この場合、ボキャブラリーｖ₁₅は単語クラスＣ₂に属しているので、単語ｗ₁は単語クラスＣ₂にマッピングされ、ボキャブラリーｖ₂は単語クラスＣ₃に属しているので、単語ｗ₂は単語クラスＣ₃にマッピングされ、ボキャブラリーｖ₂₃は単語クラスＣ₆に属しているので、単語ｗ₃は単語クラスＣ₆にマッピングされ、ボキャブラリーｖ₄は単語クラスＣ₃に属しているので、単語ｗ₄は単語クラスＣ₃にマッピングされ、ボキャブラリーｖ₅は単語クラスＣ₇に属しているので、単語ｗ₅は単語クラスＣ₇にマッピングされ、ボキャブラリーｖ₁₅は単語クラスＣ₂に属しているので、単語ｗ₆は単語クラスＣ₂にマッピングされ、ボキャブラリーｖ₅は単語クラスＣ₇に属しているので、単語ｗ₇は単語クラスＣ₇にマッピングされ、ボキャブラリーｖ₂₆は単語クラスＣ₅に属しているので、単語ｗ₈は単語クラスＣ₅にマッピングされ、ボキャブラリーｖ₃₇は単語クラスＣ₁に属しているので、単語ｗ₉は単語クラスＣ₁にマッピングされ、ボキャブラリーｖ₂は単語クラスＣ₃に属しているので、単語ｗ₁₀は単語クラスＣ₃にマッピングされ、・・・、ボキャブラリーｖ₈は単語クラスＣ₅に属しているので、単語ｗ_Tは単語クラスＣ₅にマッピングされる。
【００３１】
すなわち、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）が、Ｃ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝によりマッピングされた結果として、テキストデータの単語クラスの一次元列（Ｃ₂Ｃ₃Ｃ₆Ｃ₃Ｃ₇Ｃ₂Ｃ₇Ｃ₅Ｃ₁Ｃ₃・・・Ｃ₅）が１対１対応で生成される。
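The one-to-one mapping from the word sequence to the class sequence described for FIG. 4 can be sketched in Python as follows. The function name `to_class_sequence` and the dictionary representation of the word classes are assumptions made for illustration. The sample data reproduces only the class memberships that the text uses for w₁ to w₁₀.

```python
def to_class_sequence(words, word_to_class):
    """Replace every word in the sequence by the class it belongs to (one-to-one)."""
    return [word_to_class[w] for w in words]


# Class memberships taken from the FIG. 4 description (only the ones needed here).
word_to_class = {
    "v15": "C2", "v2": "C3", "v23": "C6", "v4": "C3",
    "v5": "C7", "v26": "C5", "v37": "C1",
}
words = ["v15", "v2", "v23", "v4", "v5", "v15", "v5", "v26", "v37", "v2"]
class_seq = to_class_sequence(words, word_to_class)
```

The resulting `class_seq` has the same length as `words`, which is what allows chains found in the class sequence to be traced back to word positions later.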
【００３２】
図１の単語クラス列抽出手段３は、テキストデータの単語クラスの一次元列において、隣接する単語クラス間の粘着度が全て所定値以上の単語クラス列を、テキストデータの単語クラスの一次元列から抽出する。ここで、単語クラス間の粘着度は、単語クラス列を構成する単語クラス間のつながりの強さを示す指標であり、この粘着度を表現するものとして、例えば、相互情報量ＭＩ、相関係数、コサインメジャー、尤度比（ｌｉｋｅｌｉｈｏｏｄ　ｒａｔｉｏ）などがある。
【００３３】
以下の説明では、単語クラス間の粘着度として、相互情報量ＭＩを用いることにより、テキストデータの単語クラスの一次元列から単語クラス列を抽出する場合を例にとる。
【００３４】
図５は、単語クラス列抽出手段３により抽出された単語クラス列の一例を示す図である。
図５において、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇・・・ｗ_T）に対してマッピングされた結果として、テキストデータの単語クラスの一次元列（Ｃ₂Ｃ₃Ｃ₆Ｃ₃Ｃ₇Ｃ₂Ｃ₇・・・Ｃ₅）が１対１対応で生成されているものとする。このテキストデータの単語クラスの一次元列（Ｃ₂Ｃ₃Ｃ₆Ｃ₃Ｃ₇Ｃ₂Ｃ₇・・・Ｃ₅）から、隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を順次に取り出し、隣接する２つの単語クラス（Ｃ_i、Ｃ_j）についての相互情報量ＭＩ（Ｃ_i、Ｃ_j）を、以下の（２）式により計算する。
【００３５】
$$\mathrm{MI}(C_i, C_j) = \log\frac{\Pr(C_i, C_j)}{\Pr(C_i)\,\Pr(C_j)} \qquad (2)$$
そして、隣接する２つの単語クラス（Ｃ_i、Ｃ_j）についての相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨ以上の場合、これら隣接する２つの単語クラス（Ｃ_i、Ｃ_j）をクラスチェーンで結んで互いに関連づける。
【００３６】
例えば、図５において、隣接する２つの単語クラス（Ｃ₂、Ｃ₃）についての相互情報量ＭＩ（Ｃ₂、Ｃ₃）、隣接する２つの単語クラス（Ｃ₃、Ｃ₆）についての相互情報量ＭＩ（Ｃ₃、Ｃ₆）、隣接する２つの単語クラス（Ｃ₆、Ｃ₃）についての相互情報量ＭＩ（Ｃ₆、Ｃ₃）、隣接する２つの単語クラス（Ｃ₃、Ｃ₇）についての相互情報量ＭＩ（Ｃ₃、Ｃ₇）、隣接する２つの単語クラス（Ｃ₇、Ｃ₂）についての相互情報量ＭＩ（Ｃ₇、Ｃ₂）、隣接する２つの単語クラス（Ｃ₂、Ｃ₇）についての相互情報量ＭＩ（Ｃ₂、Ｃ₇）、・・・を（２）式により順次に計算する。
【００３７】
そして、相互情報量ＭＩ（Ｃ₂、Ｃ₃）、相互情報量ＭＩ（Ｃ₃、Ｃ₇）、相互情報量ＭＩ（Ｃ₇、Ｃ₂）、・・・がしきい値ＴＨ以上で、相互情報量ＭＩ（Ｃ₃、Ｃ₆）、相互情報量ＭＩ（Ｃ₆、Ｃ₃）、相互情報量ＭＩ（Ｃ₂、Ｃ₇）、・・・がしきい値ＴＨより小さい場合、隣接する２つの単語クラス（Ｃ₂、Ｃ₃）、（Ｃ₃、Ｃ₇）、（Ｃ₇、Ｃ₂）、・・・をそれぞれクラスチェーンで結ぶことにより、単語クラス列Ｃ₂−Ｃ₃、Ｃ₃−Ｃ₇−Ｃ₂、・・・を抽出する。
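The class chain extraction described for FIG. 5 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patent's implementation: `pairwise_mi` and `extract_chains` are hypothetical names, probabilities are assumed to be relative frequencies of adjacent pairs, and a chain is reported as a span of positions (end exclusive) in which every adjacent pair has MI at or above the threshold TH.

```python
import math
from collections import Counter


def pairwise_mi(class_seq):
    """Mutual information of each distinct adjacent class pair (assumed estimator)."""
    pairs = list(zip(class_seq, class_seq[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return {(a, b): math.log(c * n / (left[a] * right[b]))
            for (a, b), c in pair_counts.items()}


def extract_chains(class_seq, mi, threshold):
    """Return (start, end) spans of maximal runs whose adjacent pairs are all linked."""
    chains = []
    start = 0
    for i in range(1, len(class_seq) + 1):
        linked = (i < len(class_seq)
                  and mi.get((class_seq[i - 1], class_seq[i]), float("-inf")) >= threshold)
        if not linked:
            if i - start >= 2:          # a chain needs at least two classes
                chains.append((start, i))
            start = i
    return chains


# MI values are set by hand to mirror the situation described for FIG. 5.
seq = ["C2", "C3", "C6", "C3", "C7", "C2", "C7"]
mi = {("C2", "C3"): 2.0, ("C3", "C7"): 2.0, ("C7", "C2"): 2.0,
      ("C3", "C6"): 0.1, ("C6", "C3"): 0.1, ("C2", "C7"): 0.1}
chains = extract_chains(seq, mi, threshold=1.0)
```

With these hand-set values the spans correspond to the chains C₂−C₃ and C₃−C₇−C₂ named in the text. In practice the `mi` table would come from `pairwise_mi` applied to the whole class sequence.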
【００３８】
図６は、図１の単語クラス列抽出手段３の機能的な構成の一例を示すブロック図である。
図６において、単語クラス取出部３０は、テキストデータの単語クラスの一次元列から、隣接して存在する２つの単語クラス（Ｃ_i、Ｃ_j）を順次に取り出す。
【００３９】
相互情報量算出部３１は、単語クラス取出部３０により取り出した２つの単語クラス（Ｃ_i、Ｃ_j）の相互情報量ＭＩ（Ｃ_i、Ｃ_j）を（２）式により算出する。
【００４０】
クラスチェーン結合部３２は、相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値以上の２つの単語クラス（Ｃ_i、Ｃ_j）をクラスチェーンで結ぶ。
図１のトークン付与手段４は、単語クラス列抽出手段３によりクラスチェーンで結ばれた単語クラス列にトークンを付与する。
【００４１】
図７は、トークン付与手段４により付与されたトークンの一例を示す図である。
図７において、クラスチェーンで結ばれた単語クラス列は、例えば、Ｃ₁−Ｃ₃、Ｃ₁−Ｃ₇、・・・、Ｃ₂−Ｃ₃、Ｃ₂−Ｃ₁₁、・・・、Ｃ₃₀₀−Ｃ₃₂、・・・、Ｃ₁−Ｃ₃−Ｃ₈₀、Ｃ₁−Ｃ₄−Ｃ₅、Ｃ₃−Ｃ₇−Ｃ₂、・・・、Ｃ₁−Ｃ₉−Ｃ₁₁−Ｃ₃₂、・・・とする。この場合、単語クラス列Ｃ₁−Ｃ₃に対してトークンｔ₁を付与し、単語クラス列Ｃ₁−Ｃ₇に対してトークンｔ₂を付与し、・・・、単語クラス列Ｃ₂−Ｃ₃に対してトークンｔ₃を付与し、単語クラス列Ｃ₂−Ｃ₁₁に対してトークンｔ₄を付与し、・・・、単語クラス列Ｃ₃₀₀−Ｃ₃₂に対してトークンｔ₅を付与し、・・・、単語クラス列Ｃ₁−Ｃ₃−Ｃ₈₀に対してトークンｔ₆を付与し、単語クラス列Ｃ₁−Ｃ₄−Ｃ₅に対してトークンｔ₇を付与し、単語クラス列Ｃ₃−Ｃ₇−Ｃ₂に対してトークンｔ₈を付与し、・・・、単語クラス列Ｃ₁−Ｃ₉−Ｃ₁₁−Ｃ₃₂に対してトークンｔ₉を付与する。
【００４２】
図１の単語・トークン列生成手段５は、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇・・・ｗ_T）のうち、単語クラス列抽出手段３により抽出された単語クラス列に属する単語列をトークンで置換することにより、テキストデータの単語・トークンの一次元列を生成する。
【００４３】
図８は、テキストデータの単語・トークンの一次元列の一例を示す図である。図８において、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇・・・ｗ_T）に対してマッピングされた結果として、テキストデータの単語クラスの一次元列（Ｃ₂Ｃ₃Ｃ₆Ｃ₃Ｃ₇Ｃ₂Ｃ₇・・・Ｃ₅）が１対１対応で生成されているものとし、クラスチェーンで結ばれた単語クラス列Ｃ₂−Ｃ₃、Ｃ₃−Ｃ₇−Ｃ₂、・・・に対して、図７に示すように、トークンｔ₃、ｔ₈、・・・が付与されているものとする。
【００４４】
この場合、クラスチェーンで結ばれた単語クラス列Ｃ₂−Ｃ₃に属するテキストデータの単語列（ｗ₁ｗ₂）をトークンｔ₃で置き換え、クラスチェーンで結ばれた単語クラス列Ｃ₃−Ｃ₇−Ｃ₂に属するテキストデータの単語列（ｗ₄ｗ₅ｗ₆）をトークンｔ₈で置き換えることにより、テキストデータの単語・トークンの一次元列（ｔ₃ｗ₃ｔ₈ｗ₇・・・ｗ_T）を生成する。
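The token replacement of FIG. 8 can be sketched in Python as follows. The function name `replace_with_tokens` is hypothetical. The left-to-right, longest-chain-first matching is an assumption of this sketch, since the text describes processing the chains one at a time. The record of replaced word strings is added here for illustration, so that tokens can later be turned back into collocations.

```python
def replace_with_tokens(words, classes, chain_tokens):
    """Replace word strings whose class string is a known chain by that chain's token.

    chain_tokens maps a tuple of classes (a class chain) to its token.
    Returns the word/token sequence and, per token, the word strings it replaced.
    """
    max_len = max((len(chain) for chain in chain_tokens), default=0)
    out, replaced = [], {}
    i = 0
    while i < len(words):
        # Try the longest possible chain starting at position i first.
        for length in range(min(max_len, len(words) - i), 1, -1):
            token = chain_tokens.get(tuple(classes[i:i + length]))
            if token is not None:
                out.append(token)
                replaced.setdefault(token, []).append(tuple(words[i:i + length]))
                i += length
                break
        else:
            out.append(words[i])        # no chain starts here: keep the word
            i += 1
    return out, replaced


words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
classes = ["C2", "C3", "C6", "C3", "C7", "C2", "C7"]
chain_tokens = {("C2", "C3"): "t3", ("C3", "C7", "C2"): "t8"}
sequence, replaced = replace_with_tokens(words, classes, chain_tokens)
```

For the FIG. 8 data this yields the word/token sequence t₃ w₃ t₈ w₇, matching the sequence given in the text.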
【００４５】
図９は、テキストデータの単語・トークンの一次元列の一例を英文を例にとって示す図である。
図９（ｂ）のテキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇ｗ₈ｗ₉ｗ₁₀ｗ₁₁ｗ₁₂ｗ₁₃ｗ₁₄ｗ₁₅）として、図９（ａ）の“Ｈｅ　ｗｅｎｔ　ｔｏ　ｔｈｅ　ａｐａｒｔｍｅｎｔ　ｂｙ　ｂｕｓ　ａｎｄ　ｓｈｅ　ｗｅｎｔ　ｔｏ　Ｎｅｗ　Ｙｏｒｋ　ｂｙ　ｐｌａｎｅ”が対応しているものとし、この単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇ｗ₈ｗ₉ｗ₁₀ｗ₁₁ｗ₁₂ｗ₁₃ｗ₁₄ｗ₁₅）に１対１で対応する単語クラスの一次元列が図９（ｃ）の（Ｃ₅Ｃ₉₀Ｃ₃Ｃ₂₁Ｃ₁₈Ｃ₁₀₁Ｃ₃₂Ｃ₂Ｃ₅Ｃ₉₀Ｃ₃Ｃ₆₃Ｃ₂₈Ｃ₁₀₁Ｃ₃₂）で与えられるものとする。
【００４６】
この単語クラスの一次元列（Ｃ₅Ｃ₉₀Ｃ₃Ｃ₂₁Ｃ₁₈Ｃ₁₀₁Ｃ₃₂Ｃ₂Ｃ₅Ｃ₉₀Ｃ₃Ｃ₆₃Ｃ₂₈Ｃ₁₀₁Ｃ₃₂）において、隣接する２つの単語クラス（Ｃ_i、Ｃ_j）の相互情報量ＭＩ（Ｃ_i、Ｃ_j）を計算し、相互情報量ＭＩ（Ｃ₆₃、Ｃ₂₈）が所定のしきい値ＴＨ以上、相互情報量ＭＩ（Ｃ₅、Ｃ₉₀）、ＭＩ（Ｃ₉₀、Ｃ₃）、ＭＩ（Ｃ₃、Ｃ₂₁）、ＭＩ（Ｃ₂₁、Ｃ₁₈）、ＭＩ（Ｃ₁₈、Ｃ₁₀₁）、ＭＩ（Ｃ₁₀₁、Ｃ₃₂）、ＭＩ（Ｃ₃₂、Ｃ₂）、ＭＩ（Ｃ₂、Ｃ₅）、ＭＩ（Ｃ₅、Ｃ₉₀）、ＭＩ（Ｃ₉₀、Ｃ₃）、ＭＩ（Ｃ₃、Ｃ₆₃）、ＭＩ（Ｃ₂₈、Ｃ₁₀₁）及びＭＩ（Ｃ₁₀₁ 、Ｃ₃₂）が所定のしきい値ＴＨより小さい場合、隣接する２つの単語クラス（Ｃ₆₃、Ｃ₂₈）が、図９（ｄ）に示すように、クラスチェーンで結ばれる。
【００４７】
このクラスチェーンで結ばれた２つの単語クラス（Ｃ₆₃、Ｃ₂₈）はトークンｔ₁に置き換えられ、図９（ｅ）に示すように、単語・トークンの一次元列（ｗ₁ｗ₂ｗ₃ｗ₄ｗ₅ｗ₆ｗ₇ｗ₈ｗ₉ｗ₁₀ｗ₁₁ｔ₁ｗ₁₄ｗ₁₅）が生成される。
【００４８】
図１の単語・トークン分類手段６は、テキストデータの単語・トークンの一次元列のＮ個の単語の集合｛ｗ₁、ｗ₂、ｗ₃、ｗ₄、・・・、ｗ_N｝又はＬ個のトークンの集合｛ｔ₁、ｔ₂、ｔ₃、ｔ₄、・・・、ｔ_L｝を分割することにより、単語とトークンとが混在して存在するＤ個の単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝を生成する。
【００４９】
この単語・トークン分類手段６では、トークンを付与された単語クラス列が１つの単語のようにみなされ、テキストデータに含まれる単語｛ｗ₁、ｗ₂、ｗ₃、ｗ₄、・・・、ｗ_N｝とトークン｛ｔ₁、ｔ₂、ｔ₃、ｔ₄、・・・、ｔ_L｝とを同等に取り扱うことができるので、単語｛ｗ₁、ｗ₂、ｗ₃、ｗ₄、・・・、ｗ_N｝とトークン｛ｔ₁、ｔ₂、ｔ₃、ｔ₄、・・・、ｔ_L｝との区別なく分類処理を行うことができる。
図１０は、図１の単語・トークン分類手段６の機能的な構成を示すブロック図である。
【００５０】
図１０において、初期化クラス設定部４０は、テキストデータの単語・トークン列から互いに異なる単語と互いに異なるトークンとを抽出し、所定の出現頻度を有するＮ個の単語｛ｗ₁、ｗ₂、ｗ₃、ｗ₄、・・・、ｗ_N｝とＬ個のトークン｛ｔ₁、ｔ₂、ｔ₃、ｔ₄、・・・、ｔ_L｝とのそれぞれに固有の単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_Y｝を割り当てる。
【００５１】
仮マージ部４１は、単語・トークンクラスの集合｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_M｝から２つの単語・トークンクラス｛Ｔ_i、Ｔ_j｝を取り出して仮マージする。
【００５２】
平均相互情報量算出部４２は、テキストデータの仮マージされた単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_M-1｝についての平均相互情報量ＡＭＩを（１）式により算出する。この場合、Ｍ個の単語・トークンクラスの集合｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_M｝から、２つの単語・トークンクラス｛Ｔ_i、Ｔ_j｝を取り出す方法は、Ｍ（Ｍ−１）／２通り存在するので、Ｍ（Ｍ−１）／２回の平均相互情報量ＡＭＩの計算を行う必要がある。
【００５３】
本マージ部４３は、仮マージにより計算されたＭ（Ｍ−１）／２個の平均相互情報量ＡＭＩに基づいて、平均相互情報量ＡＭＩを最大とする２つの単語・トークンクラス｛Ｔ_i、Ｔ_j｝を単語・トークンクラスの集合｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_M｝から取り出して本マージする。このことにより、本マージされた単語・トークンクラス｛Ｔ_i、Ｔ_j｝のいずれかに属する単語及びトークンは、同一の単語・トークンクラスに分類される。
【００５４】
図１の連語置換手段７は、単語・トークンクラスの中のトークンを、単語・トークン列生成手段５により置換された単語列に逆置換して連語を生成する。
図１１は、クラスチェーンと連語との関係を説明する図である。
【００５５】
図１１において、例えば、単語クラスＣ₃₀₀と単語クラスＣ₃₂とがクラスチェーンで結ばれ、このクラスチェーンで結ばれた単語クラス列Ｃ₃₀₀−Ｃ₃₂にトークンｔ₅が付与されているとする。また、単語“Ｔｏｙｏｔａ”、“Ｎｉｓｓａｎ”、“ＧＭ”・・・などのＡ個の単語が単語クラスＣ₃₀₀に属し、単語“ｃａｒ”、“ｔｒａｃｋ”、“ｗａｇｏｎ”・・・などのＢ個の単語が単語クラスＣ₃₂に属しているものとする。
【００５６】
この場合、連語の候補として、図１１（ｂ）に示すように、“Ｔｏｙｏｔａ　ｃａｒ”、“Ｔｏｙｏｔａ　ｔｒａｃｋ”、“Ｔｏｙｏｔａ　ｗａｇｏｎ”、“Ｎｉｓｓａｎ　ｃａｒ”、“Ｎｉｓｓａｎ　ｔｒａｃｋ”、“Ｎｉｓｓａｎ　ｗａｇｏｎ”、“ＧＭ　ｃａｒ”、“ＧＭ　ｔｒａｃｋ”、“ＧＭ　ｗａｇｏｎ”、・・・など、単語クラスＣ₃₀₀に属するＡ個の単語と単語クラスＣ₃₂に属するＢ個の単語との組み合わせの数Ａ×Ｂだけ連語の候補が生成される。この連語の候補の中にはテキストデータに存在しない連語も含まれているので、テキストデータをスキャンすることにより、これらの連語の候補からテキストデータに存在する連語のみを抽出する。例えば、テキストデータには、“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”は存在するが、“Ｔｏｙｏｔａ　ｃａｒ”、“Ｔｏｙｏｔａ　ｔｒａｃｋ”、“Ｎｉｓｓａｎ　ｃａｒ”、“Ｎｉｓｓａｎ　ｗａｇｏｎ”、“ＧＭ　ｃａｒ”、“ＧＭ　ｔｒａｃｋ”及び“ＧＭ　ｗａｇｏｎ”は存在しない場合、図１１（ｃ）に示すように、“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”のみが連語としてテキストデータから抽出される。
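The step from a class chain to actual collocations (FIG. 11) can be sketched in Python as follows. The name `attested_collocations` and the toy sentence are assumptions for illustration only. The sketch generates the A×B candidates and keeps only those that occur as adjacent words in the text.

```python
from itertools import product


def attested_collocations(chain_members, words):
    """Keep only candidate word strings that actually occur adjacently in the text.

    chain_members is a list with one word list per class of the chain.
    """
    n = len(chain_members)
    # Every n-word window of the text, collected once by scanning the text.
    seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return [cand for cand in product(*chain_members) if cand in seen]


# Classes C300 and C32 as in the FIG. 11 description; the sentence is made up.
c300 = ["Toyota", "Nissan", "GM"]
c32 = ["car", "track", "wagon"]
text = "a Nissan track passed a Toyota wagon".split()
collocations = attested_collocations([c300, c32], text)
```

Of the nine candidates, only the two that appear in the toy text survive, which mirrors the filtering described for FIG. 11(c).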
【００５７】
図１２は、Ｃ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝、Ｄ個の単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝及びＤ個の単語・連語クラス｛Ｒ₁、Ｒ₂、Ｒ₃、Ｒ₄、・・・、Ｒ_D｝の一例を示す図である。
【００５８】
図１２（ａ）において、Ｃ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝が、図１の単語分類手段１により生成され、例えば、“ｈｅ”、“ｓｈｅ”、“ｉｔ”・・・などの単語が単語クラスＣ₅に属し、“Ｙｏｒｋ”、“Ｌｏｎｄｏｎ”・・・などの単語が単語クラスＣ₂₈に属し、“ｃａｒ”、“ｔｒａｃｋ”、“ｗａｇｏｎ”・・・などの単語が単語クラスＣ₃₂に属し、“ｎｅｗ”、“ｏｌｄ”・・・などの単語が単語クラスＣ₆₃に属し、“Ｔｏｙｏｔａ”、“Ｎｉｓｓａｎ”、“ＧＭ”・・・などの単語が単語クラスＣ₃₀₀に属しているものとする。また、テキストデータには、“Ｎｅｗ　Ｙｏｒｋ”、“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”の連語が多数存在しているものとする。
【００５９】
このＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝をテキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）に１対１対応でマッピングした単語クラスの一次元列において、図１の単語クラス列抽出手段３は、“ｎｅｗ”が属する単語クラスＣ₆₃と“Ｙｏｒｋ”が属する単語クラスＣ₂₈との粘着度が大きいと判断し、単語クラスＣ₆₃と単語クラスＣ₂₈とをクラスチェーンで結ぶ。また、単語クラス列抽出手段３は、“Ｔｏｙｏｔａ”及び“Ｎｉｓｓａｎ”が属する単語クラスＣ₃₀₀と“ｔｒａｃｋ”及び“ｗａｇｏｎ”が属する単語クラスＣ₃₂との粘着度が大きいと判断し、単語クラスＣ₃₀₀と単語クラスＣ₃₂とをクラスチェーンで結ぶ。
【００６０】
トークン付与手段４は、単語クラス列Ｃ₆₃−Ｃ₂₈にトークンｔ₁を付与し、単語クラス列Ｃ₃₀₀−Ｃ₃₂にトークンｔ₅を付与する。
単語・トークン列生成手段５は、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）に存在する“Ｎｅｗ　Ｙｏｒｋ”をトークンｔ₁で置き換え、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）に存在する“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”をトークンｔ₅で置き換えた単語・トークンの一次元列を生成する。
【００６１】
単語・トークン分類手段６は、この単語・トークンの一次元列に存在する“ｈｅ”、“ｓｈｅ”、“ｉｔ”、“Ｌｏｎｄｏｎ”、“ｃａｒ”、“ｔｒａｃｋ”、“ｗａｇｏｎ”・・・などの単語及び“ｔ₁”、“ｔ₅”などのトークンについての分類処理を行い、図１２（ｂ）のＤ個の単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝を生成する。
【００６２】
単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝において、例えば、“ｈｅ”、“ｓｈｅ”、“ｉｔ”・・・などの単語やトークンが単語・トークンクラスＴ₅に属し、“ｔ₁”、“Ｌｏｎｄｏｎ”・・・などの単語やトークンが単語・トークンクラスＴ₂₈に属し、“ｃａｒ”、“ｔｒａｃｋ”、“ｗａｇｏｎ”、“ｔ₅”・・・などの単語やトークンが単語・トークンクラスＴ₃₂に属し、“ｎｅｗ”、“ｏｌｄ”・・・などの単語やトークンが単語・トークンクラスＴ₆₃に属し、“Ｔｏｙｏｔａ”、“Ｎｉｓｓａｎ”、“ＧＭ”・・・などの単語やトークンが単語・トークンクラスＴ₃₀₀に属している。このように、単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝には、単語とトークンとの区別なく、単語とトークンとが混在して分類されている。
【００６３】
連語置換手段７は、図１２（ｂ）の単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝に存在する“ｔ₁”、“ｔ₅”などのトークンを、テキストデータの単語の一次元列に存在する連語で逆置換することにより、図１２（ｃ）の単語・連語クラス｛Ｒ₁、Ｒ₂、Ｒ₃、Ｒ₄、・・・、Ｒ_D｝を生成する。例えば、単語・トークンクラスＴ₂₈に属しているトークンｔ₁は、単語・トークン列生成手段５により、テキストデータの単語の一次元列に存在する“Ｎｅｗ　Ｙｏｒｋ”と置換されたものなので、このトークンｔ₁を“Ｎｅｗ　Ｙｏｒｋ”で逆置換することにより、単語・連語クラスＲ₂₈を生成し、単語・トークンクラスＴ₃₂に属しているトークンｔ₅は、単語・トークン列生成手段５により、テキストデータの単語の一次元列に存在する“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”と置換されたものなので、このトークンｔ₅を“Ｎｉｓｓａｎ　ｔｒａｃｋ”及び“Ｔｏｙｏｔａ　ｗａｇｏｎ”で逆置換することにより、単語・連語クラスＲ₃₂を生成する。
【００６４】
図１３は、図１の単語・連語分類処理装置を実現するシステム構成を示すブロック図である。
図１３において、単語・連語分類処理部４１のメモリインターフェース４２、４６、ＣＰＵ４３、ＲＯＭ４４、ワークＲＡＭ４５、ＲＡＭ４７、ドライバ７１及び通信インタフェース７２はバス４８を介して互いに接続され、テキストデータ４０が単語・連語分類処理部４１に入力されると、ＲＯＭ４４に格納されているプログラムに従って、ＣＰＵ４３はテキストデータ４０を処理し、テキストデータ４０の単語及び連語の分類処理を行う。テキストデータ４０の単語及び連語の分類処理結果は、単語・連語辞書４９に格納される。なお、テキストデータ４０や単語及び連語の分類処理結果を通信インタフェース７２から通信ネットワーク７３を介して送信したり、受信したりすることも可能である。
【００６５】
また、単語及び連語の分類処理を行うプログラムを、ハードディスク７４、ＩＣメモリカード７５、磁気テープ７６、フロッピーディスク７７またはＣＤ−ＲＯＭやＤＶＤ−ＲＯＭなどの光ディスク７８による記憶媒体からＲＡＭ４７にロードした後、このプログラムをＣＰＵ４３で実行させるようにしてもよい。
【００６６】
さらに、単語及び連語の分類処理を行うプログラムを、通信インタフェース７２を介して通信ネットワーク７３から取り出すこともできる。通信インタフェース７２と接続される通信ネットワーク７３として、例えば、ＬＡＮ（Ｌｏｃａｌ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）、ＷＡＮ（Ｗｉｄｅ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）、インターネット、アナログ電話網、デジタル電話網（ＩＳＤＮ：Ｉｎｔｅｇｒａｔｅｄ　Ｓｅｒｖｉｃｅｓ　Ｄｉｇｉｔａｌ　Ｎｅｔｗｏｒｋ）、ＰＨＳ（パーソナルハンディホンシステム）や衛星通信などの無線通信網などを用いることが可能である。
【００６７】
図１４は、図１の単語・連語分類処理装置の動作を示すフローチャートである。
図１４において、まず、ステップＳ１に示すように、単語クラスタリング処理を行う。この単語クラスタリング処理では、複数の単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）としてのテキストデータから、互いに異なるＶ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝を抽出し、Ｖ個の単語の集合｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝をＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝に分割する第１のクラスタリング処理を行う。
【００６８】
ここで、Ｖ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝それぞれに単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝を割り当ててから、Ｖ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝についてマージ処理を行うことにより、Ｖ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝の個数を１つずつ減らしてＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝を生成する場合、Ｖが７０００もの数となって大きなものとなるときは、マージ処理を行うための（１）式の平均相互情報量ＡＭＩの計算回数が莫大なものとなり、現実的ではなくなる。このため、ウィンドウ処理を行って、マージ処理を行う単語クラスの数を減らすようにする。
【００６９】
図１５は、ウィンドウ処理を説明する図である。
図１５（ａ）において、テキストデータのＶ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝それぞれに割り当てられたＶ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝のうち、テキストデータでの出現頻度の大きい単語に割り当てられたＣ＋１個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C、Ｃ_C+1｝を取り出し、このＣ＋１個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C、Ｃ_C+1｝についてのマージ処理を行う。
【００７０】
ここで、図１５（ｂ）に示すように、Ｍ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M｝は、ウィンドウ内のＣ＋１個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C、Ｃ_C+1｝についてのマージ処理を行った場合、Ｍ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M｝の数が１つ減ってＭ−１個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_M-1｝となるとともに、ウィンドウ内のＣ＋１個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C、Ｃ_C+1｝の数も１つ減ってＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝となる。
【００７１】
この場合、図１５（ｃ）に示すように、ウィンドウ外の単語クラス｛Ｃ_C+1、・・・、Ｃ_M-1｝のうち、テキストデータでの出現頻度が最も大きい単語クラスＣ_C+1をウィンドウ内に入れ、ウィンドウ内の単語クラスの数が一定に保たれるようにする。
【００７２】
そして、ウィンドウ外に単語クラスがなくなり、図１５（ｄ）のＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝が生成された時に、単語クラスタリング処理を終了する。
【００７３】
なお、上述した実施例では、ウィンドウ内の単語クラスの個数をＣ＋１個に設定したが、Ｃ＋１個以外のＶ個未満の数でもよく、また、途中で変化させるようにしてもよい。
【００７４】
図１６は、ステップＳ１の単語クラスタリング処理を示すフローチャートである。
図１６において、まず、ステップＳ１０に示すように、Ｔ個の単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）としてのテキストデータに基づいて、重複を除いた全てのＶ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝の出現頻度を調べ、これらのＶ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝を出現頻度の高い単語から順に並べて、これらのＶ個の単語｛ｖ₁、ｖ₂、ｖ₃、ｖ₄、・・・、ｖ_V｝のそれぞれをＶ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝に割り当てる。
【００７５】
次に、ステップＳ１１に示すように、Ｖ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_V｝の単語のうち、出現頻度の高い単語クラスの単語から、Ｖ個未満のＣ＋１個の単語クラスの単語を１つのウィンドウ内の単語クラスの単語とする。
【００７６】
次に、ステップＳ１２に示すように、１つのウィンドウ内の単語クラスの単語の中で、全ての組み合わせの仮ペアを作り、各仮ペアを仮マージした時の平均相互情報量ＡＭＩを（１）式により計算する。
【００７７】
次に、ステップＳ１３に示すように、全ての組み合わせの仮ペアについての平均相互情報量ＡＭＩのうち、最大となる平均相互情報量ＡＭＩを有する仮ペアを本マージすることにより、単語クラスを１つだけ減らし、本マージ後の１つのウィンドウ内の単語クラスの単語を更新する。
【００７８】
次に、ステップＳ１４に示すように、ウィンドウ外の単語クラスはなくなり、かつ、ウィンドウ内の単語クラスはＣ個になったかどうかを判断し、この条件が成り立たない場合、ステップＳ１５に進み、現在のウィンドウよりも外側にあり、最大の出現頻度を有するクラスの単語をウィンドウ内に入れ、ステップＳ１２に戻り、以上の処理を繰り返すことにより、単語クラスの数を減少させる。
【００７９】
一方、ステップＳ１４の条件が成り立ち、ウィンドウ外に単語クラスがなくなり、単語クラスの数がＣ個となった場合、ステップＳ１６に進み、ウィンドウ内のＣ個の単語クラス｛Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、・・・、Ｃ_C｝をメモリに記憶する。
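Steps S10 to S16 can be sketched in Python as follows. This is a naive illustration under stated assumptions rather than the patent's implementation: the names `ami` and `cluster_words` are hypothetical, AMI is recomputed from scratch for every trial merge (practical only for tiny inputs), probabilities are relative frequencies of adjacent pairs, and ties between equally good merges are broken by taking the first pair found.

```python
import math
from collections import Counter
from itertools import combinations


def ami(class_seq):
    """Average mutual information over adjacent class pairs (assumed estimator)."""
    pairs = list(zip(class_seq, class_seq[1:]))
    n = len(pairs)
    if n == 0:
        return 0.0
    pc = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log(c * n / (left[a] * right[b]))
               for (a, b), c in pc.items())


def cluster_words(words, num_classes):
    """Windowed greedy merging: returns a word -> class id mapping."""
    # S10: one class per distinct word, ordered by descending frequency.
    vocab = [w for w, _ in Counter(words).most_common()]
    word_to_class = {w: i for i, w in enumerate(vocab)}
    # S11: the window holds the num_classes + 1 most frequent classes.
    window = list(range(min(num_classes + 1, len(vocab))))
    next_outside = len(window)
    while len(window) > num_classes:
        # S12: trial-merge every pair in the window and score it by AMI.
        best = None
        for a, b in combinations(window, 2):
            trial = [a if word_to_class[w] == b else word_to_class[w] for w in words]
            score = ami(trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        # S13: actually merge the pair that keeps AMI highest.
        _, a, b = best
        for w in vocab:
            if word_to_class[w] == b:
                word_to_class[w] = a
        window.remove(b)
        # S14/S15: pull the most frequent class still outside into the window.
        if next_outside < len(vocab):
            window.append(next_outside)
            next_outside += 1
    return word_to_class  # S16: the remaining classes


sample = "the cat sat on the mat the dog sat on the rug".split()
classes = cluster_words(sample, num_classes=3)
```

The window bounds the number of trial merges per step at C(C+1)/2 instead of M(M−1)/2, which is the point of the window processing described above.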
【００８０】
次に、図１４のステップＳ２に示すように、クラスチェーン抽出処理を行う。このクラスチェーン抽出処理では、ステップＳ１の第１のクラスタリング処理に基づいて生成されたテキストデータの単語クラスの一次元列において、所定のしきい値以上の相互情報量を有する隣接する２つの単語クラスをチェーンで結ぶことにより、チェーンで結ばれた単語クラス列の集合を抽出する。
【００８１】
図１７は、ステップＳ２のクラスチェーン抽出処理の第１実施例を示すフローチャートである。
図１７において、まず、ステップＳ２０に示すように、テキストデータの単語クラスの一次元列から、互いに隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を取り出す。
【００８２】
次に、ステップＳ２１に示すように、ステップＳ２０で取り出した２つの単語クラス（Ｃ_i、Ｃ_j）についての相互情報量ＭＩ（Ｃ_i、Ｃ_j）を（２）式により計算する。
【００８３】
次に、ステップＳ２２に示すように、ステップＳ２１で計算した相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨ以上であるかどうかを判断し、相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨ以上である場合、ステップＳ２３に進んで、ステップＳ２０で取り出した２つの単語クラス（Ｃ_i、Ｃ_j）をクラスチェーンで結んでメモリに格納し、相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨより小さい場合、ステップＳ２３をスキップする。
【００８４】
次に、ステップＳ２４に示すように、メモリに格納されているクラスチェーンで結ばれた単語クラスにおいて、単語クラスＣ_iで終了しているクラスチェーンが存在するかどうかを判断し、単語クラスＣ_iで終了しているクラスチェーンが存在する場合、ステップＳ２５に進んで、単語クラスＣ_iで終了しているクラスチェーンに単語クラスＣ_jをつなぐ。
【００８５】
一方、ステップＳ２４において、単語クラスＣ_iで終了しているクラスチェーンが存在しない場合、ステップＳ２５をスキップする。
次に、ステップＳ２６に示すように、テキストデータの単語クラスの一次元列から、互いに隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を全て取り出したかどうかを判断し、互いに隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を全て取り出した場合、クラスチェーン抽出処理を終了し、互いに隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を全て取り出していない場合、ステップＳ２０に戻って以上の処理を繰り返す。
【００８６】
図１８は、ステップＳ２のクラスチェーン抽出処理の第２実施例を示すフローチャートである。
図１８において、まず、ステップＳ２０１に示すように、テキストデータの単語クラスの一次元列から、互いに隣接する２つの単語クラス（Ｃ_i、Ｃ_j）を順次に取り出す。そして、取り出した２つの単語クラス（Ｃ_i、Ｃ_j）について、相互情報量ＭＩ（Ｃ_i、Ｃ_j）を（２）式により計算することにより、長さ２の全てのクラスチェーンをテキストデータの単語クラスの一次元列から抽出する。
【００８７】
次に、ステップＳ２０２に示すように、長さ２の全てのクラスチェーンをそれぞれオブジェクトで置き換える。ここで、オブジェクトは、上述したトークンと同じものを表しているが、長さ２のクラスチェーンに付与されたトークンを、特に、オブジェクトと呼ぶ。
【００８８】
次に、ステップＳ２０３に示すように、テキストデータのクラスの一次元列に対し、ステップＳ２０２でオブジェクトが付与された長さ２のクラスチェーンをオブジェクトで置き換え、テキストデータのクラスとオブジェクトの一次元列を生成する。
【００８９】
次に、ステップＳ２０４に示すように、テキストデータのクラスとオブジェクトの一次元列の中に存在する１つのオブジェクトを１つのクラスとみなし、２つのクラス（Ｃ_i、Ｃ_j）についての相互情報量ＭＩ（Ｃ_i、Ｃ_j）を（２）式により計算する。すなわち、テキストデータのクラスとオブジェクトの一次元列においての相互情報量ＭＩ（Ｃ_i、Ｃ_j）は、互いに隣接する１つのクラスと１つのクラスとの間で算出される場合、互いに隣接する１つのクラスと１つのオブジェクト（長さ２のクラスチェーン）との間で算出される場合、及び互いに隣接する１つのオブジェクト（長さ２のクラスチェーン）と１つのオブジェクト（長さ２のクラスチェーン）との間で算出される場合がある。
【００９０】
次に、ステップＳ２０５に示すように、ステップＳ２０４で計算した相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨ以上であるかどうかを判断し、相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨ以上である場合、ステップＳ２０６に進んで、ステップＳ２０４で取り出した互いに隣接する２つのクラス、又は互いに隣接する１つのクラスと１つのオブジェクト、又は互いに隣接する２つのオブジェクトをクラスチェーンで結び、相互情報量ＭＩ（Ｃ_i、Ｃ_j）が所定のしきい値ＴＨより小さい場合、ステップＳ２０６をスキップする。
【００９１】
図１９は、テキストデータのクラスとオブジェクトの一次元列において抽出されたクラスチェーンを示す図である。
図１９において、互いに隣接する１つのクラスと１つのクラスとの間でクラスチェーンが抽出された場合、長さ２のクラスチェーン（オブジェクト）が生成され、互いに隣接する１つのクラスと１つのオブジェクトとの間でクラスチェーンが抽出された場合、長さ３のクラスチェーンが生成され、互いに隣接する１つのオブジェクトと１つのオブジェクトとの間でクラスチェーンが抽出された場合、長さ４のクラスチェーンが生成される。
【００９２】
次に、図１８のステップＳ２０７に示すように、クラスチェーン抽出処理が所定の回数行われたかどうかを判断し、所定の回数行われていない場合は、ステップＳ２０２に戻って以上の処理を繰り返す。
【００９３】
このように、長さ２のクラスチェーンをオブジェクトに置き換えて、相互情報量ＭＩ（Ｃ_i、Ｃ_j）を算出することを繰り返すことにより、任意の長さのクラスチェーンを抽出することができる。
【００９４】
次に、図１４のステップＳ３に示すように、トークン置換処理を行う。このトークン置換処理では、ステップＳ２のクラスチェーン抽出処理で抽出された単語クラス列に固有のトークンを対応させ、この単語クラス列に属する単語列をテキストデータの単語の一次元列から検索し、テキストデータの単語列を対応するトークンで置換することにより、テキストデータについての単語とトークンとの一次元列を生成する。
【００９５】
図２０は、ステップＳ３のトークン置換処理を示すフローチャートである。
図２０において、まず、ステップＳ３０に示すように、抽出されたクラスチェーンを重複を除いて所定の規則でソートし、それぞれのクラスチェーンにトークンを対応させて、クラスチェーンに名前を付ける。ここで、クラスチェーンのソートは、例えば、ＡＳＣＩＩコード順で行う。
【００９６】
次に、ステップＳ３１に示すように、トークンに対応させたクラスチェーンを１つ取り出す。
次に、ステップＳ３２に示すように、テキストデータの単語の一次元列の中にクラスチェーンで結ばれた単語クラス列に属する単語列が存在するかどうかを判断し、クラスチェーンで結ばれた単語クラス列に属する単語列が存在する場合、ステップＳ３３に進み、テキストデータの対応する単語列を１つのトークンで置き換え、クラスチェーンで結ばれた単語クラス列に属する単語列がテキストデータの単語の一次元列の中に存在しなくなるまで以上の処理を繰り返す。
【００９７】
一方、クラスチェーンで結ばれた単語クラス列に属する単語列が存在しない場合、ステップＳ３４に進み、ステップＳ３０でトークンに対応させた全てのクラスチェーンについての連語・トークン置換処理が終了したかどうかを判断し、全てのクラスチェーンについての連語・トークン置換処理が終了してない場合、ステップＳ３１に戻って、新たなクラスチェーンを１つ取り出して、以上の処理を繰り返す。
【００９８】
次に、図１４のステップＳ４に示すように、単語・トークンクラスタリング処理を行う。この単語・トークンクラスタリング処理では、テキストデータについての単語とトークンとの一次元列において、互いに異なる単語と互いに異なるトークンとを抽出し、単語とトークンとが混在する集合を単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝に分割する第２のクラスタリング処理を行う。
【００９９】
図２１は、ステップＳ４の単語・トークンクラスタリング処理を示すフローチャートである。
図２１において、ステップＳ４０に示すように、ステップＳ３で得られたテキストデータの単語・トークンの一次元列を入力データとして、ステップＳ１の第１の単語クラスタリング処理と同一の方法でクラスタリングを行うことにより、単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝を生成する。この第２のクラスタリング処理では、単語とトークンは区別せず、トークンは１つの単語として扱われる。また、生成されたそれぞれの単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝は、その要素として単語とトークンを含んでいる。
【０１００】
次に、図１４のステップＳ５に示すように、データ出力処理を行う。このデータ出力処理では、テキストデータの単語の一次元列に存在する単語列のうち、トークンに対応するものを連語として抽出し、単語・トークンクラス｛Ｔ₁、Ｔ₂、Ｔ₃、Ｔ₄、・・・、Ｔ_D｝の中のトークンを連語で置換することにより、単語と連語とが混在する集合を単語・連語クラス｛Ｒ₁、Ｒ₂、Ｒ₃、Ｒ₄、・・・、Ｒ_D｝に分割する第３のクラスタリング処理を行う。
【０１０１】
図２２は、ステップＳ５のデータ出力処理を示すフローチャートである。
図２２において、まず、ステップＳ５０に示すように、１つの単語・トークンクラスＴ_iから１つのトークンｔ_Kを取り出す。
【０１０２】
次に、ステップＳ５１に示すように、テキストデータの単語の一次元列をスキャンし、ステップＳ５２において、ステップＳ５０で取り出したトークンｔ_Kに対応するクラスチェーンで結ばれた単語クラス列に属する単語列が存在するかどうかを判断する。そして、トークンｔ_Kに対応するクラスチェーンで結ばれた単語クラス列に属する単語列がテキストデータの単語の一次元列に存在する場合、ステップＳ５３に進んで、この単語列を連語とみなす処理を繰り返し、テキストデータの単語の一次元列をスキャンすることにより得られたこれらの連語でトークンｔ_Kを置き換える。
【０１０３】
一方、トークンｔ_Kに対応するクラスチェーンで結ばれた単語クラス列に属する単語列がテキストデータの単語の一次元列に存在しない場合、ステップＳ５４に進んで、全てのトークンについて処理が終了したかどうかを判断し、全てのトークンについて処理が終了していない場合、ステップＳ５０に進んで、以上の処理を繰り返す。
【０１０４】
例えば、ステップＳ３のトークン置換処理において、テキストデータの単語の一次元列（ｗ₁ｗ₂ｗ₃ｗ₄・・・ｗ_T）のうち、単語列（ｗ₁ｗ₂）、（ｗ₁₃ｗ₁₄）、・・・がトークンｔ₁で置換され、単語列（ｗ₄ｗ₅ｗ₆）、（ｗ₁₇ｗ₁₈）、・・・がトークンｔ₂で置換されたとすると、トークンｔ₁に対応する連語として、｛ｗ₁−ｗ₂、ｗ₁₃−ｗ₁₄、・・・｝がテキストデータから抽出され、トークンｔ₂に対応する連語として、｛ｗ₄−ｗ₅−ｗ₆、ｗ₁₇−ｗ₁₈、・・・｝がテキストデータから抽出される。
【０１０５】
１つの単語・トークンクラスＴ_iが単語の集合Ｗ_iとトークンの集合Ｊ_i＝｛ｔ_i1、ｔ_i2、・・・ｔ_in｝からなり、単語・トークンクラスＴ_iが｛Ｗ_i∪Ｊ_i｝により表され、トークンの集合Ｊ_iの中の１つのトークンｔ_imが、連語の集合Ｖ_im＝｛ｖ_im ⁽¹⁾、ｖ_im ⁽²⁾、・・・｝に逆トークン置換されたとすると、１つの単語・連語クラスＲ_iは、
【０１０６】
【数２】
$$R_i = W_i \cup \bigcup_{m=1}^{n} V_{im}$$
【０１０７】
で与えられる。
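The reverse token replacement that turns a word/token class T_i into a word/collocation class R_i can be sketched in Python as below. The function name `to_word_collocation_class` and the set and dictionary representations are assumptions for illustration. The sample data follows the FIG. 12 example.

```python
def to_word_collocation_class(token_class, token_to_collocations):
    """Build R_i from T_i: keep the words, expand each token into its collocations."""
    result = set()
    for item in token_class:
        if item in token_to_collocations:
            result.update(token_to_collocations[item])   # token t_im -> set V_im
        else:
            result.add(item)                             # plain word from W_i
    return result


token_to_collocations = {
    "t1": {"New York"},
    "t5": {"Nissan track", "Toyota wagon"},
}
r28 = to_word_collocation_class({"t1", "London"}, token_to_collocations)
r32 = to_word_collocation_class({"car", "track", "wagon", "t5"}, token_to_collocations)
```

After this step each class contains single words and collocations side by side, so a collocation such as "New York" ends up in the same class as "London".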
以上説明したように、本発明の一実施例による単語・連語分類処理装置によれば、単語と連語とを区別することなく分類することができる。
【０１０８】
次に、本発明の一実施例による音声認識装置について説明する。
図２３は、図１の単語・連語分類処理装置により得られた単語・連語分類処理結果を利用して音声認識を行う音声認識装置の構成を示すブロック図である。
【０１０９】
図２３において、所定のテキストデータ４０に含まれる単語と連語とが、単語・連語分類処理部４１により単語と連語とが混在するクラスに分類され、この分類された単語と連語とが単語・連語辞書４９に格納されている。
【０１１０】
一方、複数の単語と連語とからなる発音音声は、マイクロフォン５０によりアナログ音声信号に変換された後、Ａ／Ｄ変換器５１でデジタル音声信号に変換され、特徴抽出部５２に入力される。特徴抽出部５２は、デジタル音声信号に対して、例えば、ＬＰＣ分析を行い、ケプストラム係数や対数パワーなどの特徴パラメータを抽出する。特徴抽出部５２で抽出された特徴パラメータは、音声認識部５４に出力され、音素隠れマルコフモデルなどの言語モデル５５を参照するとともに、単語・連語辞書４９に格納されている単語と連語との分類結果を参照しながら、単語及び連語ごとに音声認識を行う。
【０１１１】
図２４は、単語・連語分類処理結果を利用して音声認識を行う場合の例を示す図である。
図２４において、「本日は晴天なり」と発声された発音音声がマイクロフォン５０に入力され、この発音音声に対して音声モデルを適用することにより、例えば、「本日は晴天なり」という認識結果と「本日は静電なり」という認識結果とが得られる。これらの音声モデルによる認識結果に対し、言語モデルによる処理を行って単語・連語辞書４９の参照を行い、「晴天なり」という連語が単語・連語辞書４９に登録されている場合、「本日は晴天なり」という認識結果に対しては高い確率が与えられ、「本日は静電なり」という認識結果に対しては低い確率が与えられる。
【０１１２】
以上説明したように、本発明の一実施例による音声認識装置によれば、単語・連語辞書４９を参照して音声認識を行うことにより、より正確な認識処理が可能になる。
【０１１３】
次に、本発明の一実施例による機械翻訳装置について説明する。
図２５は、図１の単語・連語分類処理装置により得られた単語・連語分類処理結果を利用して機械翻訳を行う機械翻訳装置の構成を示すブロック図である。
【０１１４】
図２５において、所定のテキストデータ４０に含まれる単語と連語とが、単語・連語分類処理部４１により単語と連語とが混在するクラスに分類され、この分類された単語と連語とが単語・連語辞書４９に格納されている。また、用例原文とその用例原文に対する用例訳文とが、それぞれ対応させて用例文集６０に格納されている。
【０１１５】
用例検索部６１に原文が入力されると、単語・連語辞書４９を参照しながら入力された原文の単語が属するクラスを検索し、そのクラスと同一のクラスに属する単語又は連語により構成される用例原文を用例文集６０から検索する。用例文集６０から検索された用例原文及びその用例訳文は、用例適用部６２に入力され、用例訳文の中の訳語を、入力された原文の単語に対する訳語に置換することにより、入力された原文に対する訳文を生成する。
【０１１６】
図２６は、単語・連語分類処理結果を利用して機械翻訳を行う場合の例を示す図である。
図２６において、“Ｔｏｙｏｔａ”と“Ｋｏｈｌｂｅｒｇ　Ｋｒａｖｉｓ　Ｒｏｂｅｒｔ　＆　Ｃｏ．”とは同一のクラスに属し、“ｇａｉｎｅｄ”と“ｌｏｓｔ”とは同一のクラスに属し、“２”と“１”とは同一のクラスに属し、“３０　１／４”と“８０　１／２”とは同一のクラスに属しているものとする。
【０１１７】
原文として、“Ｔｏｙｏｔａ　ｇａｉｎｅｄ　２　ｔｏ　３０　１／４．”が入力されると、用例原文として、用例文集６０から“Ｋｏｈｌｂｅｒｇ　Ｋｒａｖｉｓ　Ｒｏｂｅｒｔ　＆　Ｃｏ．　ｌｏｓｔ　１　ｔｏ　８０　１／２．”が検索されるとともに、その用例原文に対する用例訳文「Ｋｏｈｌｂｅｒｇ　Ｋｒａｖｉｓ　Ｒｏｂｅｒｔ　＆　Ｃｏ．社は、１ドル値を下げて終値８０　１／２ドルだった。」も検索される。
【０１１８】
次に、用例原文の原語“Ｋｏｈｌｂｅｒｇ　Ｋｒａｖｉｓ　Ｒｏｂｅｒｔ　＆　Ｃｏ．”と同一のクラスに属している入力原文の原語“Ｔｏｙｏｔａ”に対する訳語「トヨタ」で、用例訳文の訳語「Ｋｏｈｌｂｅｒｇ　Ｋｒａｖｉｓ　Ｒｏｂｅｒｔ　＆　Ｃｏ．社」を置き換え、用例原文の原語“ｌｏｓｔ”と同一のクラスに属している入力原文の原語“ｇａｉｎｅｄ”に対する訳語「上げて」で、用例訳文の訳語「下げて」を置き換え、用例訳文の数値“１”を“２”で置き換え、用例訳文の数値“８０　１／２”を“３０　１／４”で置き換えることにより、入力原文に対する訳文「トヨタは、２ドル値を上げて終値３０　１／４ドルだった。」を出力する。
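The class-based matching between an input sentence and an example sentence (FIG. 26) can be sketched in Python as follows. This is only an illustrative sketch. The function name `align_by_class`, the class labels, and the assumption that the sentences are already segmented into words and collocations of equal number are all hypothetical. The translation dictionary lookup itself is not shown.

```python
def align_by_class(input_units, example_units, unit_class):
    """Return (example unit, input unit) substitutions, or None if the example does not fit.

    Two units match when they are identical or belong to the same class.
    """
    if len(input_units) != len(example_units):
        return None
    substitutions = []
    for src, ex in zip(input_units, example_units):
        if src == ex:
            continue
        if src not in unit_class or unit_class.get(src) != unit_class.get(ex):
            return None
        substitutions.append((ex, src))
    return substitutions


# Class labels are invented; the memberships follow the FIG. 26 description.
unit_class = {
    "Toyota": "company", "Kohlberg Kravis Robert & Co.": "company",
    "gained": "move", "lost": "move",
    "2": "amount", "1": "amount",
    "30 1/4": "price", "80 1/2": "price",
}
input_units = ["Toyota", "gained", "2", "to", "30 1/4", "."]
example_units = ["Kohlberg Kravis Robert & Co.", "lost", "1", "to", "80 1/2", "."]
subs = align_by_class(input_units, example_units, unit_class)
```

Because the collocation "Kohlberg Kravis Robert & Co." is a single unit in the same class as the word "Toyota", the example still applies even though the two sentences differ in word count.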
【０１１９】
以上説明したように、本発明の一実施例による機械翻訳装置によれば、単語・連語辞書４９を参照して機械翻訳を行うことにより、より正確な翻訳処理が可能になる。
【０１２０】
以上、本発明の一実施例について説明したが、本発明は上述した実施例に限定されるものではなく、本発明の技術的思想の範囲内で他の様々な変更が可能である。例えば、上述した実施例では、単語・連語分類処理装置を音声認識装置及び機械翻訳装置に適用した場合について説明したが、単語・連語分類処理装置を文字認識装置に用いるようにしてもよい。また、上述した実施例では、単語と連語とを混在させて分類する場合について説明したが、連語のみを抽出し、この抽出した連語を分類するようにしてもよい。
【０１２１】
【発明の効果】
以上説明したように、本発明の単語・連語分類処理装置によれば、テキストデータに含まれる単語と連語とを一緒に分類して、単語と連語とが混在するクラスを生成することにより、単語と単語とをまとめて分類するだけでなく、単語と連語あるいは連語と連語とをまとめて分類することができ、単語と連語あるいは連語と連語との対応関係や類似度を容易に判別することができる。
【０１２２】
また、本発明の一態様によれば、テキストデータの単語クラス列にトークンを付与して単語クラス列を１つの単語とみなし、テキストデータに含まれる単語とトークンを付与された単語クラス列とを同等に取り扱ってこれらを分類してから、テキストデータに存在する単語列で対応する単語クラス列を置き換えるようにしたので、単語と連語との区別なく分類処理を行うことができるとともに、テキストデータからの連語の抽出を高速に行うことができる。
【０１２３】
また、本発明の連語抽出装置によれば、テキストデータの単語列を構成する個々の単語を、その単語が属する単語クラスで置換し、テキストデータにおいて出現する確率が所定値以上の単語クラス列を抽出してから、テキストデータに存在する連語を抽出することにより、連語を高速に抽出することができる。
【０１２４】
また、本発明の音声認識装置によれば、単語と連語あるいは連語と連語の対応関係や類似度を用いながら音声認識を行うことができ、正確な処理が可能になる。
【０１２５】
また、本発明の機械翻訳装置によれば、用例文集に格納されている用例原文の単語が連語に置き換わった原文が入力された場合においても、入力された原文に用例原文を適用して機械翻訳を行うことができ、単語と連語あるいは連語と連語の対応関係や類似度を用いた正確な機械翻訳が可能になる。
【図面の簡単な説明】
【図１】本発明の一実施例に係わる単語・連語分類処理装置の機能的な構成を示すブロック図である。
【図２】本発明の一実施例に係わる単語・連語分類処理装置の単語クラスタリング処理を説明する図である。
【図３】図１の単語分類手段の機能的な構成を示すブロック図である。
【図４】本発明の一実施例に係わる単語・連語分類処理装置の単語クラス列生成処理を説明する図である。
【図５】本発明の一実施例に係わる単語・連語分類処理装置のクラスチェーン抽出処理を説明する図である。
【図６】図１の単語クラス列抽出手段の機能的な構成を示すブロック図である。
【図７】本発明の一実施例に係わる単語・連語分類処理装置によるクラスチェーンとトークンとの関係を示す図である。
【図８】本発明の一実施例に係わる単語・連語分類処理装置のトークン置換処理を説明する図である。
【図９】本発明の一実施例に係わる単語・連語分類処理装置によるトークン置換処理の英文例を示す図である。
【図１０】図１の単語・トークン分類手段の機能的な構成を示すブロック図である。
【図１１】本発明の一実施例に係わる単語・連語分類処理装置によるトークンと連語の関係を示す図である。
【図１２】本発明の一実施例に係わる単語・連語分類処理装置による単語・連語分類処理結果を示す図である。
【図１３】本発明の一実施例に係わる単語・連語分類処理装置のシステム構成を示すブロック図である。
【図１４】本発明の一実施例に係わる単語・連語分類処理装置の単語・連語分類処理を示すフローチャートである。
【図１５】本発明の一実施例に係わる単語・連語分類処理装置のウインドウ処理を説明する図である。
【図１６】本発明の一実施例に係わる単語・連語分類処理装置の単語クラスタリング処理を示すフローチャートである。
【図１７】本発明に係わる単語・連語分類処理装置のクラスチェーン抽出処理の第１実施例を示すフローチャートである。
【図１８】本発明に係わる単語・連語分類処理装置のクラスチェーン抽出処理の第２実施例を示すフローチャートである。
【図１９】本発明に係わる単語・連語分類処理装置のクラスチェーン抽出処理の第２実施例を説明する図である。
【図２０】本発明の一実施例に係わる単語・連語分類処理装置のトークン置換処理を示すフローチャートである。
【図２１】本発明の一実施例に係わる単語・連語分類処理装置の単語・トークンクラスタリング処理を示すフローチャートである。
【図２２】本発明の一実施例に係わる単語・連語分類処理装置のデータ出力処理を示すフローチャートである。
【図２３】本発明の一実施例に係わる音声認識装置の機能的な構成を示すブロック図である。
【図２４】本発明の一実施例に係わる音声認識方法を説明する図である。
【図２５】本発明の一実施例に係わる機械翻訳装置の機能的な構成を示すブロック図である。
【図２６】本発明の一実施例に係わる機械翻訳方法を説明する図である。
【符号の説明】
１　単語分類手段
２　単語クラス列生成手段
３　単語クラス列抽出手段
４　トークン付与手段
５　単語・トークン列生成手段
６　単語・トークン分類手段
７　連語置換手段
４０　テキストデータ
４１　単語・連語分類処理部
４２、４６　メモリインターフェース
４３　ＣＰＵ
４４　ＲＯＭ
４５　ワークＲＡＭ
４７　ＲＡＭ
４８　バス
４９　単語・連語辞書
５０　マイクロフォン
５１　Ａ／Ｄ変換器
５２　特徴抽出部
５３　バッファメモリ
５４　音声認識部
５５　言語モデル
６０　用例文集
６１　用例検索部
６２　用例適用部
[0001]
[Technical field of the invention]
The present invention relates to a word / collocation classification processing method, a collocation extraction method, a word / collocation classification processing device, a speech recognition device, a machine translation device, a collocation extraction device, and a word / collocation storage medium. It is particularly suitable for automatically extracting collocations from text data and automatically classifying words and collocations.
[0002]
[Prior art]
Conventional word classification processing apparatuses include, for example, the one described in Brown, P., Della Pietra, V., deSouza, P., Lai, J., Mercer, R. (1992), "Class-Based n-gram Models of Natural Language", Computational Linguistics, Vol. 18, No. 4, pp. 467-479, which automatically classifies single words by statistically processing the single words used in text data. Speech recognition and machine translation have been performed using the resulting classification of single words.
[0003]
[Problems to be solved by the invention]
However, conventional word classification processing apparatuses cannot automatically classify words and collocations together, and therefore cannot perform speech recognition or machine translation using the correspondence or similarity between a word and a collocation, or between two collocations. As a result, speech recognition and machine translation could not be executed accurately.
[0004]
Accordingly, a first object of the present invention is to provide a word / collocation classification processing method and a word / collocation classification processing apparatus capable of automatically classifying words and collocations together.
[0005]
A second object of the present invention is to provide a collocation extracting apparatus capable of extracting collocations from a large amount of text data at high speed.
A third object of the present invention is to provide a speech recognition device that can perform accurate speech recognition by using correspondence and similarity between words and collocations or collocations and collocations.
[0006]
A fourth object of the present invention is to provide a machine translation apparatus capable of performing accurate machine translation by using correspondence and similarity between words and collocations or collocations and collocations.
[0007]
[Means for Solving the Problems]
In order to achieve the first object described above, according to the present invention, words and collocations included in text data are classified together to generate a class in which words and collocations are mixed.
[0008]
This makes it possible not only to classify words together with other words, but also to classify words together with collocations and collocations together with other collocations, so that the correspondence and similarity between a word and a collocation, or between two collocations, can be determined easily.
[0009]
Further, according to one aspect of the present invention, the word classes into which the words have been classified are mapped onto the one-dimensional sequence of words in the text data to generate a one-dimensional sequence of word classes. From this sequence, word class strings in which the degree of adhesion between every pair of adjacent word classes is equal to or greater than a predetermined value are extracted, and a token is assigned to each such word class string. The words and tokens are then classified together, after which the word class string corresponding to each token is replaced with the collocations belonging to that word class string.
[0010]
As a result, a word class string to which a token has been assigned is regarded as a single word, so the words in the text data and the tokenized word class strings can be handled equally and classified without distinguishing between words and collocations. In addition, because the word classes are mapped onto the one-dimensional sequence of words in the text data to generate a one-dimensional sequence of word classes, and collocations are extracted based on the degree of adhesion between adjacent word classes, collocations can be extracted from the text data at high speed.
[0011]
In order to achieve the second object described above, according to the present invention, the word classes into which the words have been classified are mapped onto the one-dimensional sequence of words in the text data to generate a one-dimensional sequence of word classes. From this sequence, word class strings in which the degree of adhesion between every pair of adjacent word classes is equal to or greater than a predetermined value are extracted. Collocations are then extracted by taking out, from the individual word classes constituting each word class string, the individual words that occur adjacently in the text data.
[0012]
This makes it possible to extract collocations based on word class strings. Because the number of word classes is smaller than the number of distinct words in the text data, extracting word class strings whose adjacent word classes have a degree of adhesion equal to or greater than a predetermined value from the one-dimensional sequence of word classes requires less computation and less memory than extracting word strings whose adjacent words have a degree of adhesion equal to or greater than a predetermined value from the one-dimensional sequence of words. The collocation extraction process can therefore be performed at high speed while saving memory resources. Note that a word class string may cover word strings that do not occur in the one-dimensional sequence of words in the text data, so only the individual words that occur adjacently in the text data are taken out from the individual word classes constituting the word class string and used as collocations.
[0013]
In order to achieve the third object described above, according to the present invention, uttered speech is recognized by referring to a word / collocation dictionary in which the words and collocations contained in predetermined text data are classified and stored in classes in which words and collocations are mixed.
[0014]
As a result, speech recognition can be performed using the correspondence and similarity between words and collocations or collocations and collocations, and accurate processing becomes possible.
In order to achieve the fourth object described above, according to the present invention, an example source sentence stored in an example sentence collection is matched to an input source sentence based on a word / collocation dictionary in which the words and collocations contained in predetermined text data are classified and stored in classes in which words and collocations are mixed.
[0015]
As a result, even when the input source sentence is one in which a word of an example source sentence stored in the example sentence collection has been replaced with a collocation, machine translation can be performed by applying the example sentence to the input source sentence. Accurate machine translation using the correspondence or similarity between words and collocations, or between collocations, thus becomes possible.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a word / collocation classification processing apparatus according to an embodiment of the present invention will be described with reference to the drawings. In this embodiment, the words and collocations contained in predetermined text data are classified into classes in which words and collocations are mixed.
[0017]
FIG. 1 is a block diagram showing a functional configuration of a word / collocation classification processing apparatus according to an embodiment of the present invention.
In FIG. 1, a word classification unit 1 extracts different words from a one-dimensional string of words in text data, and divides the extracted set of words to generate a word class.
[0018]
FIG. 2 illustrates the processing of the word classification means 1. From the one-dimensional sequence of T words (w₁ w₂ w₃ w₄ ... w_T) contained in the text data, V vocabulary words {v₁, v₂, v₃, v₄, ..., v_V} arranged in order of appearance frequency in the text data are generated, and an initialization class is assigned to each of the vocabulary words {v₁, v₂, v₃, v₄, ..., v_V} of this text data. Here, the number T of words is, for example, 50 million, and the number V of vocabulary words is, for example, 7000.
[0019]
In the example of FIG. 2, “the”, “a”, “in”, and “of”, which appear frequently in the text data, correspond to the vocabulary words v₁, v₂, v₃, v₄. The V vocabulary words {v₁, v₂, v₃, v₄, ..., v_V} assigned to initialization classes are divided into C word classes {C₁, C₂, C₃, C₄, ..., C_C}. Here, the number C of word classes is, for example, 500.
[0020]
FIG. 2 shows an example in which “speak”, “say”, “tell”, “talk”, ... are classified into word class C₁; “he”, “she”, “it”, ... into word class C₅; “car”, “truck”, “wagon”, ... into word class C₃₂; and “Toyota”, “Nissan”, “GM”, ... into word class C₃₀₀.
[0021]
These V vocabulary words {v₁, v₂, v₃, v₄, ..., v_V} are classified, for example, by repeatedly merging the two word classes whose merger causes the smallest decrease in the generation probability of the original text data, so that the words of the two classes come to belong to the same word class. Here, the generation probability of the original text data under a class bigram model can be expressed using the average mutual information AMI, which is given by the following equation.
[0022]
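[Expression 1]
\[ \mathrm{AMI} = \sum_{C_i, C_j} \Pr(C_i, C_j) \log \frac{\Pr(C_i, C_j)}{\Pr(C_i)\,\Pr(C_j)} \qquad (1) \]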
[0023]
Here,
Pr(C_i) is the probability that word class C_i appears in the one-dimensional sequence of word classes obtained by replacing each word of the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data with the word class to which it belongs,
Pr(C_j) is the probability that word class C_j appears in that one-dimensional sequence of word classes, and
Pr(C_i, C_j) is the probability that word class C_j appears immediately after word class C_i in that one-dimensional sequence of word classes.
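The quantity built from these three probabilities can be sketched in Python as follows. This is a minimal sketch, not the implementation described in this document: it assumes the standard class bigram form of average mutual information (the sum over adjacent class pairs of Pr(C_i, C_j) log [Pr(C_i, C_j) / (Pr(C_i) Pr(C_j))]), estimates each probability by relative frequency, and uses the natural logarithm. The function name `average_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def average_mutual_information(class_seq):
    """Estimate AMI over adjacent class pairs of a one-dimensional class sequence."""
    unigrams = Counter(class_seq)                     # counts of each class C_i
    bigrams = Counter(zip(class_seq, class_seq[1:]))  # counts of C_j right after C_i
    n_uni = len(class_seq)
    n_bi = len(class_seq) - 1
    ami = 0.0
    for (ci, cj), count in bigrams.items():
        p_ij = count / n_bi         # Pr(C_i, C_j), relative frequency (assumption)
        p_i = unigrams[ci] / n_uni  # Pr(C_i)
        p_j = unigrams[cj] / n_uni  # Pr(C_j)
        ami += p_ij * math.log(p_ij / (p_i * p_j))
    return ami


# Toy class sequence in which class 1 and class 2 strictly alternate.
print(average_mutual_information([1, 2, 1, 2]))
```

A sequence made of a single repeated class gives zero, since knowing one class then says nothing new about its neighbor.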
[0024]
FIG. 3 is a block diagram showing an example of a functional configuration of the word classification unit 1 of FIG. 1.
In FIG. 3, the initialization class setting unit 10 extracts mutually different words from the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data and assigns a unique word class {C₁, C₂, C₃, C₄, ..., C_V} to each of the extracted words {v₁, v₂, v₃, v₄, ..., v_V}.
[0025]
The temporary merge unit 11 extracts two word classes {C_i, C_j} from the set of word classes {C₁, C₂, C₃, C₄, ..., C_M} and temporarily merges them.
The average mutual information calculation unit 12 calculates, by equation (1), the average mutual information AMI for the temporarily merged set of word classes {C₁, C₂, C₃, C₄, ..., C_{M-1}}. In this case, there are M(M−1)/2 ways of extracting two word classes {C_i, C_j} from the set of M word classes {C₁, C₂, C₃, C₄, ..., C_M}, so M(M−1)/2 values of the average mutual information AMI must be calculated.
[0026]
Based on the M(M−1)/2 values of the average mutual information AMI calculated for the temporary merges, the merging unit 13 extracts from the set of word classes {C₁, C₂, C₃, C₄, ..., C_M} the two word classes {C_i, C_j} that maximize the average mutual information AMI, and actually merges them. As a result, the words belonging to either of the merged word classes {C_i, C_j} are classified into the same word class.
[0027]
The word class string generation means 2 in FIG. 1 replaces each word constituting the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data with the word class {C₁, C₂, C₃, C₄, ..., C_V} to which it belongs, thereby generating a word class string of the text data.
[0028]
FIG. 4 is a diagram illustrating an example of a one-dimensional column of word classes of text data.
In FIG. 4, it is assumed that, among the C word classes {C₁, C₂, C₃, C₄, ..., C_C}, the vocabulary words v₁, v₃₇, ... belong to word class C₁; v₃, v₁₅, ... belong to word class C₂; v₂, v₄, ... belong to word class C₃; v₇, v₉, ... belong to word class C₄; v₆, v₈, v₂₆, v_V, ... belong to word class C₅; v₆, v₂₃, ... belong to word class C₆; and v₅, v₁₀, ... belong to word class C₇.
[0029]
In addition, it is assumed that, in the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data, word w₁ is vocabulary word v₁₅, w₂ is v₂, w₃ is v₂₃, w₄ is v₄, w₅ is v₅, w₆ is v₁₅, w₇ is v₅, w₈ is v₂₆, w₉ is v₃₇, w₁₀ is v₂, ..., and w_T is v₈.
[0030]
In this case, since vocabulary word v₁₅ belongs to word class C₂, word w₁ is mapped to C₂; since v₂ belongs to C₃, w₂ is mapped to C₃; since v₂₃ belongs to C₆, w₃ is mapped to C₆; since v₄ belongs to C₃, w₄ is mapped to C₃; since v₅ belongs to C₇, w₅ is mapped to C₇; since v₁₅ belongs to C₂, w₆ is mapped to C₂; since v₅ belongs to C₇, w₇ is mapped to C₇; since v₂₆ belongs to C₅, w₈ is mapped to C₅; since v₃₇ belongs to C₁, w₉ is mapped to C₁; since v₂ belongs to C₃, w₁₀ is mapped to C₃; ...; and since v₈ belongs to C₅, w_T is mapped to C₅.
[0031]
That is, mapping the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data with the C word classes {C₁, C₂, C₃, C₄, ..., C_C} generates a one-dimensional sequence of word classes (C₂ C₃ C₆ C₃ C₇ C₂ C₇ C₅ C₁ C₃ ... C₅) in one-to-one correspondence with it.
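The mapping just described amounts to a table lookup, which the following Python sketch reproduces for the first ten words of the example. The names `class_members`, `class_of`, and `map_to_classes` are hypothetical, and only the indices given in the example are used.

```python
# Class membership as listed for FIG. 4 (class index -> vocabulary indices).
# v6 is left out here because the text lists it under two classes.
class_members = {
    1: [1, 37],
    2: [3, 15],
    3: [2, 4],
    4: [7, 9],
    5: [8, 26],
    6: [23],
    7: [5, 10],
}

# Invert the membership table: vocabulary index -> class index.
class_of = {v: c for c, members in class_members.items() for v in members}


def map_to_classes(vocab_seq, class_of):
    """Replace each vocabulary index with the index of its word class."""
    return [class_of[v] for v in vocab_seq]


# w1..w10 expressed as vocabulary indices, as in the example above.
word_seq = [15, 2, 23, 4, 5, 15, 5, 26, 37, 2]
class_seq = map_to_classes(word_seq, class_of)
print(class_seq)  # [2, 3, 6, 3, 7, 2, 7, 5, 1, 3]
```

Because every word has exactly one class, the output has the same length as the input, which is the one-to-one correspondence the text refers to.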
[0032]
The word class string extracting means 3 in FIG. 1 extracts, from the one-dimensional sequence of word classes of the text data, word class strings in which the degree of adhesion between every pair of adjacent word classes is equal to or greater than a predetermined value. Here, the degree of adhesion between word classes is an index of the strength of the connection between the word classes constituting a word class string. Measures that can express this degree of adhesion include, for example, the mutual information MI, the correlation coefficient, the cosine measure, and the likelihood ratio.
[0033]
In the following description, a case where a word class string is extracted from a one-dimensional string of word classes of text data by using the mutual information MI as the degree of adhesion between word classes is taken as an example.
[0034]
FIG. 5 is a diagram illustrating an example of a word class string extracted by the word class string extracting unit 3.
In FIG. 5, mapping the one-dimensional word sequence (w₁ w₂ w₃ w₄ w₅ w₆ w₇ ... w_T) of the text data to word classes generates a one-dimensional sequence of word classes (C₂ C₃ C₆ C₃ C₇ C₂ C₇ ... C₅) in one-to-one correspondence with it. Two adjacent word classes (C_i, C_j) are then extracted in order from this one-dimensional sequence of word classes (C₂ C₃ C₆ C₃ C₇ C₂ C₇ ... C₅), and the mutual information MI(C_i, C_j) of the two adjacent word classes (C_i, C_j) is calculated by the following equation (2).
[0035]
\[ \mathrm{MI}(C_i, C_j) = \log \frac{\Pr(C_i, C_j)}{\Pr(C_i)\,\Pr(C_j)} \qquad (2) \]
When the mutual information MI(C_i, C_j) of two adjacent word classes (C_i, C_j) is equal to or greater than a predetermined threshold TH, these two adjacent word classes (C_i, C_j) are linked to each other by a class chain.
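As an illustration of how the mutual information of adjacent word classes might be computed from a class sequence, here is a minimal Python sketch. It assumes the usual pointwise form log [Pr(C_i, C_j) / (Pr(C_i) Pr(C_j))], relative-frequency estimates of the probabilities, and the natural logarithm; none of these choices is confirmed by the text, and the function name `adjacent_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def adjacent_mutual_information(class_seq):
    """Return {(C_i, C_j): MI} for every adjacent class pair seen in class_seq."""
    unigrams = Counter(class_seq)
    pairs = list(zip(class_seq, class_seq[1:]))
    bigrams = Counter(pairs)
    n_uni, n_bi = len(class_seq), len(pairs)
    return {
        (ci, cj): math.log(
            (count / n_bi) / ((unigrams[ci] / n_uni) * (unigrams[cj] / n_uni))
        )
        for (ci, cj), count in bigrams.items()
    }


mi = adjacent_mutual_information([1, 2, 1, 2])
for pair, value in sorted(mi.items()):
    print(pair, round(value, 3))
```

In this toy sequence the pair (1, 2) occurs twice and (2, 1) once, so (1, 2) receives the larger value and would be the first to clear a threshold TH.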
[0036]
For example, in FIG. 5, the mutual information MI(C₂, C₃) of the two adjacent word classes (C₂, C₃), MI(C₃, C₆) of (C₃, C₆), MI(C₆, C₃) of (C₆, C₃), MI(C₃, C₇) of (C₃, C₇), MI(C₇, C₂) of (C₇, C₂), MI(C₂, C₇) of (C₂, C₇), ... are calculated in order by equation (2).
[0037]
If the mutual information MI(C₂, C₃), MI(C₃, C₇), MI(C₇, C₂), ... is equal to or greater than the threshold TH, and the mutual information MI(C₃, C₆), MI(C₆, C₃), MI(C₂, C₇), ... is smaller than the threshold TH, then the adjacent word class pairs (C₂, C₃), (C₃, C₇), (C₇, C₂), ... are each linked by a class chain, so that the word class strings C₂-C₃, C₃-C₇-C₂, ... are extracted.
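The FIG. 5 example can be traced with a short Python sketch that walks the class sequence once and keeps every maximal run whose adjacent pairs all reach the threshold. The mutual information values and the threshold below are invented for illustration only (the text gives none), and `extract_class_chains` is a hypothetical name.

```python
def extract_class_chains(class_seq, mi, threshold):
    """Collect maximal runs in which every adjacent pair has MI >= threshold."""
    chains = []
    current = [class_seq[0]]
    for prev, nxt in zip(class_seq, class_seq[1:]):
        if mi.get((prev, nxt), float("-inf")) >= threshold:
            current.append(nxt)         # extend the chain that ends with prev
        else:
            if len(current) >= 2:
                chains.append(current)  # close a finished chain
            current = [nxt]
    if len(current) >= 2:
        chains.append(current)
    return chains


# Class sequence C2 C3 C6 C3 C7 C2 C7 from FIG. 5.
class_seq = [2, 3, 6, 3, 7, 2, 7]
# Hypothetical MI values: three pairs above TH, three below.
mi = {(2, 3): 4.0, (3, 7): 3.5, (7, 2): 3.2, (3, 6): 0.4, (6, 3): 0.1, (2, 7): 0.7}
TH = 2.0
print(extract_class_chains(class_seq, mi, TH))  # [[2, 3], [3, 7, 2]]
```

The result corresponds to the word class strings C₂-C₃ and C₃-C₇-C₂ named in the text.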
[0038]
FIG. 6 is a block diagram showing an example of a functional configuration of the word class string extraction unit 3 of FIG. 1.
In FIG. 6, the word class extracting unit 30 extracts two adjacent word classes (C_i, C_j) at a time, in order, from the one-dimensional sequence of word classes of the text data.
[0039]
The mutual information calculation unit 31 calculates, by equation (2), the mutual information MI(C_i, C_j) of the two word classes (C_i, C_j) extracted by the word class extracting unit 30.
[0040]
The class chain coupling unit 32 links by a class chain the two word classes (C_i, C_j) whose mutual information MI(C_i, C_j) is equal to or greater than the predetermined threshold.
The token granting means 4 in FIG. 1 assigns tokens to the word class strings linked by class chains by the word class string extracting means 3.
[0041]
FIG. 7 is a diagram illustrating an example of a token granted by the token granting unit 4.
In FIG. 7, the word class strings linked by class chains are, for example, C₁-C₃, C₁-C₇, ..., C₂-C₃, C₂-C₁₁, ..., C₃₀₀-C₃₂, ..., C₁-C₃-C₈₀, C₁-C₄-C₅, C₃-C₇-C₂, ..., C₁-C₉-C₁₁-C₃₂, .... In this case, token t₁ is assigned to word class string C₁-C₃, token t₂ to C₁-C₇, ..., token t₃ to C₂-C₃, token t₄ to C₂-C₁₁, ..., token t₅ to C₃₀₀-C₃₂, ..., token t₆ to C₁-C₃-C₈₀, token t₇ to C₁-C₄-C₅, token t₈ to C₃-C₇-C₂, ..., and token t₉ to C₁-C₉-C₁₁-C₃₂.
[0042]
The word / token sequence generating means 5 in FIG. 1 replaces, in the one-dimensional word sequence (w₁ w₂ w₃ w₄ w₅ w₆ w₇ ... w_T) of the text data, each word string belonging to a word class string extracted by the word class string extracting means 3 with the corresponding token, thereby generating a one-dimensional sequence of words / tokens of the text data.
[0043]
FIG. 8 is a diagram illustrating an example of a one-dimensional sequence of words / tokens of text data. In FIG. 8, it is assumed that mapping the one-dimensional word sequence (w₁ w₂ w₃ w₄ w₅ w₆ w₇ ... w_T) of the text data to word classes has generated the one-dimensional sequence of word classes (C₂ C₃ C₆ C₃ C₇ C₂ C₇ ... C₅) in one-to-one correspondence with it, and that tokens t₃, t₈, ... have been assigned, as shown in FIG. 7, to the word class strings C₂-C₃, C₃-C₇-C₂, ... linked by class chains.
[0044]
In this case, the word string (w₁ w₂) of the text data belonging to the word class string C₂-C₃ linked by a class chain is replaced with token t₃, and the word string (w₄ w₅ w₆) of the text data belonging to the word class string C₃-C₇-C₂ linked by a class chain is replaced with token t₈, so that the one-dimensional sequence of words / tokens (t₃ w₃ t₈ w₇ ... w_T) of the text data is generated.
[0045]
FIG. 9 is a diagram illustrating an example of a one-dimensional sequence of words / tokens of text data, using English as an example.
The one-dimensional word sequence (w₁ w₂ w₃ w₄ w₅ w₆ w₇ w₈ w₉ w₁₀ w₁₁ w₁₂ w₁₃ w₁₄ w₁₅) of the text data in FIG. 9A corresponds to the sentence “He went to the apartment by bus and she went to New York by plane”, and the one-dimensional sequence of word classes in one-to-one correspondence with this word sequence is (C₅ C₉₀ C₃ C₂₁ C₁₈ C₁₀₁ C₃₂ C₂ C₅ C₉₀ C₃ C₆₃ C₂₈ C₁₀₁ C₃₂) in FIG. 9C.
[0046]
Suppose that, when the mutual information MI(C_i, C_j) of each pair of adjacent word classes (C_i, C_j) in this one-dimensional sequence of word classes (C₅ C₉₀ C₃ C₂₁ C₁₈ C₁₀₁ C₃₂ C₂ C₅ C₉₀ C₃ C₆₃ C₂₈ C₁₀₁ C₃₂) is calculated, MI(C₆₃, C₂₈) is equal to or greater than the predetermined threshold TH, while MI(C₅, C₉₀), MI(C₉₀, C₃), MI(C₃, C₂₁), MI(C₂₁, C₁₈), MI(C₁₈, C₁₀₁), MI(C₁₀₁, C₃₂), MI(C₃₂, C₂), MI(C₂, C₅), MI(C₃, C₆₃), and MI(C₂₈, C₁₀₁) are smaller than the predetermined threshold TH. The two adjacent word classes (C₆₃, C₂₈) are then linked by a class chain, as shown in FIG. 9.
[0047]
Token t₁ is assigned to the two word classes (C₆₃, C₂₈) linked by the class chain, and the one-dimensional sequence of words / tokens (w₁ w₂ w₃ w₄ w₅ w₆ w₇ w₈ w₉ w₁₀ w₁₁ t₁ w₁₄ w₁₅) is generated, as shown in FIG. 9E.
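The FIG. 9 example can be reproduced with the Python sketch below, which scans the word sequence and substitutes a token wherever the classes of consecutive words match a chain that carries a token. Preferring the longest matching chain at each position is an assumption made for this sketch, and `replace_chains_with_tokens` and `chain_tokens` are hypothetical names.

```python
def replace_chains_with_tokens(words, classes, chain_tokens):
    """Replace each word run whose class run matches a chain with its token."""
    max_len = max((len(chain) for chain in chain_tokens), default=0)
    out, i = [], 0
    while i < len(words):
        # Try the longest chain first (assumption), down to length 2.
        for n in range(min(max_len, len(words) - i), 1, -1):
            token = chain_tokens.get(tuple(classes[i:i + n]))
            if token is not None:
                out.append(token)
                i += n
                break
        else:
            out.append(words[i])  # no chain starts here: keep the word
            i += 1
    return out


words = "He went to the apartment by bus and she went to New York by plane".split()
classes = [5, 90, 3, 21, 18, 101, 32, 2, 5, 90, 3, 63, 28, 101, 32]
chain_tokens = {(63, 28): "t1"}  # class chain C63-C28 carries token t1
print(" ".join(replace_chains_with_tokens(words, classes, chain_tokens)))
# He went to the apartment by bus and she went to t1 by plane
```

Only w₁₂ and w₁₃ (“New York”) are replaced, giving the sequence with t₁ in twelfth position.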
[0048]
The word / token classifying means 6 in FIG. 1 divides the set of N words {w₁, w₂, w₃, w₄, ..., w_N} and L tokens {t₁, t₂, t₃, t₄, ..., t_L} contained in the one-dimensional sequence of words / tokens into D word / token classes {T₁, T₂, T₃, T₄, ..., T_D}.
[0049]
In the word / token classifying means 6, a word class string to which a token has been assigned is regarded as one word, and the words {w₁, w₂, w₃, w₄, ..., w_N} and the tokens {t₁, t₂, t₃, t₄, ..., t_L} can be treated equally, so they can be classified without distinguishing between words and tokens.
FIG. 10 is a block diagram showing a functional configuration of the word / token classification means 6 of FIG. 1.
[0050]
In FIG. 10, the initialization class setting unit 40 extracts mutually different words and mutually different tokens from the word / token sequence of the text data, and assigns a unique word / token class {T₁, T₂, T₃, T₄, ..., T_Y} to each of the N words {w₁, w₂, w₃, w₄, ..., w_N} having a predetermined appearance frequency and each of the L tokens {t₁, t₂, t₃, t₄, ..., t_L}.
[0051]
The temporary merge unit 41 extracts two word / token classes {T_i, T_j} from the set of word / token classes {T₁, T₂, T₃, T₄, ..., T_M} and temporarily merges them.
[0052]
The average mutual information calculation unit 42 calculates, by equation (1), the average mutual information AMI for the temporarily merged set of word / token classes {T₁, T₂, T₃, T₄, ..., T_{M-1}}. In this case, there are M(M−1)/2 ways of extracting two word / token classes {T_i, T_j} from the set of M word / token classes {T₁, T₂, T₃, T₄, ..., T_M}, so M(M−1)/2 values of the average mutual information AMI must be calculated.
[0053]
Based on the M(M−1)/2 values of the average mutual information AMI calculated for the temporary merges, the merging unit 43 extracts from the set of word / token classes {T₁, T₂, T₃, T₄, ..., T_M} the two word / token classes {T_i, T_j} that maximize the average mutual information AMI, and actually merges them. As a result, the words and tokens belonging to either of the merged word / token classes {T_i, T_j} are classified into the same word / token class.
[0054]
The collocation replacement means 7 in FIG. 1 generates a collocation by reversely replacing the tokens in the word / token class with the word string replaced by the word / token string generation means 5.
FIG. 11 is a diagram illustrating the relationship between class chains and collocations.
[0055]
In FIG. 11, for example, word class C₃₀₀ and word class C₃₂ are linked by a class chain, and token t₅ is assigned to the word class string C₃₀₀-C₃₂ linked by this class chain. It is assumed that A words such as “Toyota”, “Nissan”, “GM”, ... belong to word class C₃₀₀, and that B words such as “car”, “truck”, “wagon”, ... belong to word class C₃₂.
[0056]
In this case, as shown in FIG. 11B, collocation candidates such as “Toyota car”, “Toyota truck”, “Toyota wagon”, “Nissan car”, “Nissan truck”, “Nissan wagon”, “GM car”, “GM truck”, “GM wagon”, ... are generated, one for each of the A × B combinations of the A words belonging to word class C₃₀₀ with the B words belonging to word class C₃₂. Since these collocation candidates include collocations that do not exist in the text data, the text data is scanned and only the candidates that actually occur in it are extracted. For example, when “Nissan truck” and “Toyota wagon” exist in the text data but “Toyota car”, “Toyota truck”, “Nissan car”, “Nissan wagon”, “GM car”, “GM truck”, and “GM wagon” do not, only “Nissan truck” and “Toyota wagon” are extracted from the text data as collocations, as shown in FIG. 11.
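The candidate generation and filtering described here can be sketched in Python as below. The sample sentence is invented for illustration, and `extract_collocations` is a hypothetical name; the sketch handles only two-word class chains.

```python
from itertools import product


def extract_collocations(first_class_words, second_class_words, text_words):
    """Keep only the A x B candidates that occur as adjacent words in the text."""
    candidates = set(product(first_class_words, second_class_words))
    observed = set(zip(text_words, text_words[1:]))  # adjacent word pairs in the text
    return sorted(" ".join(pair) for pair in candidates & observed)


c300 = ["Toyota", "Nissan", "GM"]   # words of the first class in the chain
c32 = ["car", "truck", "wagon"]     # words of the second class in the chain
text = "he bought a Nissan truck and she bought a Toyota wagon".split()
print(extract_collocations(c300, c32, text))  # ['Nissan truck', 'Toyota wagon']
```

Intersecting the candidate set with the adjacent pairs seen in the text is equivalent to scanning the text once, and it discards the seven candidates that never occur.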
[0057]
FIG. 12 is a diagram showing an example of the C word classes {C₁, C₂, C₃, C₄, ..., C_C}, the D word / token classes {T₁, T₂, T₃, T₄, ..., T_D}, and the D word / collocation classes {R₁, R₂, R₃, R₄, ..., R_D}.
[0058]
In FIG. 12A, the C word classes {C₁, C₂, C₃, C₄, ..., C_C} are generated by the word classification means 1 of FIG. 1. For example, words such as “he”, “she”, “it”, ... belong to word class C₅; words such as “York”, “London”, ... belong to word class C₂₈; words such as “car”, “truck”, “wagon”, ... belong to word class C₃₂; words such as “new”, “old”, ... belong to word class C₆₃; and words such as “Toyota”, “Nissan”, “GM”, ... belong to word class C₃₀₀. In addition, it is assumed that the collocations “New York”, “Nissan truck”, and “Toyota wagon” occur many times in the text data.
[0059]
In the one-dimensional sequence of word classes obtained by mapping these C word classes {C₁, C₂, C₃, C₄, ..., C_C} one-to-one onto the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data, the word class string extracting means 3 of FIG. 1 links by a class chain word class C₆₃, to which “new” belongs, and word class C₂₈, to which “York” belongs. The word class string extracting means 3 likewise links by a class chain word class C₃₀₀, to which “Toyota” and “Nissan” belong, and word class C₃₂, to which “truck” and “wagon” belong.
[0060]
The token granting means 4 assigns token t₁ to the word class string C₆₃-C₂₈ and token t₅ to the word class string C₃₀₀-C₃₂.
The word / token string generation means 5 generates a one-dimensional sequence of words / tokens in which “New York” in the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data is replaced with token t₁, and “Nissan truck” and “Toyota wagon” in that word sequence are replaced with token t₅.
[0061]
The word / token classifying means 6 classifies the words such as “he”, “she”, “it”, “London”, “car”, “truck”, “wagon”, ... and the tokens such as “t₁”, “t₅”, ... contained in the one-dimensional sequence of words / tokens, and generates the D word / token classes {T₁, T₂, T₃, T₄, ..., T_D} of FIG. 12.
[0062]
In the word / token classes {T₁, T₂, T₃, T₄, ..., T_D}, for example, words and tokens such as “he”, “she”, “it”, ... belong to word / token class T₅; words and tokens such as “t₁”, “London”, ... belong to word / token class T₂₈; words and tokens such as “car”, “truck”, “wagon”, “t₅”, ... belong to word / token class T₃₂; words and tokens such as “new”, “old”, ... belong to word / token class T₆₃; and words and tokens such as “Toyota”, “Nissan”, “GM”, ... belong to word / token class T₃₀₀. Thus, in the word / token classes {T₁, T₂, T₃, T₄, ..., T_D}, words and tokens are mixed and classified without distinction between them.
[0063]
The collocation replacement means 7 generates the word / collocation classes {R₁, R₂, R₃, R₄, ..., R_D} of FIG. 12 by reversely replacing tokens such as “t₁” and “t₅” in the word / token classes {T₁, T₂, T₃, T₄, ..., T_D} of FIG. 12 with the collocations that exist in the one-dimensional word sequence of the text data. For example, token t₁ belonging to word / token class T₂₈ was substituted by the word / token string generation means 5 for “New York” in the one-dimensional word sequence of the text data, so replacing this token t₁ with “New York” yields word / collocation class R₂₈. Likewise, token t₅ belonging to word / token class T₃₂ was substituted by the word / token string generation means 5 for “Nissan truck” and “Toyota wagon” in the one-dimensional word sequence of the text data, so replacing this token t₅ with “Nissan truck” and “Toyota wagon” yields word / collocation class R₃₂.
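The reverse replacement can be pictured as expanding each token in a class into the collocations it stood for. The following Python sketch does this for the two classes discussed above; the dictionary layout and the name `restore_collocations` are hypothetical, and only the members named in the text are included.

```python
def restore_collocations(token_classes, token_to_collocations):
    """Expand every token in each class into the collocations it replaced."""
    restored = {}
    for index, members in token_classes.items():
        expanded = []
        for member in members:
            # Plain words map to themselves; tokens map to their collocations.
            expanded.extend(token_to_collocations.get(member, [member]))
        restored[index] = expanded
    return restored


# Word / token classes T28 and T32, keyed by class index.
token_classes = {28: ["t1", "London"], 32: ["car", "truck", "wagon", "t5"]}
# Collocations that each token replaced in the text data.
token_to_collocations = {"t1": ["New York"], "t5": ["Nissan truck", "Toyota wagon"]}

collocation_classes = restore_collocations(token_classes, token_to_collocations)
print(collocation_classes[28])  # ['New York', 'London']
print(collocation_classes[32])  # ['car', 'truck', 'wagon', 'Nissan truck', 'Toyota wagon']
```

After this step “New York” sits in the same class as “London”, which is the kind of word-to-collocation correspondence the classification is meant to expose.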
[0064]
FIG. 13 is a block diagram showing a system configuration for realizing the word / collocation classification processing apparatus of FIG. 1.
In FIG. 13, the memory interfaces 42 and 46, the CPU 43, the ROM 44, the work RAM 45, the RAM 47, the driver 71, and the communication interface 72 of the word / collocation classification processing unit 41 are connected to one another via a bus 48. When the text data 40 is input to the word / collocation classification processing unit 41, the CPU 43 processes the text data 40 according to a program stored in the ROM 44 and classifies the words and collocations of the text data 40. The classification results are stored in the word / collocation dictionary 49. The text data 40 and the classification results can also be transmitted and received through the communication interface 72 via the communication network 73.
[0065]
Further, after loading a program for performing classification processing of words and collocations from a hard disk 74, an IC memory card 75, a magnetic tape 76, a floppy disk 77 or a storage medium such as a CD-ROM or DVD-ROM into the RAM 47, This program may be executed by the CPU 43.
[0066]
Further, a program for performing word and collocation classification processing can be retrieved from the communication network 73 via the communication interface 72. The communication network 73 connected to the communication interface 72 may be, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, an analog telephone network, a digital telephone network (ISDN: Integrated Services Digital Network), a PHS (Personal Handy-phone System), or a wireless communication network such as satellite communication.
[0067]
FIG. 14 is a flowchart showing the operation of the word / collocation classification processing apparatus of FIG. 1.
In FIG. 14, first, as shown in step S1, word clustering processing is performed. In this word clustering processing, V mutually different words {v₁, v₂, v₃, v₄, ..., v_V} are extracted from the one-dimensional word sequence (w₁ w₂ w₃ w₄ ... w_T) of the text data, and a first clustering process is performed that divides the set of V words {v₁, v₂, v₃, v₄, ..., v_V} into C word classes {C₁, C₂, C₃, C₄, ..., C_C}.
[0068]
Here, if each of the V words {v₁, v₂, v₃, v₄, ..., v_V} were assigned its own word class {C₁, C₂, C₃, C₄, ..., C_V} and the merge processing were applied to all V word classes, reducing their number one at a time until C word classes {C₁, C₂, C₃, C₄, ..., C_C} remain, then for V as large as 7000 the number of calculations of the average mutual information AMI of equation (1) required for the merge processing would become enormous, which is not realistic. For this reason, window processing is performed to reduce the number of word classes subjected to the merge processing.
[0069]
FIG. 15 is a diagram illustrating window processing.
In FIG. 15A, among the V word classes {C₁, C₂, C₃, C₄, ..., C_V} assigned to the V words {v₁, v₂, v₃, v₄, ..., v_V} of the text data, the C + 1 word classes {C₁, C₂, C₃, C₄, ..., C_C, C_{C+1}} assigned to the words with the highest appearance frequency in the text data are extracted, and the merge processing is applied to these C + 1 word classes {C₁, C₂, C₃, C₄, ..., C_C, C_{C+1}}.
[0070]
Here, as shown in FIG. 15B, when the merge processing is applied to the C + 1 word classes {C₁, C₂, C₃, C₄, ..., C_C, C_{C+1}} in the window, the number of the M word classes {C₁, C₂, C₃, C₄, ..., C_M} is reduced by one to M − 1 word classes {C₁, C₂, C₃, C₄, ..., C_{M-1}}, and the number of the C + 1 word classes in the window is also reduced by one to C word classes {C₁, C₂, C₃, C₄, ..., C_C}.
[0071]
In this case, as shown in FIG. 15C, among the word classes {C_{C+1}, ..., C_{M-1}} outside the window, the word class C_{C+1} having the highest appearance frequency in the text data is moved into the window so that the number of word classes in the window is kept constant.
[0072]
When no word classes remain outside the window and the C word classes {C₁, C₂, C₃, C₄, ..., C_C} shown in FIG. 15 have been generated, the word clustering process ends.
[0073]
In the above-described embodiment, the number of word classes in the window is set to C + 1. However, this number may be any value less than V + 1 other than C + 1, and may also be changed partway through the processing.
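To make the saving concrete, the number of temporary merges in the first merge step can be counted for the example figures given earlier (V = 7000, C = 500). This is only an illustrative count; it assumes one AMI evaluation per unordered pair of classes, in line with the M(M−1)/2 pairs stated above.

```latex
\begin{align*}
\text{all } V \text{ classes:} \quad
  & \frac{V(V-1)}{2} = \frac{7000 \times 6999}{2} = 24{,}496{,}500 \\
\text{window of } C+1 \text{ classes:} \quad
  & \frac{(C+1)\,C}{2} = \frac{501 \times 500}{2} = 125{,}250
\end{align*}
```

Under these assumptions the window cuts the number of AMI evaluations in that step by a factor of nearly 200.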
[0074]
FIG. 16 is a flowchart showing the word clustering process in step S1.
In FIG. 16, first, as shown in step S10, the appearance frequencies of all V words {v₁, v₂, v₃, v₄, ..., v_V} are examined based on the text data consisting of the one-dimensional sequence of T words (w₁ w₂ w₃ w₄ ... w_T), the V words {v₁, v₂, v₃, v₄, ..., v_V} are arranged in descending order of appearance frequency, and each of these V words is assigned to one of V word classes {C₁, C₂, C₃, C₄, ..., C_V}.
[0075]
Next, as shown in step S11, among the words of the V word classes {C₁, C₂, C₃, C₄, ..., C_V}, the words of C + 1 (fewer than V) word classes are taken, starting from the word classes with the highest appearance frequency, as the words of the word classes in one window.
[0076]
Next, as shown in step S12, temporary pairs are formed for all combinations of the word classes in the window, and the average mutual information AMI obtained when each temporary pair is temporarily merged is calculated by equation (1).
[0077]
Next, as shown in step S13, among the values of the average mutual information AMI calculated for all temporary pairs, the temporary pair giving the maximum average mutual information AMI is actually merged, which reduces the number of word classes by one, and the words of the word classes in the window are updated to reflect this merge.
[0078]
Next, as shown in step S14, it is determined whether no word classes remain outside the window and C word classes remain in the window. If this condition is not satisfied, the process proceeds to step S15, where the word class having the highest appearance frequency among the word classes currently outside the window is moved into the window. The process then returns to step S12, and the number of word classes is reduced by repeating the above processing.
[0079]
On the other hand, if the condition of step S14 is satisfied, that is, no word classes remain outside the window and the number of word classes is C, the process proceeds to step S16, where the C word classes {C₁, C₂, C₃, C₄, ..., C_C} in the window are stored in the memory.
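Steps S10 to S16 can be put together in a compact Python sketch. It is an illustration under several assumptions rather than the procedure of FIG. 16 itself: the average mutual information is taken in its standard class bigram form with relative-frequency estimates, every temporary merge recomputes it from scratch (a practical system would update it incrementally), and the names `ami` and `window_clustering` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ami(class_seq):
    """Average mutual information of adjacent classes (relative-frequency estimate)."""
    uni = Counter(class_seq)
    big = Counter(zip(class_seq, class_seq[1:]))
    n_uni, n_bi = len(class_seq), len(class_seq) - 1
    return sum(
        (c / n_bi) * math.log((c / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), c in big.items()
    )


def window_clustering(words, num_classes):
    """Cluster words into num_classes classes using a window of num_classes + 1."""
    # S10: one class per distinct word, ordered by descending frequency.
    vocab = [w for w, _ in Counter(words).most_common()]
    class_of = {w: i for i, w in enumerate(vocab)}
    class_ids = list(range(len(vocab)))
    # S11: the most frequent num_classes + 1 classes form the window.
    window, outside = class_ids[:num_classes + 1], class_ids[num_classes + 1:]
    while len(window) > num_classes:
        # S12: temporarily merge every pair in the window and score it.
        best = None
        for a, b in combinations(window, 2):
            trial = [a if class_of[w] == b else class_of[w] for w in words]
            score = ami(trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        # S13: actually merge the pair with the largest AMI.
        _, a, b = best
        for w in vocab:
            if class_of[w] == b:
                class_of[w] = a
        window.remove(b)
        # S14/S15: refill the window while classes remain outside it.
        if outside:
            window.append(outside.pop(0))
    return class_of  # S16: final word -> class assignment


sample = "the cat sat on the mat the dog sat on the rug".split()
print(window_clustering(sample, 3))
```

The loop stops exactly when the window holds C classes and nothing is left outside it, which mirrors the condition tested in step S14.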
[0080]
Next, as shown in step S2 of FIG. 14, class chain extraction processing is performed. In this processing, in the one-dimensional sequence of word classes of the text data generated based on the first clustering process of step S1, every two adjacent word classes whose mutual information is equal to or greater than a predetermined threshold are linked by a chain, and the set of word class strings linked by such chains is extracted.
[0081]
FIG. 17 is a flowchart showing a first embodiment of the class chain extraction process in step S2.
In FIG. 17, first, as shown in step S20, two adjacent word classes (C_i, C_j) are extracted in order from the one-dimensional sequence of word classes of the text data.
[0082]
Next, as shown in step S21, the mutual information MI(C_i, C_j) of the two word classes (C_i, C_j) extracted in step S20 is calculated by equation (2).
[0083]
Next, as shown in step S22, it is determined whether the mutual information MI(C_i, C_j) calculated in step S21 is equal to or greater than a predetermined threshold TH. If it is, the process proceeds to step S23, where the two word classes (C_i, C_j) extracted in step S20 are linked by a class chain and stored in the memory. If the mutual information MI(C_i, C_j) is smaller than the predetermined threshold TH, step S23 is skipped.
[0084]
Next, as shown in step S24, it is determined whether the word classes linked by class chains and stored in the memory include a class chain that ends with word class C_i. If such a class chain exists, the process proceeds to step S25, where word class C_j is linked to the class chain ending with word class C_i.
[0085]
On the other hand, if it is determined in step S24 that no class chain ending with word class C_i exists, step S25 is skipped.
Next, as shown in step S26, it is determined whether all pairs of adjacent word classes (C_i, C_j) have been extracted from the one-dimensional sequence of word classes of the text data. If all pairs have been extracted, the class chain extraction process ends; otherwise, the process returns to step S20 and the above processing is repeated.
[0086]
FIG. 18 is a flowchart showing a second embodiment of the class chain extraction process in step S2.
In FIG. 18, first, as shown in step S201, two adjacent word classes (C_i, C_j) are extracted in order from the one-dimensional sequence of word classes of the text data. The mutual information MI(C_i, C_j) of each extracted pair of word classes (C_i, C_j) is then calculated by equation (2), and all class chains of length 2 are extracted from the one-dimensional sequence of word classes of the text data.
[0087]
Next, as shown in step S202, all class chains of length 2 are replaced with objects, respectively. Here, the object represents the same token as the above-described token, but the token given to the class chain of length 2 is particularly called an object.
[0088]
Next, as shown in step S203, each class chain of length 2 to which an object was assigned in step S202 is replaced with that object in the one-dimensional sequence of classes of the text data, generating a one-dimensional sequence of classes and objects of the text data.
[0089]
Next, as shown in step S204, each object in the one-dimensional sequence of classes and objects of the text data is regarded as one class, and the mutual information MI(C_i, C_j) of two adjacent classes (C_i, C_j) is calculated by equation (2). That is, MI(C_i, C_j) may be calculated between two adjacent classes, between a class and an adjacent object (a class chain of length 2), or between two adjacent objects (each a class chain of length 2).
[0090]
Next, as shown in step S205, it is determined whether the mutual information MI(C_i, C_j) is equal to or greater than a predetermined threshold value TH. If it is, the process proceeds to step S206, where the two adjacent classes extracted in step S204, the adjacent class and object, or the two adjacent objects are connected by a class chain. If MI(C_i, C_j) is smaller than TH, step S206 is skipped.
[0091]
FIG. 19 is a diagram showing a class chain extracted from a one-dimensional sequence of text data classes and objects.
In FIG. 19, when a class chain is extracted between two adjacent classes, a class chain (object) of length 2 is generated. When a class chain is extracted between a class and an adjacent object, a class chain of length 3 is generated. When a class chain is extracted between two adjacent objects, a class chain of length 4 is generated.
[0092]
Next, as shown in step S207 of FIG. 18, it is determined whether or not the class chain extraction process has been performed a predetermined number of times. If the predetermined number of times has not been performed, the process returns to step S202 and the above process is repeated.
[0093]
In this way, by repeatedly replacing class chains of length 2 with objects and recalculating the mutual information MI(C_i, C_j), a class chain of arbitrary length can be extracted.
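The repeated replacement described for FIG. 18 can be illustrated with the following Python sketch. It is only an illustration under assumptions: MI is again taken to be pointwise mutual information estimated from relative frequencies (equation (2) is not shown in this passage), objects are represented as tuples of the classes they cover, pairs are merged greedily from left to right without overlap, and the names `mi_table`, `as_chain`, and `merge_pass` are hypothetical.

```python
import math
from collections import Counter

def mi_table(seq):
    """Assumed pointwise MI for each adjacent pair of items (classes or objects)."""
    unigrams = Counter(seq)
    bigrams = Counter(zip(seq, seq[1:]))
    n_uni, n_bi = len(seq), max(len(seq) - 1, 1)
    return {
        pair: math.log2((c / n_bi) / ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni)))
        for pair, c in bigrams.items()
    }

def as_chain(item):
    """A plain class is a chain of length 1; an object is already a tuple."""
    return item if isinstance(item, tuple) else (item,)

def merge_pass(seq, threshold):
    """One round of S202 to S206: replace each adjacent pair whose MI >= TH
    with a single object covering both items."""
    mi = mi_table(seq)
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and mi[(seq[i], seq[i + 1])] >= threshold:
            out.append(as_chain(seq[i]) + as_chain(seq[i + 1]))
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out
```

Calling `merge_pass` once yields objects of length 2; calling it again on the result can join a class with an object (length 3) or two objects (length 4), which corresponds to the cases listed for FIG. 19.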
[0094]
Next, as shown in step S3 of FIG. 14, token replacement processing is performed. In this token replacement process, a unique token is associated with each word class string extracted in the class chain extraction process in step S2, word strings belonging to this word class string are searched for in the one-dimensional sequence of words of the text data, and each such word string of the text data is replaced with the corresponding token, whereby a one-dimensional sequence of words and tokens for the text data is generated.
[0095]
FIG. 20 is a flowchart showing the token replacement process in step S3.
In FIG. 20, first, as shown in step S30, duplicates are removed from the extracted class chains, the remaining class chains are sorted according to a predetermined rule, and a token is associated with each class chain to give the class chain a name. Here, the class chains are sorted, for example, in the order of ASCII codes.
[0096]
Next, as shown in step S31, one class chain corresponding to the token is taken out.
Next, as shown in step S32, it is determined whether the one-dimensional sequence of words of the text data contains a word string belonging to the word class string connected by the class chain. If such a word string exists, the process proceeds to step S33, where the corresponding word string in the text data is replaced with one token. This process is repeated until no word string belonging to the word class string connected by the class chain remains in the one-dimensional sequence of words of the text data.
[0097]
On the other hand, if there is no word string belonging to the word class string connected by the class chain, the process proceeds to step S34, where it is determined whether the token replacement process has been completed for all the class chains associated with tokens in step S30. If it has not been completed for all the class chains, the process returns to step S31, one new class chain is taken out, and the above process is repeated.
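Steps S30 to S34 can be sketched in Python as shown below. The sketch assumes that each word belongs to exactly one word class and that class chains are tuples of class names. The token format `<t1>`, the mapping `word_to_class`, and the function names are hypothetical and chosen only for illustration.

```python
def assign_tokens(chains):
    """S30: remove duplicate chains, sort them, and name each with a token."""
    return {chain: f"<t{k}>" for k, chain in enumerate(sorted(set(chains)), start=1)}

def replace_with_tokens(words, word_to_class, chain_tokens):
    """S31 to S34: for each chain, replace every word string whose class
    sequence matches the chain with the chain's token."""
    items = list(words)
    classes = [word_to_class[w] for w in words]
    collocations = {tok: [] for tok in chain_tokens.values()}
    for chain, tok in chain_tokens.items():
        n = len(chain)
        i = 0
        while i + n <= len(items):
            if tuple(classes[i:i + n]) == chain:
                # Remember the replaced word string; it is a collocation candidate.
                collocations[tok].append(tuple(items[i:i + n]))
                items[i:i + n] = [tok]
                classes[i:i + n] = [None]  # a token has no word class
            i += 1
    return items, collocations
```

Keeping the replaced word strings per token makes the later data output step a simple lookup, although the embodiment itself rescans the text data at that point.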
[0098]
Next, as shown in step S4 of FIG. 14, a word / token clustering process is performed. In this word / token clustering process, different words and different tokens are extracted from the one-dimensional sequence of words and tokens for the text data, and the set in which words and tokens are mixed is divided into word / token classes {T₁, T₂, T₃, T₄, ..., T_D}, which constitute the second clustering.
[0099]
FIG. 21 is a flowchart showing the word / token clustering process in step S4.
In FIG. 21, as shown in step S40, clustering is performed by the same method as the first word clustering process in step S1, using the one-dimensional sequence of words and tokens of the text data obtained in step S3 as input data, and the word / token classes {T₁, T₂, T₃, T₄, ..., T_D} are generated. In this second clustering process, words and tokens are not distinguished, and each token is treated as one word. Each generated word / token class in {T₁, T₂, T₃, T₄, ..., T_D} can therefore include both words and tokens as its elements.
[0100]
Next, as shown in step S5 of FIG. 14, data output processing is performed. In this data output process, among the word strings existing in the one-dimensional sequence of words of the text data, those corresponding to a token are extracted as collocations, each token in the word / token classes {T₁, T₂, T₃, T₄, ..., T_D} is replaced with its collocations, and the set of words and collocations is thereby divided into word / collocation classes {R₁, R₂, R₃, R₄, ..., R_D}, which constitute the third clustering.
[0101]
FIG. 22 is a flowchart showing the data output process in step S5.
In FIG. 22, first, as shown in step S50, one token t_k is taken out from a word / token class T_i.
[0102]
Next, as shown in step S51, the one-dimensional sequence of words of the text data is scanned, and in step S52 it is determined whether there is a word string belonging to the word class string connected by the class chain corresponding to the token t_k taken out in step S50. If such a word string is present in the one-dimensional sequence of words of the text data, the process proceeds to step S53, where this word string is regarded as a collocation. This process is repeated, and the token t_k is replaced with the collocations obtained by scanning the one-dimensional sequence of words.
[0103]
On the other hand, if no word string belonging to the word class string connected by the class chain corresponding to the token t_k is present in the one-dimensional sequence of words of the text data, the process proceeds to step S54 to determine whether the processing has been completed for all tokens. If the processing has not been completed for all tokens, the process returns to step S50, and the above processing is repeated.
[0104]
For example, suppose that in the token replacement process in step S3, the word strings (w₁w₂), (w₁₃w₁₄), ... in the one-dimensional sequence of words (w₁w₂w₃w₄ ... w_T) of the text data were replaced by the token t₁, and the word strings (w₄w₅w₆), (w₁₇w₁₈), ... were replaced by the token t₂. Then {w₁-w₂, w₁₃-w₁₄, ...} are extracted from the text data as the collocations corresponding to the token t₁, and {w₄-w₅-w₆, w₁₇-w₁₈, ...} are extracted from the text data as the collocations corresponding to the token t₂.
[0105]
One word / token class T_i consists of a set of words W_i and a set of tokens J_i = {t_i1, t_i2, ..., t_in}, so that T_i is expressed as {W_i ∪ J_i}. If each token t_im in the set of tokens J_i expands to a set of collocations V_im = {V_im⁽¹⁾, V_im⁽²⁾, ...}, then, when reverse token substitution is performed, the word / collocation class R_i is given by
[0106]
[Expression 2]
R_i = W_i ∪ V_i1 ∪ V_i2 ∪ ... ∪ V_in
[0107]
As described above, according to the word / collocation classification processing apparatus according to the embodiment of the present invention, it is possible to classify a word and a collocation without distinction.
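The reverse token substitution performed in the data output process can be sketched as follows. This is a minimal illustration, assuming that the word / token classes are given as Python sets and that the collocations for each token have already been collected. The function name `reverse_token_substitution`, the token format `<t1>`, and the hyphen used to join the words of a collocation are assumptions made for the example.

```python
def reverse_token_substitution(word_token_classes, token_collocations):
    """Turn each word / token class T_i into a word / collocation class R_i
    by replacing every token with the collocations it stands for."""
    classes = []
    for members in word_token_classes:
        expanded = set()
        for m in members:
            if m in token_collocations:
                # A token expands to all collocations recorded for it.
                expanded.update("-".join(ws) for ws in token_collocations[m])
            else:
                # A plain word is carried over unchanged.
                expanded.add(m)
        classes.append(expanded)
    return classes
```

Each resulting class contains the original words together with the collocations of every token that was a member of the class, so words and collocations end up side by side in the same class.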
[0108]
Next, a speech recognition apparatus according to an embodiment of the present invention will be described.
FIG. 23 is a block diagram illustrating a configuration of a speech recognition apparatus that performs speech recognition using a word / collocation classification processing result obtained by the word / collocation classification processing apparatus of FIG. 1.
[0109]
In FIG. 23, the words and collocations included in predetermined text data 40 are classified by the word / collocation classification processing unit 41 into classes in which words and collocations are mixed, and the classified words and collocations are stored in the word / collocation dictionary 49.
[0110]
On the other hand, uttered speech composed of a plurality of words and collocations is converted into an analog speech signal by the microphone 50, converted into a digital speech signal by the A / D converter 51, and input to the feature extraction unit 52. The feature extraction unit 52 performs, for example, LPC analysis on the digital speech signal and extracts feature parameters such as cepstrum coefficients and logarithmic power. The feature parameters extracted by the feature extraction unit 52 are output to the speech recognition unit 54, which performs speech recognition for each word and collocation while referring to the language model 55, such as a phoneme hidden Markov model, and to the classification result of words and collocations stored in the word / collocation dictionary 49.
[0111]
FIG. 24 is a diagram illustrating an example in the case where speech recognition is performed using the word / collocation classification processing result.
In FIG. 24, speech uttered as "Today is sunny" is input to the microphone 50, and applying the speech model to this speech yields, for example, the recognition results "Today is sunny" and "Today is electrostatic". These recognition results from the speech model are then processed by the language model with reference to the word / collocation dictionary 49. If the collocation "sunny weather" is registered in the word / collocation dictionary 49, a high probability is given to the recognition result "Today is sunny", and a low probability is given to the recognition result "Today is electrostatic".
[0112]
As described above, according to the speech recognition apparatus according to the embodiment of the present invention, more accurate recognition processing can be performed by referring to the word / collocation dictionary 49 and performing speech recognition.
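The effect of the word / collocation dictionary on competing recognition results can be illustrated with the following Python sketch. It is purely illustrative and is not the scoring used by the speech recognition unit 54: the additive log-score `bonus`, the function name `rescore`, and the representation of hypotheses as (word list, score) pairs are all assumptions.

```python
def rescore(hypotheses, collocation_dict, bonus=1.0):
    """Return the word list of the hypothesis with the best combined score.

    hypotheses: list of (word_list, acoustic_log_score) pairs.
    collocation_dict: set of word tuples registered as collocations.
    bonus: hypothetical log-score reward for each registered collocation found.
    """
    def count_hits(words):
        hits = 0
        for colloc in collocation_dict:
            n = len(colloc)
            hits += sum(
                1 for i in range(len(words) - n + 1)
                if tuple(words[i:i + n]) == colloc
            )
        return hits

    best = max(hypotheses, key=lambda h: h[1] + bonus * count_hits(h[0]))
    return best[0]
```

A hypothesis that is slightly worse acoustically can win once it contains a registered collocation, which mirrors the preference described for FIG. 24.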
[0113]
Next, a machine translation apparatus according to an embodiment of the present invention will be described.
FIG. 25 is a block diagram illustrating a configuration of a machine translation apparatus that performs machine translation using the word / collocation classification processing result obtained by the word / collocation classification processing apparatus of FIG. 1.
[0114]
In FIG. 25, the words and collocations included in the predetermined text data 40 are classified by the word / collocation classification processing unit 41 into classes in which words and collocations are mixed, and the classified words and collocations are stored in the word / collocation dictionary 49. In addition, example original sentences and the example translations of those sentences are stored in the example sentence collection 60 in association with each other.
[0115]
When an original sentence is input to the example search unit 61, the class to which each word of the input original sentence belongs is looked up in the word / collocation dictionary 49, and an example original sentence composed of words or collocations belonging to the same classes is retrieved from the example sentence collection 60. The retrieved example original sentence and its example translation are input to the example application unit 62, which generates a translation of the input original sentence by replacing the translated words in the example translation with the translations of the corresponding words of the input original sentence.
[0116]
FIG. 26 is a diagram illustrating an example in the case where machine translation is performed using the word / collocation classification processing result.
In FIG. 26, it is assumed that "Toyota" and "Kohlberg Kravis Robert & Co." belong to the same class, "gained" and "lost" belong to the same class, "2" and "1" belong to the same class, and "30 1/4" and "80 1/2" belong to the same class.
[0117]
When "Toyota gained 2 to 30 1/4" is input as the original sentence, "Kohlberg Kravis Robert & Co. lost 1 to 80 1/2." is retrieved from the example sentence collection 60 as the example original sentence. The example translation of this example original sentence, "Kohlberg Kravis Robert & Co. lowered its price by 1 dollar to a closing price of 80 1/2 dollars", is retrieved together with it.
[0118]
Next, the translated word "Kohlberg Kravis Robert & Co." in the example translation is replaced with the translation "Toyota" for the input word "Toyota", which belongs to the same class as the word "Kohlberg Kravis Robert & Co." of the example original sentence. The translated word "lowered" is replaced with the translation "raised" for the input word "gained", which belongs to the same class as the word "lost" of the example original sentence. The numerical value "1" in the example translation is replaced with "2", and the numerical value "80 1/2" in the example translation is replaced with "30 1/4". The resulting translation of the input original sentence is then output.
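The example search and example application described above can be sketched in Python as follows. This is a simplified illustration, not the embodiment's implementation: sentences are assumed to be pre-segmented into units (words or collocations), the example and the input are assumed to have the same number of units, and the bracketed strings in the usage data stand in for target-language words. The names `find_example`, `apply_example`, `unit_to_class`, and `translations` are hypothetical.

```python
def find_example(source_units, examples, unit_to_class):
    """Example search: return the first (example source, example translation)
    whose units match the input position by position, either literally or
    by belonging to the same class."""
    for ex_src, ex_tgt in examples:
        if len(ex_src) != len(source_units):
            continue
        if all(
            a == b or (a in unit_to_class and unit_to_class.get(a) == unit_to_class.get(b))
            for a, b in zip(source_units, ex_src)
        ):
            return ex_src, ex_tgt
    return None

def apply_example(source_units, ex_src, ex_tgt, translations):
    """Example application: swap the translation of each differing example
    unit for the translation of the corresponding input unit."""
    mapping = {
        translations[b]: translations[a]
        for a, b in zip(source_units, ex_src)
        if a != b
    }
    # Single pass over the example translation, so replacements do not chain.
    return [mapping.get(u, u) for u in ex_tgt]
```

For instance, with `unit_to_class` placing "Toyota" and "Kohlberg Kravis Robert & Co." in one class and "gained" and "lost" in another, the input units `["Toyota", "gained", "2", "to", "30 1/4"]` match the stored example, and `apply_example` rewrites the example translation unit by unit.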
[0119]
As described above, according to the machine translation apparatus according to an embodiment of the present invention, by performing machine translation with reference to the word / collocation dictionary 49, more accurate translation processing can be performed.
[0120]
Although an embodiment of the present invention has been described above, the present invention is not limited to this embodiment, and various other modifications are possible within the scope of the technical idea of the present invention. For example, in the above-described embodiments, the case where the word / collocation classification processing apparatus is applied to a speech recognition apparatus and a machine translation apparatus has been described. However, the word / collocation classification processing apparatus may also be applied to a character recognition apparatus. In addition, in the above-described embodiments, the case where words and collocations are classified has been described. However, only collocations may be extracted, and the extracted collocations may be classified.
[0121]
[Effect of the invention]
As described above, according to the word / collocation classification processing apparatus of the present invention, the words and collocations included in text data are classified together to generate classes in which words and collocations are mixed. Consequently, not only can words be classified with words, but words can also be classified together with collocations, and collocations with collocations, so that the correspondence and similarity between words and collocations, or between collocations and collocations, can be easily identified.
[0122]
According to one aspect of the present invention, a token is assigned to a word class string of the text data so that the word class string is regarded as one word, the words included in the text data and the word class strings to which tokens have been assigned are treated equally and classified, and each word class string is then replaced with the word strings existing in the text data that belong to it. Classification can therefore be performed without distinguishing between words and collocations, and collocations can be extracted from the text data at high speed.
[0123]
Further, according to the collocation extracting apparatus of the present invention, each word constituting the word string of the text data is replaced with the word class to which the word belongs, word class strings whose probability of appearing in the text data is equal to or greater than a predetermined value are extracted, and the collocations existing in the text data are then extracted from them, so that collocations can be extracted at high speed.
[0124]
Furthermore, according to the speech recognition apparatus of the present invention, speech recognition can be performed using the correspondence and similarity between words and collocations or collocations and collocations, and accurate processing is possible.
[0125]
Further, according to the machine translation device of the present invention, even when an original sentence in which a word of an example original sentence stored in the example sentence collection is replaced with a collocation is inputted, the example original sentence is applied to the inputted original sentence and machine translation is performed. Therefore, accurate machine translation using correspondence and similarity between words and collocations or collocations and collocations becomes possible.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of a word / collocation classification processing apparatus according to an embodiment of the present invention;
FIG. 2 is a diagram for explaining word clustering processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 3 is a block diagram showing a functional configuration of the word classification unit of FIG. 1;
FIG. 4 is a diagram for explaining word class string generation processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 5 is a diagram for explaining class chain extraction processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 6 is a block diagram showing a functional configuration of the word class string extraction unit in FIG. 1;
FIG. 7 is a diagram showing a relationship between a class chain and a token by a word / collocation classification processing apparatus according to an embodiment of the present invention;
FIG. 8 is a diagram for explaining token replacement processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 9 is a diagram illustrating an English example of token replacement processing by the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 10 is a block diagram showing a functional configuration of the word / token classification means of FIG. 1;
FIG. 11 is a diagram showing a relationship between tokens and collocations by the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 12 is a diagram showing a word / collocation classification processing result by the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 13 is a block diagram showing a system configuration of a word / collocation classification processing apparatus according to an embodiment of the present invention;
FIG. 14 is a flowchart showing word / collocation classification processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 15 is a diagram for explaining window processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 16 is a flowchart showing word clustering processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 17 is a flowchart showing a first embodiment of class chain extraction processing of the word / collocation classification processing apparatus according to the present invention;
FIG. 18 is a flowchart showing a second embodiment of class chain extraction processing of the word / collocation classification processing apparatus according to the present invention;
FIG. 19 is a diagram for explaining a second embodiment of class chain extraction processing of the word / collocation classification processing apparatus according to the present invention;
FIG. 20 is a flowchart showing token replacement processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 21 is a flowchart showing word / token clustering processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 22 is a flowchart showing data output processing of the word / collocation classification processing apparatus according to the embodiment of the present invention;
FIG. 23 is a block diagram showing a functional configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 24 is a diagram illustrating a speech recognition method according to an embodiment of the present invention.
FIG. 25 is a block diagram showing a functional configuration of a machine translation apparatus according to an embodiment of the present invention.
FIG. 26 is a diagram illustrating a machine translation method according to an embodiment of the present invention.
[Explanation of symbols]
1 Word classification means
2 Word class string generation means
3 Word class string extraction means
4 Token giving means
5 Word / token sequence generation means
6 Word / token classification means
7 Collocation replacing means
40 Text data
41 Word / collocation classification processing unit
42, 46 Memory interface
43 CPU
44 ROM
45 Work RAM
47 RAM
48 Bus
49 Word / collocation dictionary
50 Microphone
51 A / D converter
52 Feature extraction unit
53 Buffer memory
54 Speech recognition unit
55 Language model
60 Example sentence collection
61 Example search unit
62 Example application unit

Claims

Extracting V words different from each other from text data as a one-dimensional sequence of a plurality of words, and generating a first clustering in which the set of V words is divided into C word classes;
Extracting a set of word class strings in which the degree of adhesion between adjacent word classes is a predetermined value or more in a one-dimensional string of word classes of the text data generated based on the first clustering;
Generating a one-dimensional sequence of words and tokens for the text data by associating a unique token with each word class string, searching the text data for word strings belonging to the word class string, and replacing those word strings of the text data with the corresponding token;
Generating a second clustering in which different words and different tokens are extracted from the one-dimensional sequence of words and tokens for the text data, and a set in which the words and the tokens are mixed is divided into word / token classes; and
Generating a third clustering in which, among the word strings existing in the text data, the word strings corresponding to the tokens are extracted as collocations, the tokens in the word / token classes are replaced with the collocations, and a set in which the words and the collocations are mixed is divided into word / collocation classes; a word / collocation classification processing method comprising the above steps.

2. The word / collocation classification processing method according to claim 1, wherein the first clustering is generated based on an average mutual information amount of the word class.

The method of claim 1, wherein the second clustering is generated based on an average mutual information amount of the word / token class.

Generating a word class in which words included in text data are classified;
Mapping the word class to a one-dimensional sequence of words of the text data to generate a one-dimensional sequence of word classes;
Extracting, from the one-dimensional sequence of word classes of the text data, a word class string in which the degrees of adhesion between adjacent word classes are all equal to or greater than a predetermined value;
Classifying together the words contained in the text data and the word class string;
Extracting collocations by taking out, from the individual word classes constituting the word class string, the individual words that occur adjacently in the text data; and
Replacing the word class string with the collocations belonging to the word class string.

Generating a word class in which words included in text data are classified;
Mapping the word class to a one-dimensional sequence of words of the text data to generate a one-dimensional sequence of word classes;
Extracting, from the one-dimensional sequence of word classes of the text data, a word class string in which the degrees of adhesion between adjacent word classes are all equal to or greater than a predetermined value; and
A collocation extraction method comprising: extracting collocations by taking out, from the individual word classes constituting the word class string, the individual words that occur adjacently in the text data.

Word classification means for extracting different words from a word string of text data and dividing the extracted set of words to generate a word class;
A word class string generating means for generating a one-dimensional string of the word class of the text data by replacing individual words constituting the one-dimensional string of words of the text data with the word class to which the word belongs;
A word class string extracting means for extracting, from the one-dimensional sequence of word classes of the text data, a word class string in which the degree of adhesion between adjacent word classes is equal to or greater than a predetermined value;
Token giving means for giving a token to each word class string extracted by the word class string extracting means;
A word / token sequence generation means for generating a one-dimensional sequence of words and tokens of the text data by replacing, in the one-dimensional sequence of words of the text data, each word string belonging to a word class string extracted by the word class string extracting means with the corresponding token;
A word / token classifying means for generating word / token classes by dividing the set of mixed words and tokens included in the one-dimensional sequence of words and tokens of the text data; and
A word / collocation classification processing apparatus comprising: a collocation replacing means for generating collocations by reversely replacing each token in the word / token classes with the word strings that were replaced by the word / token sequence generation means.

The word classification means includes
An initialization class setting unit that extracts different words from a one-dimensional sequence of words of the text data and assigns a unique word class to each word having a predetermined appearance frequency;
A temporary merge unit that extracts two word classes from a set of word classes and temporarily merges them;
An average mutual information amount calculating unit for calculating an average mutual information amount for the temporarily merged word class of the text data;
The word / collocation classification processing apparatus according to claim 6, further comprising: a main merging unit that performs main merging of two word classes having the maximum average mutual information amount among the set of word classes.

The word class string extraction means includes
A word class extraction unit for sequentially extracting two adjacent word classes from a one-dimensional sequence of word classes of the text data;
A mutual information amount calculation unit that calculates the mutual information amount of the two word classes extracted by the word class extraction unit;
The word / collocation classification processing apparatus according to claim 6, further comprising a class chain coupling unit that connects two word classes having a mutual information amount equal to or greater than a predetermined threshold by a class chain.

The word / token classification means includes:
An initialization class setting unit that extracts different words and different tokens from the one-dimensional sequence of words and tokens of the text data, and assigns a unique word / token class to each word and token having a predetermined appearance frequency;
A temporary merging unit that extracts two word / token classes from the set of word / token classes and temporarily merges them;
An average mutual information amount calculating unit for calculating an average mutual information amount for the temporarily merged word / token classes of the text data;
The word / collocation classification processing apparatus according to claim 6, further comprising: a main merging unit that performs main merging of the two word / token classes having the maximum average mutual information amount among the set of word / token classes.

Collocation extraction means for generating word classes by classifying the words included in text data, mapping the word classes onto the one-dimensional sequence of words of the text data to generate a one-dimensional sequence of word classes, extracting from that sequence word class strings in which the degrees of adhesion between adjacent word classes are all equal to or greater than a predetermined value, and extracting collocations by taking out, from the individual word classes constituting each word class string, the individual words that occur adjacently in the text data; and
A word / collocation classification processing apparatus comprising: word / collocation classification means for classifying the words and collocations included in the text data together to generate classes in which words and collocations are mixed.

The word / collocation classification processing apparatus according to claim 10, wherein the class is generated based on an average mutual information amount of the class.

Word classification means for classifying words contained in text data to generate a word class;
A word class string generating means for generating a one-dimensional string of the word class of the text data by replacing individual words constituting the one-dimensional string of words of the text data with the word class to which the word belongs;
A word class string extracting means for extracting, from the one-dimensional sequence of word classes of the text data, a word class string in which the degree of adhesion between adjacent word classes is equal to or greater than a predetermined value; and
A collocation extraction apparatus comprising: collocation extraction means for extracting collocations by taking out, from the individual word classes constituting the word class string, the individual words that occur adjacently in the text data.

The collocation extracting apparatus according to claim 12, wherein the word class is generated based on an average mutual information amount of the word class.

A word / collocation dictionary that stores a classification result obtained by a word / collocation classification processing apparatus classifying the words and collocations included in predetermined text data into classes in which words and collocations are mixed, the word / collocation classification processing apparatus comprising: collocation extraction means for generating word classes by classifying the words included in the text data, mapping the word classes onto the one-dimensional sequence of words of the text data to generate a one-dimensional sequence of word classes, extracting from that sequence word class strings in which the degrees of adhesion between adjacent word classes are all equal to or greater than a predetermined value, and extracting collocations by taking out, from the individual word classes constituting each word class string, the individual words that occur adjacently in the text data; and word / collocation classification means for classifying the words and collocations together to generate classes in which words and collocations are mixed; and
A speech recognition apparatus comprising speech recognition means for recognizing pronunciation speech by referring to the word / collocation dictionary and a predetermined hidden Markov model.

A word / collocation dictionary storing the result of classifying the words and collocations included in predetermined text data into classes in which words and collocations are mixed, the classification being performed by a word / collocation classification processing device comprising: a collocation extraction unit that generates word classes by classifying the words included in the text data, maps the word classes onto the one-dimensional sequence of words of the text data to generate a one-dimensional sequence of word classes, extracts from that one-dimensional sequence of word classes the word class sequences in which the degrees of adhesion between adjacent word classes are all equal to or greater than a predetermined value, and extracts collocations by separately taking out, from the individual word classes constituting each word class sequence, the individual words that appear adjacent to one another in the text data; and a word / collocation classification means that classifies the words and the collocations together to generate classes in which words and collocations are mixed;
A collection of example sentences storing an example original text and an example translation corresponding to the example original text;
An example search means for searching the example sentence collection for an example original text composed of words or collocations belonging to the same classes as the words of the input original text;
An example application means for generating a translation of the input original text by replacing, in the example translation corresponding to the retrieved example original text, a translated word with the translation of the corresponding word of the input original text;
A machine translation device comprising:
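
The example search and example application means of this claim can be pictured with the following Python sketch. It is a simplified illustration only: it assumes that sentences are already segmented into words or collocations, that the input and the example are compared position by position, and that a bilingual `lexicon` supplies the translation of each differing item. None of these choices, nor the function names, come from the claim.

```python
from typing import Dict, List, Optional, Tuple

# (example original text, example translation), both pre-segmented.
Example = Tuple[List[str], List[str]]


def search_example(
    source: List[str], examples: List[Example], item_class: Dict[str, int]
) -> Optional[Example]:
    """Find an example whose items share classes with the input, item by item."""
    for example in examples:
        original = example[0]
        if len(original) == len(source) and all(
            a in item_class
            and b in item_class
            and item_class[a] == item_class[b]
            for a, b in zip(source, original)
        ):
            return example
    return None


def apply_example(
    source: List[str], example: Example, lexicon: Dict[str, str]
) -> List[str]:
    """Swap translations of the differing items into the example translation."""
    original, translation = example
    # Map the translation of each replaced example item to the translation
    # of the input item that takes its place (hypothetical lexicon lookup).
    swaps = {lexicon[o]: lexicon[s] for s, o in zip(source, original) if s != o}
    return [swaps.get(t, t) for t in translation]
```

Because a collocation such as "New York" and a single word such as "Boston" may belong to the same mixed class, an example containing one can be reused for an input containing the other.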

A function of extracting the distinct words from a one-dimensional sequence of words of text data and generating word classes by dividing the extracted set of words;
A function of generating a one-dimensional column of the word class of the text data by replacing individual words constituting the one-dimensional column of the words of the text data with the word class to which the word belongs;
A function of extracting a word class string in which all the adhesion levels between adjacent word classes are a predetermined value or more from a one-dimensional string of word classes of the text data;
A function of giving a token to the word class sequence;
A function of generating a one-dimensional sequence of words / tokens of the text data by replacing a word sequence belonging to the word class sequence with the tokens in a one-dimensional sequence of words of the text data;
A function of generating a word / token class by dividing a set of mixed words and tokens included in a one-dimensional sequence of words / tokens of the text data;
A computer-readable storage medium storing a program that causes a computer to execute a function of generating a collocation by reversely replacing a token in the word / token class with a word string existing in the text data.
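
The token-related functions in this claim (giving tokens, replacing word strings with tokens, and reverse replacement) can be sketched as below. This is a hedged illustration, not the stored program itself: the helper names, the `<T0>`-style token spelling, and the representation of extracted word class sequences as `(start, end)` position spans are assumptions, and the classification of the mixed words and tokens is left out.

```python
from typing import Dict, List, Tuple


def replace_with_tokens(
    words: List[str],
    word_class: Dict[str, int],
    spans: List[Tuple[int, int]],  # half-open, non-overlapping spans
) -> Tuple[List[str], Dict[str, List[Tuple[str, ...]]]]:
    """Build the word/token sequence and remember each token's word strings."""
    token_of: Dict[Tuple[int, ...], str] = {}
    members: Dict[str, List[Tuple[str, ...]]] = {}
    mixed: List[str] = []
    pos = 0
    for start, end in sorted(spans):
        mixed.extend(words[pos:start])
        # One token per distinct word class sequence.
        class_string = tuple(word_class[w] for w in words[start:end])
        token = token_of.setdefault(class_string, f"<T{len(token_of)}>")
        word_string = tuple(words[start:end])
        if word_string not in members.setdefault(token, []):
            members[token].append(word_string)
        mixed.append(token)
        pos = end
    mixed.extend(words[pos:])
    return mixed, members


def expand_tokens(
    items: List[str], members: Dict[str, List[Tuple[str, ...]]]
) -> List[str]:
    """Reverse replacement: swap each token for the word strings seen in the text."""
    expanded: List[str] = []
    for item in items:
        if item in members:
            expanded.extend(" ".join(ws) for ws in members[item])
        else:
            expanded.append(item)
    return expanded
```

In this sketch, "new york" and "san jose" share one token when their class sequences coincide, so a word / token class containing that token and the word "boston" expands into a class mixing two collocations with one word.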