JP3369127B2

JP3369127B2 - Morphological analyzer

Info

Publication number: JP3369127B2
Application number: JP22141299A
Authority: JP
Inventors: 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-08-04
Filing date: 1999-08-04
Publication date: 2003-01-20
Anticipated expiration: 2019-08-04
Also published as: JP2001051996A

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a morphological analysis method and apparatus and to a storage medium storing a morphological analysis program. More particularly, it concerns Japanese morphological analysis, which segments Japanese text into words and assigns a part of speech to each word. The method, apparatus, and stored program identify unknown words and estimate their parts of speech with high accuracy by using the types of characters that make up Japanese words and the way those character types change within a word.

【０００２】[0002]

[Prior Art] Conventional Japanese morphological analysis techniques fall into two main groups: methods based on heuristic preference rules (the longest-match method and the minimum-bunsetsu-count method) and methods based on connection costs (the minimum-connection-cost method). The heuristic methods rank candidate analyses with rules such as longest match or fewest bunsetsu (phrase units). The basis for this ranking is unclear, and the analysis accuracy is low. Methods based on connection costs, by contrast, achieve high accuracy if the connection costs are set appropriately. However, no methodology exists for setting the connection costs, so they must be determined by trial and error.

[0003] In recent years, therefore, it has become increasingly common to use as the connection cost the logarithm of a probability obtained from a statistical language model trained on a large amount of text data. This gives the ranking of candidate analyses a clear theoretical basis and also yields high accuracy in experiments (Nagata: "Stochastic Japanese Morphological Analysis Using a Forward-DP Backward-A* Algorithm", IPSJ Research Report 94-NL-101-10, pp. 73-80, 1994).
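As a minimal sketch of why log probabilities work as connection costs: if each cost is taken as the negative logarithm of a transition probability (the sign convention and the probability values below are assumptions for illustration, not taken from the patent), then the candidate with the smallest total cost is the one with the largest joint probability.

```python
import math

def connection_cost(prob):
    """Cost of one transition: the negative log of its probability."""
    return -math.log(prob)

def path_cost(probs):
    """Total cost of a candidate analysis: the sum of its transition costs."""
    return sum(connection_cost(p) for p in probs)

# Hypothetical transition probabilities for two candidate analyses.
candidate_a = [0.5, 0.1, 0.2]   # joint probability 0.01
candidate_b = [0.5, 0.4, 0.3]   # joint probability 0.06

# Minimizing summed cost selects the candidate with the higher joint probability.
best = min((candidate_a, candidate_b), key=path_cost)
```

Summing costs instead of multiplying probabilities also avoids numerical underflow on long sentences.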

[0004] In general, the statistical language model used for Japanese morphological analysis is estimated from word frequencies in Japanese text that has been segmented into words and tagged with parts of speech, either by another morphological analysis program or by hand. An important question then is how to assign a probability to a word that appears in the input text but did not appear in the training text.

[0005] Let $\langle\mathrm{UNK}\rangle$ denote the event that a word is an unknown word. If a word $w_i$ consists of a character string $c_1, \ldots, c_k$ of length $k$, then the probability $P(w_i \mid \langle\mathrm{UNK}\rangle)$ that it occurs as an unknown word can be written, without loss of generality, as the product of the probability $P(k \mid \langle\mathrm{UNK}\rangle)$ that an unknown word has length $k$ and the probability $P(c_1, \ldots, c_k \mid k, \langle\mathrm{UNK}\rangle)$ that an unknown word of length $k$ is written $c_1, \ldots, c_k$.

[0006]

$$P(w_i \mid \langle\mathrm{UNK}\rangle) = P(c_1, \ldots, c_k \mid \langle\mathrm{UNK}\rangle) = P(k \mid \langle\mathrm{UNK}\rangle)\, P(c_1, \ldots, c_k \mid k, \langle\mathrm{UNK}\rangle) \qquad (1)$$

In the following, the former factor is called the word length probability and the latter the word notation probability. For English, Brown et al. proposed the unknown word model given by the following equation (Brown et al., "An Estimate of an Upper Bound for the Entropy of English", Computational Linguistics, Vol.12, No.1, pp. 31-40, 1992).

【０００７】[0007]

【数１】 [Equation 1]
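The equation image is not reproduced in this text. Judging only from the description in the paragraph that follows (word length Poisson-distributed with mean λ, every character equally likely with probability p), a plausible reconstruction is the one below. It is offered as an assumption, not as the patent's exact typesetting.

```latex
P(c_1, \ldots, c_k \mid \langle\mathrm{UNK}\rangle) \approx \frac{\lambda^{k}}{k!}\, e^{-\lambda}\, p^{k}
```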

[0008] Here λ is the average word length in the training text, and p is the reciprocal of the number of characters in the ASCII character set. That is, the unknown word model of Brown92 assumes that word length follows a Poisson distribution whose parameter is the average word length λ and that all characters occur with equal probability. The Brown92 model has two problems: it assigns probability to words of length 0, and it does not reflect the distribution of character occurrences. The following Japanese unknown word model has therefore been proposed (Nagata: "Japanese Character Recognition Error Correction Method Using Character Similarity and a Statistical Language Model", IEICE Transactions D-II, Vol. J81-D-II, No. 11, pp. 1-12, 1998; hereinafter "Nagata 98").

【０００９】[0009]

【数２】 [Equation 2]
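The equation image is not reproduced in this text. Based only on the description in the paragraph that follows (the lower bound of the length distribution moved from 0 to 1, and the word notation probability approximated by a product of character bigrams with the markers <bow> and <eow>), a plausible reconstruction is the one below. The exact form is an assumption.

```latex
P(k \mid \langle\mathrm{UNK}\rangle) \approx \frac{(\lambda - 1)^{k-1}}{(k-1)!}\, e^{-(\lambda - 1)}

P(c_1, \ldots, c_k \mid k, \langle\mathrm{UNK}\rangle) \approx
  P(c_1 \mid \langle\mathrm{bow}\rangle)
  \prod_{i=2}^{k} P(c_i \mid c_{i-1})\;
  P(\langle\mathrm{eow}\rangle \mid c_k)
```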

[0010] Here <bow> and <eow> denote the beginning and the end of a word, respectively. The unknown word model of "Nagata 98" thus solves the problem of assigning probability to words of length 0 by moving the lower bound of the word length distribution from 0 to 1, and it reflects the distribution of character occurrences by approximating the word notation probability as a product of character bigram probabilities.
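The model described above can be sketched in a few lines. This is an illustrative reading, not the published implementation: the function names are hypothetical, the bigram probabilities are unsmoothed relative frequencies, and the length distribution is assumed to be a Poisson distribution shifted so that the minimum length is 1.

```python
import math
from collections import Counter

BOW, EOW = "<bow>", "<eow>"

def train_char_bigrams(words):
    """Count character bigrams, with word-boundary markers, over a word list."""
    bigrams, contexts = Counter(), Counter()
    for w in words:
        seq = [BOW, *w, EOW]
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def length_prob(k, mean_len):
    """Poisson distribution shifted so that the minimum word length is 1."""
    lam = mean_len - 1.0
    return lam ** (k - 1) / math.factorial(k - 1) * math.exp(-lam)

def notation_prob(word, bigrams, contexts):
    """Product of character bigram relative frequencies (no smoothing)."""
    seq = [BOW, *word, EOW]
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        if contexts[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p

def unknown_word_prob(word, mean_len, bigrams, contexts):
    """Word length probability times word notation probability."""
    return length_prob(len(word), mean_len) * notation_prob(word, bigrams, contexts)
```

For example, after `bigrams, contexts = train_char_bigrams(["ab", "ab", "ac"])`, the call `notation_prob("ab", bigrams, contexts)` multiplies the relative frequencies of the bigrams (<bow>, a), (a, b), and (b, <eow>).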

【００１１】[0011]

[Problems to be Solved by the Invention] However, with the unknown word model of "Nagata 98", long unknown words, particularly transliterated foreign words, tend to be over-segmented: they are either split into a dictionary word plus a remainder or identified as several unknown words. For example, if the word "ペンシルバニア" (Pennsylvania) is not registered in the dictionary but the word "ペン" (pen) is, the prefix "ペン" happens to match a dictionary word, and "ペンシルバニア" is split into "ペン" and "シルバニア". This arises because the average word length of Japanese as a whole is about 2 characters, whereas the average length of foreign words written in katakana is about 4 characters.
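A short numerical sketch shows how strongly the average word length affects a long katakana word. It assumes the shifted Poisson length distribution described for "Nagata 98" and uses the rounded averages from the text (2 and 4 characters); the function name is hypothetical.

```python
import math

def shifted_poisson(k, mean_len):
    """P(length = k) for a Poisson distribution shifted to start at length 1."""
    lam = mean_len - 1.0
    return lam ** (k - 1) / math.factorial(k - 1) * math.exp(-lam)

word = "ペンシルバニア"
k = len(word)

p_global = shifted_poisson(k, 2.0)     # average over all Japanese words
p_katakana = shifted_poisson(k, 4.0)   # average over katakana loanwords
ratio = p_katakana / p_global
```

With a global average of 2 characters, a 7-character word receives a length probability roughly two orders of magnitude smaller than with a katakana-specific average of 4, which pushes the analyzer toward splitting the word.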

[0012] Furthermore, the unknown word model of "Nagata 98" is a model for word segmentation (identification of unknown words) and does not take the part of speech of an unknown word into account. This is because the model was designed for error correction in character recognition. Morphological analysis for speech synthesis or information retrieval, however, requires estimating the part of speech of unknown words.

[0013] The present invention has been made in view of the above points. Its object is to provide a morphological analysis method and apparatus, and a storage medium storing a morphological analysis program, that solve the problem of over-segmentation of unknown words in conventional Japanese morphological analysis using a statistical language model and that can additionally estimate the part of speech of unknown words.

【００１４】[0014]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention is a morphological analysis method for Japanese that proceeds as follows.

Words that match substrings of the input text are retrieved from a word dictionary in a database and generated as word candidates (step 1).

From the substrings of the input text that do not match the word dictionary, those that may be unknown words are identified as unknown word candidates: a word type definition table, in which word types are defined on the basis of the types of characters that make up a word and the changes between them, is consulted; each character string is classified into one of the word types; and unknown word candidates are identified for each classified word type (step 2).

The word occurrence probability by part of speech of each unknown word candidate is estimated with an unknown word model (step 3). This model obtains the word type occurrence probability for each part of speech by referring to a word type frequency table in which word type occurrence frequencies by part of speech are defined. It obtains the word length probability of an arbitrary length for each part of speech and word type by referring to an average word length table, in which average word lengths by part of speech and word type are defined, and approximating the word length distribution by a Poisson distribution with that average. It obtains the word notation probability of an arbitrary character string for each part of speech, word type, and word length by referring to a character n-gram frequency table in which character n-gram frequencies by part of speech and word type are defined.

Finally, for all combinations of word candidates and unknown word candidates, dynamic programming is used to find the word sequence with the maximum joint probability, using the word n-gram probabilities obtained by referring to a word n-gram frequency table and the word occurrence probabilities by part of speech (step 4).
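One way to read step 3 is as a product of three factors for each part of speech and word type. The sketch below follows that reading under stated assumptions: all names and table layouts are hypothetical, the Poisson distribution is shifted to start at length 1 (carried over from "Nagata 98", not stated in this paragraph), and the word notation probability is passed in as a function.

```python
import math

def word_type_prob(word_type, pos, type_freq):
    """P(word type | part of speech) from a word type frequency table."""
    total = sum(type_freq[pos].values())
    return type_freq[pos].get(word_type, 0) / total

def word_length_prob(k, pos, word_type, mean_len):
    """Poisson length probability using the average for this (pos, word type)."""
    lam = mean_len[(pos, word_type)] - 1.0
    return lam ** (k - 1) / math.factorial(k - 1) * math.exp(-lam)

def unknown_word_prob(word, pos, word_type, type_freq, mean_len, notation_prob):
    """Word occurrence probability by part of speech for an unknown word candidate."""
    return (word_type_prob(word_type, pos, type_freq)
            * word_length_prob(len(word), pos, word_type, mean_len)
            * notation_prob(word, pos, word_type))

# Hypothetical toy tables, for illustration only.
type_freq = {"noun": {"katakana": 30, "kanji": 70}}
mean_len = {("noun", "katakana"): 4.0}

def toy_notation_prob(word, pos, word_type):
    return 0.5 ** len(word)   # placeholder for a character n-gram model

p = unknown_word_prob("abc", "noun", "katakana", type_freq, mean_len, toy_notation_prob)
```

Because the average word length is looked up per part of speech and word type, a katakana noun is scored against the katakana average rather than the global one.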

[0015] When obtaining the word notation probability, the present invention uses one of two methods. In the first, the character n-gram frequency table is consulted, and the character n-gram probability by part of speech and word type is obtained by linear interpolation from lower-order character n-gram frequencies by part of speech and word type and from character n-gram frequencies of the same or lower order that do not distinguish part of speech or word type. In the second, the word notation probability obtained from the character n-gram frequencies is normalized by the sum of the word notation probabilities assigned to all character strings of the same length.
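The first method, linear interpolation, can be sketched as follows for character bigrams. The class and function names are hypothetical, and the interpolation weights are illustrative; how the weights are estimated is not covered here.

```python
from collections import Counter

def rel_freq(num, den):
    return num / den if den else 0.0

class CharCounts:
    """Character unigram and bigram counts collected from a list of words."""
    def __init__(self, words):
        self.unigram, self.bigram, self.context = Counter(), Counter(), Counter()
        for w in words:
            seq = ["<bow>", *w, "<eow>"]
            self.unigram.update(seq[1:])
            for prev, cur in zip(seq, seq[1:]):
                self.bigram[(prev, cur)] += 1
                self.context[prev] += 1
        self.total = sum(self.unigram.values())

    def p_bigram(self, prev, cur):
        return rel_freq(self.bigram[(prev, cur)], self.context[prev])

    def p_unigram(self, cur):
        return rel_freq(self.unigram[cur], self.total)

def interpolated_prob(prev, cur, specific, general, weights):
    """Mix class-specific and class-independent estimates; weights sum to 1."""
    l1, l2, l3, l4 = weights
    return (l1 * specific.p_bigram(prev, cur)    # bigram for this POS and word type
            + l2 * specific.p_unigram(cur)       # lower order, same class
            + l3 * general.p_bigram(prev, cur)   # same order, class ignored
            + l4 * general.p_unigram(cur))       # lower order, class ignored

specific = CharCounts(["ab"])          # words of one POS and word type
general = CharCounts(["ab", "ac"])     # all words
p = interpolated_prob("a", "c", specific, general, (0.4, 0.1, 0.4, 0.1))
```

Here the bigram (a, c) never occurs in the class-specific data, yet it still receives a nonzero probability from the class-independent counts.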

[0016] FIG. 2 is a block diagram showing the principle of the present invention. The present invention is a morphological analysis apparatus for Japanese, comprising the following.

Word dictionary matching means 1 retrieves from a word dictionary the words that match substrings of the input text and generates them as word candidates.

An unknown word model consists of four means. Word type determination means 6 refers to a word type definition table, in which word types are defined on the basis of the types of characters that make up a word and the changes between them, classifies an arbitrary character string into one of the word types, and determines the classified word type. Word type probability estimation means 7 refers to a word type frequency table, in which word type occurrence frequencies by part of speech are defined, and obtains the word type occurrence probability for each part of speech. Word length probability estimation means 8 refers to an average word length table, in which average word lengths by part of speech and word type are defined, and obtains the word length probability of an arbitrary length for each part of speech and word type by approximating the word length distribution by a Poisson distribution with that average. Word notation probability estimation means 9 refers to a character n-gram frequency table, in which character n-gram frequencies by part of speech and word type are defined, and obtains the word notation probability of an arbitrary character string for each part of speech, word type, and word length.

Unknown word candidate identification means 2 uses the word type determination means 6 of the unknown word model to select as unknown word candidates those substrings of the input text that did not match the word dictionary in the word dictionary matching means 1 and that may be unknown words.

Unknown word candidate probability estimation means 3 uses the word type probability estimation means 7, the word length probability estimation means 8, and the word notation probability estimation means 9 of the unknown word model to estimate the word occurrence probability by part of speech of each unknown word candidate.

Optimum word string search means 4 finds, over all combinations of the word candidates obtained by the word dictionary matching means 1 and the unknown word candidates obtained by the unknown word candidate identification means 2, the word sequence with the maximum joint probability, using the word n-gram probabilities obtained by referring to a word n-gram frequency table in which word n-gram frequencies are defined and the word occurrence probabilities by part of speech obtained by the unknown word candidate probability estimation means 3.

[0017] In the word notation probability estimation means 9, the present invention uses one of two methods. In the first, the character n-gram frequency table is consulted, and the character n-gram probability by part of speech and word type is obtained by linear interpolation from lower-order character n-gram frequencies by part of speech and word type and from character n-gram frequencies of the same or lower order that do not distinguish part of speech or word type. In the second, the word notation probability obtained from the character n-gram frequencies is normalized by the sum of the word notation probabilities assigned to all character strings of the same length.

[0018] The present invention is also a storage medium storing a morphological analysis program for Japanese, the program comprising the following steps.

A word dictionary matching step retrieves from a word dictionary in a database the words that match substrings of the input text and generates them as word candidates.

An unknown word candidate identification step takes the substrings of the input text that do not match the word dictionary and may be unknown words, refers to a word type definition table in which word types are defined on the basis of the types of characters that make up a word and the changes between them, classifies each character string into one of the word types, and identifies unknown word candidates for each classified word type.

An unknown word candidate probability estimation step estimates the word occurrence probability by part of speech of each unknown word candidate using an unknown word model. The model obtains the word type occurrence probability for each part of speech by referring to a word type frequency table in which word type occurrence frequencies by part of speech are defined. It obtains the word length probability of an arbitrary length for each part of speech and word type by referring to an average word length table, in which average word lengths by part of speech and word type are defined, and approximating the word length distribution by a Poisson distribution with that average. It obtains the word notation probability of an arbitrary character string for each part of speech, word type, and word length by referring to a character n-gram frequency table in which character n-gram frequencies by part of speech and word type are defined.

An optimum word string search step uses dynamic programming to find, over all combinations of word candidates and unknown word candidates, the word sequence with the maximum joint probability, using the word n-gram probabilities obtained by referring to a word n-gram frequency table and the word occurrence probabilities by part of speech.

[0019] In the word notation probability estimation step, the present invention uses one of two methods. In the first, the character n-gram frequency table is consulted, and the character n-gram probability by part of speech and word type is obtained by linear interpolation from lower-order character n-gram frequencies by part of speech and word type and from character n-gram frequencies of the same or lower order that do not distinguish part of speech or word type. In the second, the word notation probability obtained from the character n-gram frequencies is normalized by the sum of the word notation probabilities assigned to all character strings of the same length.

【００２０】[0020]

【００２１】[0021]

【００２２】[0022]

【００２３】[0023]

【００２４】[0024]

【００２５】[0025]

[Embodiments of the Invention] FIG. 3 shows the configuration of the Japanese morphological analysis apparatus of the present invention. The apparatus shown in the figure consists of a word dictionary matching unit 1, an unknown word candidate identification unit 2, an unknown word candidate probability estimation unit 3, an optimum word string search unit 4, a word dictionary 5, a word type determination unit 6, a word type probability estimation unit 7, a word length probability estimation unit 8, a word notation probability estimation unit 9, a word bigram probability estimation unit 10, a word type definition table 11, a word type frequency table 12, an average word length table 13, a character bigram frequency table 14, a word bigram frequency table 15, a word dictionary creation unit 16, a word type frequency calculation unit 17, an average word length calculation unit 18, a character bigram calculation unit 19, and a word bigram calculation unit 20.

[0026] Although the claims above use n-grams as the word segmentation model, the following description uses word bigrams; the invention is not limited to them. Of the components listed above, the word dictionary matching unit 1, the unknown word candidate identification unit 2, the unknown word candidate probability estimation unit 3, the optimum word string search unit 4, the word type determination unit 6, the word type probability estimation unit 7, the word length probability estimation unit 8, the word notation probability estimation unit 9, and the word bigram probability estimation unit 10 perform morphological analysis of the input text.

[0027] The word dictionary 5, the word type definition table 11, the word type frequency table 12, the average word length table 13, the character bigram frequency table 14, and the word bigram frequency table 15 constitute the statistical language model used in morphological analysis. The word dictionary creation unit 16, the word type frequency calculation unit 17, the average word length calculation unit 18, the character bigram calculation unit 19, and the word bigram calculation unit 20 estimate the parameters of the statistical language model from the training text.

[0028] In the above configuration, morphological analysis of an input text proceeds as follows. First, the word dictionary matching unit 1 searches the word dictionary 5 for words that match substrings of the input text. Next, the unknown word candidate identification unit 2 selects, from the substrings of the input text that did not match the word dictionary 5, those that could be unknown words as unknown word candidates. The word type determination unit 6 determines the word type of each unknown word candidate on the basis of the word type definition table 11. The unknown word candidate probability estimation unit 3 then obtains the word occurrence probability by part of speech of the unknown word from three quantities: the word type probability by part of speech obtained by the word type probability estimation unit 7, the word length probability by part of speech and word type obtained by the word length probability estimation unit 8, and the word notation probability by part of speech, word type, and word length obtained by the word notation probability estimation unit 9. In doing so, the word type probability estimation unit 7, the word length probability estimation unit 8, and the word notation probability estimation unit 9 use the word type frequency table 12, the average word length table 13, and the character bigram frequency table 14, respectively.

[0029] The optimum word string search unit 4 considers all combinations of the word candidates obtained by the word dictionary matching unit 1 and the unknown word candidates obtained by the unknown word candidate identification unit 2. Using the word bigram probabilities that the word bigram probability estimation unit 10 obtains from the word bigram frequency table 15 and the word occurrence probabilities by part of speech of unknown words obtained by the unknown word candidate probability estimation unit 3, it finds the word sequence with the maximum joint probability and outputs it as the morphological analysis result. Although the word bigram probability estimation unit 10 uses the word bigram frequency table 15 here, various word-based language models, such as word n-grams, can be used as the word segmentation model.
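The search for the word sequence with the maximum joint probability can be sketched as a Viterbi-style dynamic program over a word lattice. This is an illustrative sketch, not the patent's implementation: the candidate format, the token label `<UNK-noun>`, and all probability values are assumptions. Each candidate carries an extra factor, which is 1.0 for dictionary words and the unknown word probability for unknown word candidates.

```python
import math
from collections import defaultdict

def viterbi(n, candidates, bigram_prob):
    """Find the best path over candidates (start, end, token, extra_prob) covering [0, n)."""
    by_start = defaultdict(list)
    for cand in candidates:
        by_start[cand[0]].append(cand)
    # best[(end, token)] = (log probability, previous state, candidate span)
    best = {(0, "<bos>"): (0.0, None, None)}
    for pos in range(n):
        states = [(s, v) for s, v in best.items() if s[0] == pos]
        for (_, prev_tok), (logp, _, _) in states:
            for start, end, tok, extra in by_start[pos]:
                p = bigram_prob(prev_tok, tok) * extra
                if p <= 0.0:
                    continue
                score = logp + math.log(p)
                key = (end, tok)
                if key not in best or score > best[key][0]:
                    best[key] = (score, (pos, prev_tok), (start, end, tok))
    finals = [(v[0], k) for k, v in best.items() if k[0] == n]
    if not finals:
        return None
    score, state = max(finals)
    path = []
    while best[state][1] is not None:   # follow back-pointers to the start
        path.append(best[state][2])
        state = best[state][1]
    return score, path[::-1]

# Hypothetical lattice for the 7-character text "ペンシルバニア".
table = {("<bos>", "ペン/noun"): 0.01,
         ("ペン/noun", "<UNK-noun>"): 0.001,
         ("<bos>", "<UNK-noun>"): 0.005}
cands = [(0, 2, "ペン/noun", 1.0),      # dictionary word
         (2, 7, "<UNK-noun>", 1e-6),    # unknown "シルバニア"
         (0, 7, "<UNK-noun>", 1e-5)]    # unknown "ペンシルバニア"
score, path = viterbi(7, cands, lambda a, b: table.get((a, b), 0.0))
```

With these illustrative numbers the single unknown word spanning the whole text wins over the split into "ペン" plus an unknown remainder, which is the behavior the invention aims for.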

[0030] When the parameters of the statistical language model are estimated from the training text, the word dictionary creation unit 16, the word type frequency calculation unit 17, the average word length calculation unit 18, the character bigram calculation unit 19, and the word bigram calculation unit 20 store the parameters in the word dictionary 5, the word type definition table 11, the word type frequency table 12, the average word length table 13, the character bigram frequency table 14, and the word bigram frequency table 15.

[0031] Next, the operation of the above configuration is described. FIG. 4 is a flowchart for explaining the morphological analysis processing of the present invention.
Step 101) Text is input to the word dictionary matching unit 1.
Step 102) The word dictionary matching unit 1 searches the word dictionary 5 for words that match substrings of the input text.

[0032] Step 103) The unknown word candidate identification unit 2 selects, from the substrings of the input text that did not match the word dictionary 5, those that could be unknown words as unknown word candidates.
Step 104) The word type determination unit 6 refers to the word type definition table 11 and determines the word type of each unknown word candidate selected in step 103.

[0033] Step 105) The word type probability estimation unit 7 refers to the word type frequency table 12, estimates the word type probability by part of speech, and passes it to the unknown word candidate probability estimation unit 3.
Step 106) The word length probability estimation unit 8 refers to the average word length table 13, estimates the word length probability by part of speech and word type, and passes it to the unknown word candidate probability estimation unit 3.
Step 107) The word notation probability estimation unit 9 refers to the character bigram frequency table 14, estimates the word notation probability by part of speech, word type, and word length, and passes it to the unknown word candidate probability estimation unit 3.

[0034] Step 108) The unknown word candidate probability estimation unit 3 obtains the word occurrence probability by part of speech of the unknown word from the word type probability by part of speech received from the word type probability estimation unit 7, the word length probability by part of speech and word type received from the word length probability estimation unit 8, and the word notation probability by part of speech, word type, and word length received from the word notation probability estimation unit 9.
Step 109) The word bigram probability estimation unit 10 obtains the word bigram probabilities using the word bigram frequency table 15. The optimum word string search unit 4 then considers all combinations of the word candidates obtained by the word dictionary matching unit 1 and the unknown word candidates obtained from the unknown word candidate identification unit 2. Using the word bigram probabilities from the word bigram probability estimation unit 10 and the word occurrence probabilities by part of speech of unknown words from the unknown word candidate probability estimation unit 3, it finds the word sequence with the maximum joint probability and outputs it as the morphological analysis result.

【００３５】[0035]

[Examples] Examples of the present invention are described below with reference to the drawings. Following the procedure for morphological analysis of an input text in FIG. 4 and based on the configuration in FIG. 3, the description covers, in order: the word dictionary 5 and the word dictionary matching method; the definition of word types and the unknown word candidate identification method; the method of estimating the occurrence probability by part of speech of unknown word candidates; the word bigram probability estimation method; and the optimum word string search method.

[0036] In the following description, the training text is assumed to have been segmented into words and tagged with parts of speech in advance, either by hand or by another morphological analysis program.
(1) Word dictionary 5 and word dictionary matching: The word dictionary creation unit 16 creates the word dictionary 5 from the list of words whose frequency of occurrence in the training text exceeds a certain threshold. A word is defined here as a pair of notation and part of speech; entries with the same notation but different parts of speech are treated as different words. In this example the frequency threshold is 1.
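The dictionary construction described in (1) can be sketched as a frequency count over (notation, part of speech) pairs. The function name and the input format are hypothetical, and the comparison `n > threshold` follows the literal wording "exceeds a certain threshold"; whether the boundary is inclusive is an assumption.

```python
from collections import Counter

def build_word_dictionary(tagged_sentences, threshold=1):
    """Keep (notation, part of speech) pairs whose frequency exceeds the threshold."""
    counts = Counter((surface, pos)
                     for sentence in tagged_sentences
                     for surface, pos in sentence)
    return {entry: n for entry, n in counts.items() if n > threshold}

# Hypothetical training text, already segmented and tagged.
training = [[("日本", "noun"), ("の", "particle")],
            [("日本", "noun"), ("の", "noun")]]
dictionary = build_word_dictionary(training, threshold=1)
```

In this toy data the two entries for "の" have different parts of speech, so they are counted as different words and neither exceeds the threshold.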

【００３７】[0037]

The word dictionary matching unit 1 enumerates the words in the word dictionary that match substrings of the input text. For this purpose, the word dictionary 5 uses a data structure called a "trie", which merges the common prefixes of character strings.

(2) Definition of word types and unknown word candidate identification process: Modern Japanese orthography uses at least five kinds of characters (kanji, hiragana, katakana, the alphabet, and Arabic numerals) in addition to symbols such as punctuation marks. Kanji are used to write loanwords of Chinese origin (kango) and, accompanied by okurigana, Japanese words that are semantically equivalent to Chinese words. Hiragana are used to write function words such as particles and inflectional endings, and katakana are used to transcribe the pronunciation of loanwords from Western languages. The alphabet is used to write Western words and acronyms, and Arabic numerals are used to write numbers.
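The trie lookup performed by the word dictionary matching unit 1 can be sketched as below. The class `Trie`, the method `common_prefix_search`, and the helper `match_all` are hypothetical names chosen for this illustration, and the four dictionary entries are toy data.

```python
class Trie:
    """Character trie; words sharing a prefix share the same path."""

    def __init__(self):
        self.children = {}
        self.entries = []  # parts of speech of the words that end at this node

    def insert(self, notation, pos):
        node = self
        for ch in notation:
            node = node.children.setdefault(ch, Trie())
        node.entries.append(pos)

    def common_prefix_search(self, text, start):
        """Yield (end, pos) for every dictionary word beginning at text[start]."""
        node = self
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                return
            for pos in node.entries:
                yield end + 1, pos

def match_all(trie, text):
    """Enumerate every dictionary word that matches a substring of `text`."""
    return [(s, e, text[s:e], pos)
            for s in range(len(text))
            for e, pos in trie.common_prefix_search(text, s)]

trie = Trie()
for notation, pos in [("ペン", "noun"), ("ペンシル", "noun"),
                      ("大学", "noun"), ("は", "particle")]:
    trie.insert(notation, pos)
matches = match_all(trie, "ペンシルバニア大学は")
```

A single walk from each start position returns all dictionary words that begin there, which is why the merged-prefix structure suits this enumeration.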

【００３８】[0038]

FIG. 5 shows the result of examining, for words that occur exactly once in the EDR corpus, the types of characters that make up each word and the transitions between them. The EDR corpus is a representative Japanese text corpus collected from newspapers, magazines, textbooks, and similar sources. It is generally said that the properties of words that appear only once in a learning text are close to the properties of unknown words. According to the figure, words consisting of a single character type (kanji, katakana, hiragana, numerals, or the alphabet) account for about 65% of the total.

【００３９】[0039]

Furthermore, among words composed of two or more character types, the only patterns that can form a "morpheme", that is, "the smallest linguistic unit that loses its meaning if divided further", are "kanji-hiragana" and "hiragana-kanji". The former corresponds to a kanji followed by okurigana, as in 「極ま（る）」. The latter corresponds to words in which a difficult kanji is written in hiragana, as in 「えい（嬰）児」.

【００４０】[0040]

Accordingly, in this embodiment, several word types are defined based on the types of characters that make up a word in Japanese orthography and the patterns of transition between them, and they are described in the word type definition 11 in Backus-Naur Form (BNF). FIG. 6 shows the word type definition of one embodiment of the present invention, in which Japanese unknown words are classified into nine word types. Here, [...] denotes a match with any single character in the enclosed character set. Writing "-" between two characters denotes a character range. JIS X 0208 is assumed as the character code. "*" denotes zero or more repetitions, and "+" denotes one or more repetitions.

【００４１】[0041]

<sym>, <num>, <alpha>, <hira>, <kana>, and <kan> denote character strings composed of a single character type: symbol strings, digit strings, alphabetic strings, hiragana strings, katakana strings, and kanji strings, respectively. <kan-hira> and <hira-kan> denote character strings composed of two character types: a kanji string followed by a hiragana string, and a hiragana string followed by a kanji string, respectively. All other character strings composed of multiple character types are classified as <misc>.
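The nine word types can be approximated with regular expressions, as in the sketch below. FIG. 6 itself is not reproduced in the text, so the patterns are a guess at its content. Two further assumptions are made for illustration: Unicode code point ranges are used instead of the JIS X 0208 ranges assumed by the patent, and the names `WORD_TYPES` and `word_type` are hypothetical.

```python
import re

# Assumed Unicode ranges (the patent assumes JIS X 0208 instead).
HIRA = "\u3041-\u3096"
KATA = "\u30a1-\u30fa\u30fc"   # katakana plus the long vowel mark
KAN = "\u4e00-\u9fff\u3005"    # CJK ideographs plus the iteration mark
NUM = "0-9\uff10-\uff19"       # ASCII and full-width digits
ALPHA = "A-Za-z\uff21-\uff3a\uff41-\uff5a"

WORD_TYPES = [
    ("sym", f"[^{HIRA}{KATA}{KAN}{NUM}{ALPHA}]+"),
    ("num", f"[{NUM}]+"),
    ("alpha", f"[{ALPHA}]+"),
    ("hira", f"[{HIRA}]+"),
    ("kana", f"[{KATA}]+"),
    ("kan", f"[{KAN}]+"),
    ("kan-hira", f"[{KAN}]+[{HIRA}]+"),
    ("hira-kan", f"[{HIRA}]+[{KAN}]+"),
]

def word_type(s):
    """Return the word type of `s`, or "misc" if no pattern matches the whole string."""
    for name, pattern in WORD_TYPES:
        if re.fullmatch(pattern, s):
            return name
    return "misc"

examples = {s: word_type(s) for s in ["バニア", "ＥＮＩＡＣ", "極ま", "えい児", "バニア大学", "５０"]}
```

Because every pattern must match the whole string, a mixed string such as 「バニア大学」 falls through to <misc>.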

【００４２】[0042]

FIG. 6 is one example of the word type definition 11, and the definition of word types can be changed freely as needed. The word type determination unit 6 determines, based on the word type definition 11, the word type of an arbitrary character string when that string is regarded as a word. The unknown word candidate identifying unit 2 extracts the substrings of the input text that did not match the word dictionary 5, removes substrings that cannot be words (for example, those containing punctuation marks), and then creates unknown word candidates by having the word type determination unit 6 determine their word types.
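The candidate enumeration performed by the unknown word candidate identifying unit 2 might look like the sketch below. The punctuation set, the length cap `max_len`, and the function name are assumptions introduced for illustration (the patent states no length limit), and the word type determination step is omitted to keep the example short.

```python
PUNCTUATION = set("、。，．・「」『』（）")  # assumed set of characters that cannot occur inside a word

def unknown_word_candidates(text, dictionary_notations, max_len=8):
    """Return (start, end, substring) for substrings that are not in the
    dictionary and contain no punctuation."""
    candidates = []
    for start in range(len(text)):
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            s = text[start:end]
            if s in dictionary_notations:
                continue  # matched the dictionary, so it is a known word candidate
            if any(ch in PUNCTUATION for ch in s):
                continue  # cannot be a word
            candidates.append((start, end, s))
    return candidates

cands = unknown_word_candidates("バニア、大学", {"大学"}, max_len=3)
```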

【００４３】[0043]

(3) Method for estimating the part-of-speech-specific appearance probability of unknown word candidates: Let <U-t> denote the event that the part of speech of an unknown word is t. The part-of-speech-specific appearance probability of an unknown word, that is, the probability $P(c_1 \ldots c_k \mid \langle U\text{-}t \rangle)$ that an unknown word of part of speech t consists of the character string $c_1, \ldots, c_k$, can be decomposed without loss of generality into the following product of probabilities by considering a process in which the word type is chosen first, then the length, and finally the notation.

【００４４】[0044]

$$P(c_1 \ldots c_k \mid \langle U\text{-}t \rangle) = P(\langle WT \rangle \mid \langle U\text{-}t \rangle)\, P(k \mid \langle WT \rangle, \langle U\text{-}t \rangle)\, P(c_1 \ldots c_k \mid k, \langle WT \rangle, \langle U\text{-}t \rangle) \tag{5}$$

The unknown word candidate probability estimation unit 3 uses Equation (5) to estimate the part-of-speech-specific word appearance probability of an unknown word candidate.

【００４５】[0045]

The first term on the right-hand side of Equation (5) is obtained from the relative frequency of the relevant events in the learning text.

【００４６】[0046]

【数３】 [Equation 3]

【００４７】[0047]

Here, C(·) denotes the frequency of an event in the learning text. Accordingly, the word type frequency calculation unit 17 computes the frequency $C(\langle WT \rangle, \langle U\text{-}t \rangle)$ of each pair of word type and part of speech and the frequency $C(\langle U\text{-}t \rangle)$ of each part of speech in the learning text, and stores them in the word type frequency table 12. The word type probability estimation unit 7 then estimates the part-of-speech-specific word type probability using Equation (6).
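The image for Equation (6) is not available in this text. Judging from the two counts named in paragraph [0047], it is presumably the ratio C(<WT>, <U-t>) / C(<U-t>). The sketch below computes that ratio under this assumption; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def word_type_probabilities(rare_words):
    """Estimate P(<WT> | <U-t>) as C(<WT>, <U-t>) / C(<U-t>).

    `rare_words` is a list of (word_type, part_of_speech) pairs, one per
    occurrence of a word treated as unknown in the learning text.
    """
    pair_counts = Counter(rare_words)
    pos_counts = Counter(pos for _, pos in rare_words)
    return {(wt, pos): c / pos_counts[pos] for (wt, pos), c in pair_counts.items()}

observations = [("kana", "noun")] * 3 + [("kan", "noun")] + [("kan-hira", "verb")]
probs = word_type_probabilities(observations)
```

Here three of the four unknown nouns are katakana strings, so the estimate of P(<kana> | <U-noun>) is 0.75.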

【００４８】[0048]

The second term on the right-hand side of Equation (5) is approximated by a Poisson distribution whose parameter is the average word length of the words whose word type is <WT> and whose part of speech is <U-t>. This average word length is written

【００４９】[0049]

【数４】 [Equation 4]

【００５０】[0050]

and the Poisson approximation that takes it as its parameter is

【００５１】[0051]

【数５】 [Equation 5]
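The image for Equation (7) is not available here, so its exact form cannot be confirmed. As a rough sketch, the standard Poisson probability mass function with the average word length as its mean can be evaluated as follows. Whether the patent shifts the distribution so that lengths start at 1 is unknown, and the function name is hypothetical.

```python
import math

def word_length_probability(k, mean_length):
    """Standard Poisson pmf with mean `mean_length`, evaluated at word length k.

    Assumption: an unshifted Poisson distribution. The patent's own
    Equation (7) may differ in detail.
    """
    return math.exp(-mean_length) * mean_length ** k / math.factorial(k)

p_len3 = word_length_probability(3, 3.0)
```

One table entry per (word type, part of speech) pair is enough to evaluate this for any length, which is why only the average word length needs to be stored.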

【００５２】[0052]

Accordingly, the average word length calculation unit 18 computes, for the words in the learning text whose word type is <WT> and whose part of speech is <U-t>, the average word length

【００５３】[0053]

【数６】 [Equation 6]

【００５４】[0054]

and stores it in the average word length table 13. The word length probability estimation unit 8 then estimates the word length probability for each word type and part of speech using Equation (7).

The third term on the right-hand side of Equation (5) is obtained by approximating the appearance probability of the character string by a product of character bigram probabilities and then applying a correction that restricts the length. First, $P_b(c_1, \ldots, c_k \mid \langle WT \rangle, \langle U\text{-}t \rangle)$ is defined as follows. It corresponds to approximating the word notation probability for each word type and part of speech by a product of character bigram probabilities for that word type and part of speech.

【００５５】[0055]

【数７】 [Equation 7]

【００５６】[0056]

Next, $P_b(k \mid \langle WT \rangle, \langle U\text{-}t \rangle)$ is defined as follows. It corresponds to approximating the probability that a word of length k appears, for each word type and part of speech, by a character unigram model for that word type and part of speech.

【００５７】[0057]

【００５８】[0058]

The above equation uses the fact that the probability of the event that a word of length k appears equals the probability of the event that symbols other than the word-end symbol <eow> appear k-1 times and the word-end symbol then appears. The third term on the right-hand side of Equation (5), that is, the word notation probability $P(c_1 \ldots c_k \mid k, \langle WT \rangle, \langle U\text{-}t \rangle)$ for each word type, part of speech, and length, can be approximated by the ratio of the word notation probability $P_b(c_1 \ldots c_k \mid \langle WT \rangle, \langle U\text{-}t \rangle)$ for each word type and part of speech defined in Equation (8) to the appearance probability $P_b(k \mid \langle WT \rangle, \langle U\text{-}t \rangle)$ of a word of length k for each word type and part of speech defined in Equation (9).

【００５９】[0059]

【数９】 [Equation 9]
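The images for Equations (8) to (10) are not available, so the following is only a sketch of the ratio described in paragraph [0058]. Three assumptions are made: a word-start symbol `<bow>` pads the bigram product (the text names only `<eow>`), the length term follows the wording "k-1 non-final symbols, then the word-end symbol", and the probability tables are toy values for a single fixed word type and part of speech.

```python
def notation_probability(chars, bigram, p_eow):
    """Approximate P(c_1..c_k | k, <WT>, <U-t>) as P_b(c_1..c_k) / P_b(k).

    bigram: dict mapping (previous, current) to a character bigram probability.
    p_eow:  character unigram probability of the word-end symbol <eow>.
    """
    k = len(chars)
    seq = ["<bow>"] + list(chars) + ["<eow>"]
    p_b_chars = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p_b_chars *= bigram.get((prev, cur), 0.0)   # product of character bigrams
    p_b_len = (1.0 - p_eow) ** (k - 1) * p_eow       # length-k word under the unigram model
    return p_b_chars / p_b_len

toy_bigram = {("<bow>", "バ"): 0.5, ("バ", "ニ"): 0.4,
              ("ニ", "ア"): 0.5, ("ア", "<eow>"): 0.8}
p_notation = notation_probability("バニア", toy_bigram, p_eow=0.25)
```

Dividing by the length term removes the length preference that the bigram product carries implicitly, since the length is already modeled by the second term of Equation (5).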

【００６０】[0060]

Here, the question is how to estimate the character bigram probability $P(c_i \mid c_{i-1}, \langle WT \rangle, \langle U\text{-}t \rangle)$ for each word type and part of speech, which is needed to compute Equation (8). Basically, it can be estimated from the relative frequency $f(c_i \mid c_{i-1}, \langle WT \rangle, \langle U\text{-}t \rangle)$ of the corresponding events in the learning text.

【００６１】[0061]

【数１０】 [Equation 10]

【００６２】[0062]

However, because Japanese has more than 3,000 kinds of characters, splitting the character bigrams by word type and part of speech causes a data sparseness problem. Therefore, the character bigram probability for each word type and part of speech is obtained by the linear interpolation shown in the following equation.

【００６３】[0063]

【数１１】 [Equation 11]

【００６４】[0064]

Here, the $\alpha_i$ are weights satisfying $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1$, and they can be determined automatically by the linear interpolation method. $f(c_i \mid \langle WT \rangle, \langle U\text{-}t \rangle)$ and $f(c_i \mid c_{i-1}, \langle WT \rangle, \langle U\text{-}t \rangle)$ denote the relative frequencies of the character unigram and the character bigram for each word type and part of speech. $f(c_i)$ and $f(c_i \mid c_{i-1})$ are the relative frequencies of the character unigram and the character bigram. V is the number of distinct characters in the learning text.
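The image for Equation (12) is not available. Paragraph [0064] names four relative frequencies, five weights, and the character inventory size V, which suggests a five-way mixture whose last component is the uniform distribution 1/V. The sketch below assumes that reading. The pairing of each weight with each term, the function name, and the numbers are illustrative guesses.

```python
def interpolated_char_bigram(f_bi_typed, f_uni_typed, f_bi, f_uni, vocab_size, alphas):
    """Linear interpolation of character n-gram relative frequencies.

    f_bi_typed:  f(c_i | c_{i-1}, <WT>, <U-t>)
    f_uni_typed: f(c_i | <WT>, <U-t>)
    f_bi:        f(c_i | c_{i-1})
    f_uni:       f(c_i)
    vocab_size:  V, the number of distinct characters
    alphas:      five weights that sum to 1 (assumed order: most to least specific)
    """
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    a1, a2, a3, a4, a5 = alphas
    return (a1 * f_bi_typed + a2 * f_uni_typed
            + a3 * f_bi + a4 * f_uni + a5 / vocab_size)

# A bigram never seen for this word type and part of speech still gets a non-zero probability.
p_smoothed = interpolated_char_bigram(0.0, 0.1, 0.2, 0.05, 3000, (0.4, 0.2, 0.2, 0.1, 0.1))
```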

【００６５】[0065]

Accordingly, the character bigram frequency calculation unit 19 computes, from the learning data, the character unigram frequencies and character bigram frequencies needed to compute Equation (12). The word notation probability estimation unit 9 estimates the word notation probability for each word type, part of speech, and length using Equations (8), (9), (10), and (12). FIG. 7 shows examples of character bigrams by part of speech and word type according to the present invention.

【００６６】[0066]

Character bigrams contain information that is key to estimating the part of speech of an unknown word. For example, in FIG. 7, the first example shows that a katakana string ending in the long vowel mark 「ー」 is likely to be a noun, and the third example shows that a kanji string ending in 「的」 is likely to be an adjectival noun (keiyō-dōshi).

(4) Word bigram probability estimation process and optimum word string search process: When an input sentence consisting of the character string $C = c_1 \ldots c_m$ is segmented into the word string $W = w_1 \ldots w_n$, Japanese morphological analysis can be defined mathematically as the problem of finding the word string $W'$ that maximizes the conditional probability $P(W \mid C)$. Because the character string C is common to all segmentations, it suffices to find the word string that maximizes $P(W)$. Any word-based language model, such as a word n-gram or a class-based n-gram, can be used for $P(W)$ (the word segmentation model). The following example uses a word bigram.

【００６７】[0067]

$$W' = \arg\max_W P(W \mid C) = \arg\max_W P(W) \tag{13}$$

In this embodiment, $P(W)$ is approximated by a word bigram model as in the following equation.

【００６８】[0068]

【数１２】 [Equation 12]

【００６９】[0069]

Here, <bos> and <eos> are special symbols denoting the beginning and the end of a sentence. Basically, the word bigram probability can be obtained from the relative frequency $f(w_i \mid w_{i-1})$ of the corresponding events in the learning text.

【００７０】[0070]

【数１３】 [Equation 13]

【００７１】[0071]

In this embodiment, however, the ordinary word bigram model is extended as follows in order to estimate the part of speech of unknown words. Let the symbol <U-t> denote the event that the part of speech of an unknown word is t. When the word $w_i$ is an unknown word of part of speech t, the word bigram probability $P(w_i \mid w_{i-1})$ is approximated by the product of the probability $P(\langle U\text{-}t \rangle \mid w_{i-1})$ that an unknown word of part of speech t appears after the word $w_{i-1}$ and the probability $P(w_i \mid \langle U\text{-}t \rangle)$ that the notation of an unknown word of part of speech t is $w_i$.

【００７２】[0072]

【数１４】 [Equation 14]

【００７３】[0073]

$P(\langle U\text{-}t \rangle \mid w_{i-1})$ is obtained from relative frequencies in the learning text after every word whose frequency is at or below the threshold has been replaced by the unknown word symbol <U-t> corresponding to its part of speech t. In this embodiment, the frequency threshold is set to 1. $P(w_i \mid \langle U\text{-}t \rangle)$ is obtained by the unknown word candidate probability estimation unit 3 described in the previous section. Accordingly, the word bigram frequency calculation unit 20 computes the word bigram frequencies in the learning text in which words whose frequency is at or below the threshold have been replaced by the unknown word symbol <U-t> corresponding to their part of speech t, and stores them in the word bigram frequency table 15. The word bigram probability estimation unit 10 computes the word bigram probability using Equation (15) when $w_i$ is a known word and using Equation (16) when $w_i$ is an unknown word.
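The replacement of rare words by part-of-speech-specific unknown word symbols, followed by relative frequency estimation, can be sketched as follows. The function name, the corpus layout, and the toy sentences are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def unknown_aware_bigram_model(tagged_corpus, threshold=1):
    """Relative-frequency word bigrams after replacing rare words with <U-pos>.

    tagged_corpus: list of sentences, each a list of (notation, pos) pairs.
    Returns a dict mapping (previous, current) to P(current | previous).
    """
    word_counts = Counter(w for sentence in tagged_corpus for w in sentence)

    def symbol(word):
        # Words at or below the threshold become the unknown word symbol of their part of speech.
        return word if word_counts[word] > threshold else f"<U-{word[1]}>"

    bigrams, contexts = Counter(), Counter()
    for sentence in tagged_corpus:
        seq = ["<bos>"] + [symbol(w) for w in sentence] + ["<eos>"]
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {pair: c / contexts[pair[0]] for pair, c in bigrams.items()}

corpus = [
    [("東京", "noun"), ("の", "particle"), ("大学", "noun")],
    [("京都", "noun"), ("の", "particle"), ("大学", "noun")],
]
model = unknown_aware_bigram_model(corpus)
```

For an unknown candidate of part of speech t following a word w, the bigram probability described in paragraph [0071] is then `model[(w, "<U-t>")]` multiplied by the notation probability of the candidate obtained from the unknown word model.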

【００７４】[0074]

The optimum word string search unit 4 then uses dynamic programming to find, among all combinations of the word candidates generated by the word dictionary matching unit 1 and the unknown word candidates generated by the unknown word candidate identifying unit 2, the word string that maximizes the joint probability of Equation (14). Furthermore, in this embodiment, A* search is used so that an arbitrary number of morphological analysis candidates can be obtained in descending order of probability.
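The dynamic programming step can be illustrated with a small Viterbi-style search over the candidate lattice. This is a sketch under stated assumptions, not the patent's implementation: the candidate tuple layout, the function name `best_segmentation`, and all probabilities are invented for the example, and the A* search for the n best candidates is not shown.

```python
import math

def best_segmentation(n_chars, candidates, bigram_logprob):
    """Return (log probability, path) of the most probable candidate sequence.

    candidates: list of (start, end, label, emit_logprob). `label` is a known
        word or an unknown word symbol, and `emit_logprob` is log P(w | <U-t>)
        for unknown word candidates and 0.0 for known words.
    bigram_logprob(prev_label, label) returns log P(label | prev_label).
    """
    order = sorted(range(len(candidates)), key=lambda i: candidates[i][1])
    best = {}  # candidate index -> (best log probability, previous candidate index)
    for i in order:
        start, _end, label, emit = candidates[i]
        if start == 0:
            best[i] = (bigram_logprob("<bos>", label) + emit, None)
            continue
        # Extend every reachable candidate that ends where this one starts.
        options = [(best[j][0] + bigram_logprob(candidates[j][2], label) + emit, j)
                   for j in best if candidates[j][1] == start]
        if options:
            best[i] = max(options)
    finals = [(best[j][0] + bigram_logprob(candidates[j][2], "<eos>"), j)
              for j in best if candidates[j][1] == n_chars]
    if not finals:
        return None
    score, j = max(finals)
    path = []
    while j is not None:
        path.append(candidates[j])
        j = best[j][1]
    return score, path[::-1]

# Toy lattice for 「ペンシルバニア大学は」 (10 characters); all numbers are made up.
candidates = [
    (0, 4, "ペンシル", 0.0),
    (4, 7, "<U-noun>", math.log(1e-4)),   # バニア as an unknown noun
    (0, 7, "<U-noun>", math.log(1e-5)),   # ペンシルバニア as an unknown noun
    (7, 9, "大学", 0.0),
    (9, 10, "は", 0.0),
]
table = {("<bos>", "ペンシル"): 0.01, ("<bos>", "<U-noun>"): 0.1,
         ("ペンシル", "<U-noun>"): 0.05, ("<U-noun>", "大学"): 0.2,
         ("大学", "は"): 0.5, ("は", "<eos>"): 0.1}
score, path = best_segmentation(
    10, candidates, lambda prev, cur: math.log(table.get((prev, cur), 1e-6)))
```

In this toy lattice the path that treats 「ペンシルバニア」 as a single unknown noun has the higher joint probability, so the unknown word is not over-segmented into 「ペンシル」 and 「バニア」.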

【００７５】[0075]

FIG. 8 shows examples of word bigrams containing part-of-speech-specific unknown word symbols according to one embodiment of the present invention. By introducing an unknown word symbol for each part of speech, the word bigrams come to carry important information about the contexts in which unknown words appear. For example, in FIG. 8, the first example shows that a noun is likely to appear immediately after 「の」, and the second shows that a verb (sa-hen verb) is likely to appear immediately before 「し」.

【００７６】[0076]

A concrete processing example is given below. Here, in the input sentence 『ペンシルバニア大学はＥＮＩＡＣの５０周年を祝う』 ("The University of Pennsylvania celebrates the 50th anniversary of ENIAC"), 「ペンシルバニア」 ("Pennsylvania") and 「ＥＮＩＡＣ」 are assumed to be unknown words.

【００７７】[0077]

FIG. 9 shows an example of unknown word candidate generation and optimum word string search according to one embodiment of the present invention. The figure shows the state of the optimum word string search at character position 4 (between 「ペンシル」 and 「バニア」) of the sentence 「ペンシルバニア大学はＥＮＩＡＣの５０周年を祝う」. In the optimum word string search, the probabilities of the combinations of all word candidates ending at a given character position with all word candidates beginning at that position are computed.

【００７８】[0078]

Around character position 4, the substring 「ペンシル」 ("pencil") matches the word dictionary 5, and a word candidate is generated. Substrings that do not match the word dictionary 5 become unknown word candidates. Their word types are determined based on the word type definition table 11, and their part-of-speech-specific word appearance probabilities are obtained based on the unknown word model. For example, 「バニア」 is an unknown word candidate whose word type is <kana> (consisting only of katakana) and whose part of speech is noun (<U-noun>), and 「バニア大学」 is a noun unknown word candidate whose word type is <misc> (other).

【００７９】[0079]

To simplify the explanation, this figure shows only the case where the part of speech of the unknown word is noun (<U-noun>). In practice, in this embodiment, the word appearance probabilities for all parts of speech are computed for each substring, and only the five most probable parts of speech are generated as unknown word candidates. FIG. 10 shows an example of morphological analysis results according to one embodiment of the present invention, namely the top three morphological analysis candidates for the sentence 「ペンシルバニア大学はＥＮＩＡＣの５０周年を祝う」. In the first candidate, 「ペンシルバニア」 and 「ＥＮＩＡＣ」 are identified as unknown words of word types <kana> and <alpha>, respectively, and both are estimated to be nouns (<U-noun>). The second and third candidates differ in that 「バニア大学」 of word type <misc> and 「バニア」 of word type <kana>, respectively, are identified as unknown words (nouns).

【００８０】[0080]

As described above, classifying words based on character sets such as Japanese kanji, hiragana, and katakana and the transitions between them makes it possible to model Japanese words better. The above embodiment has been described based on the configuration of FIG. 3. The components of FIG. 3 can also be built as a program and stored on a disk device connected to the computer used as the morphological analyzer, or on a portable storage medium such as a floppy (registered trademark) disk or a CD-ROM, and installed when the present invention is carried out, so that the invention can be implemented easily.

【００８１】[0081]

[Effects of the Invention] As described above, according to the present invention, word types are defined based on the types of characters that make up a word and the transitions between them, the part-of-speech-specific word appearance probability of each unknown word candidate is estimated per word type, and the word string with the maximum joint probability is found using these probabilities together with word n-gram probabilities that include part-of-speech-specific unknown word symbols. This makes it possible to realize morphological analysis processing that prevents over-segmentation of unknown words and can estimate the part of speech of unknown words.

[Brief description of drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a block diagram of the Japanese morphological analyzer of the present invention.

FIG. 4 is a flowchart for explaining the morphological analysis process of the present invention.

FIG. 5 is a diagram showing the types of characters that make up words in the EDR corpus and the transitions between them, according to one embodiment of the present invention.

FIG. 6 shows the definition of word types according to one embodiment of the present invention.

FIG. 7 shows examples of character bigrams by part of speech and word type according to one embodiment of the present invention.

FIG. 8 shows examples of word bigrams containing part-of-speech-specific unknown word symbols according to one embodiment of the present invention.

FIG. 9 shows an example of unknown word candidate generation and optimum word string search according to one embodiment of the present invention.

FIG. 10 shows an example of morphological analysis results according to one embodiment of the present invention.

[Explanation of symbols]

1 Word dictionary matching means, word dictionary matching unit
2 Unknown word candidate identifying means, unknown word candidate identifying unit
3 Unknown word candidate probability estimation means, unknown word candidate probability estimation unit
4 Optimum word string search means, optimum word string search unit
5 Word dictionary
6 Word type determination means, word type determination unit
7 Word type probability estimation means, word type probability estimation unit
8 Word length probability estimation means, word length probability estimation unit
9 Word notation probability estimation means, word notation probability estimation unit
10 Word bigram probability estimation unit
11 Word type definition table
12 Word type frequency table
13 Average word length table
14 Character bigram frequency table
15 Word bigram frequency table
16 Word dictionary creating unit
17 Word type frequency calculation unit
18 Average word length calculation unit
19 Character bigram frequency calculation unit
20 Word bigram frequency calculation unit

Continuation of the front page

(56) References
JP-A-9-288673 (JP, A)
JP-A-8-315078 (JP, A)
Masaaki Nagata, "Automatic collection of unknown words based on expected word frequency," IPSJ SIG Technical Report, Japan, November 19, 1996, Vol. 96, No. 114, pp. 13-20
Masaaki Nagata, "Japanese character recognition error correction method using character similarity and statistical language model," IEICE Transactions, Japan, November 25, 1998, Vol. J81-D-II, No. 11, pp. 2624-2634

(58) Fields searched (Int. Cl.7, DB name)
G06F 17/21 - 17/28
G06K 9/72

Claims

(57) [Claims]

1. A morphological analyzer for performing morphological analysis of Japanese, comprising:
word dictionary matching means for searching a word dictionary for words that match substrings of an input text and generating them as word candidates;
an unknown word model consisting of: word type determination means for classifying an arbitrary character string into one of the word types by referring to a word type definition table in which the word types are defined based on the types of characters; word type probability estimation means for determining the appearance probability of a word type for each part of speech by referring to a word type frequency table in which the word type appearance frequency is defined for each part of speech; word length probability estimation means for determining the word length probability of an arbitrary length for each part of speech and word type by referring to an average word length table in which the average word length is defined for each part of speech and word type and approximating with a Poisson distribution; and word notation probability estimation means for determining the word notation probability of an arbitrary character string for each part of speech, word type, and word length by referring to a character n-gram frequency table in which the character n-gram frequency is defined for each part of speech and word type;
unknown word candidate identifying means for selecting, using the word type determination means of the unknown word model, as unknown word candidates those substrings of the input text that were not matched with the word dictionary by the word dictionary matching means and that may be unknown words;
unknown word candidate probability estimation means for estimating the part-of-speech-specific word appearance probability of each unknown word candidate using the word type determination means, the word length probability estimation means, and the word notation probability estimation means of the unknown word model; and
optimum word string search means for finding, over all combinations of the word candidates obtained by the word dictionary matching means and the unknown word candidates obtained by the unknown word candidate identifying means, the word string having the maximum joint probability, using the word n-gram probability obtained by referring to a word n-gram frequency table in which the word n-gram frequency is defined and the part-of-speech-specific word appearance probability obtained by the unknown word candidate probability estimation means.

2. The morphological analyzer according to claim 1, wherein the word notation probability estimation means uses either of the following methods: a method of obtaining the character n-gram probability by referring to the character n-gram frequency table and linearly interpolating the character n-gram frequencies of the same or lower order for each part of speech and word type with the character n-gram frequencies of the same or lower order that do not distinguish parts of speech and word types; or a method of normalizing the word notation probability obtained from the character n-gram frequencies by the sum of the probabilities assigned to the notations of all words of the same length.