JPH0844739A

JPH0844739A - Morpheme analysis device

Info

Publication number: JPH0844739A
Application number: JP6178102A
Authority: JP
Inventors: Atsushi Kawai; 淳河井; Kura Furuse; 蔵古瀬; Eiichiro Sumida; 英一郎隅田
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1994-07-29
Filing date: 1994-07-29
Publication date: 1996-02-16

Abstract

PURPOSE: To provide a morphological analyzer that can eliminate ambiguity in the analysis of morphemes related to adjunct words and other independent words, and that calculates a connection likelihood with higher precision than before. CONSTITUTION: The morphological analyzer is equipped with an analyzing means 10 that calculates and outputs the connection likelihood of an input morpheme string, based on each morpheme of the morpheme string of an input natural language sentence consisting of plural character strings. To do so it uses a morpheme dictionary, which is stored in a 1st storage device 12 and contains morpheme information showing the part of speech and inflectional form of each morpheme, and frequency data, which is stored in a 2nd storage device 13 and contains the frequency of connection between a 1st morpheme and the 2nd morpheme following it. The frequency data consists of n-gram frequency data, that is, the frequencies at which groups of a natural number n of connected morphemes appear, and includes connection frequency data on combinations of words and parts of speech for the 1st morpheme and the 2nd morpheme.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to the field of information processing, and in particular to a morphological analyzer that is used in natural language processing devices such as kana-kanji conversion devices, machine translation devices, and information retrieval devices, and that analyzes morphemes using n-gram frequencies of morphemes (n being a natural number) and outputs a morpheme string with a connection likelihood.

[0002]

[Prior Art] Morphological analysis means extracting morphemes from a given sentence and analyzing and determining how they combine to form words. Here, a morpheme is an identifiable segment of a word, that is, a segment obtained by dividing a natural language sentence so that it can be concretely identified. In conventional morphological analyzers, the plural morpheme candidates obtained from a morpheme dictionary are narrowed down by a connection check, performed either with a table that describes whether or not a given morpheme can be connected or with a grammar, and morphological analysis is then carried out. Furthermore, to control the analysis process and to decide the output result, heuristic methods have been used, such as giving priority to the occurrence frequency of single morphemes, or the longest-match method, which gives priority to morphemes with a longer appearance form.
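For illustration, the longest-match heuristic mentioned above can be sketched as follows. This is a minimal sketch and not the implementation of any particular analyzer: the function name `longest_match_segment`, the toy dictionary, and the single-character fallback for text not found in the dictionary are all assumptions made for the example.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation that always prefers the longest entry."""
    max_len = max((len(word) for word in dictionary), default=1)
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first and shrink until an entry matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in dictionary:
                break
        else:
            # Assumption: text not covered by the dictionary is emitted one character at a time.
            piece = text[pos]
        result.append(piece)
        pos += len(piece)
    return result


# Toy dictionary (hypothetical); real dictionaries also carry part-of-speech information.
toy_dictionary = {"こちら", "は", "事務局", "です", "で", "す"}
segments = longest_match_segment("こちらは事務局です", toy_dictionary)
```

Note that the heuristic chooses "です" over "で" followed by "す" purely because it is longer; it uses no information about how likely each morpheme is to follow the preceding one.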

[0003] However, this morphological analyzer has the following problems.
(1) Because morphological analysis is performed using a connection table or a grammar, the only information obtained is binary: whether or not morphemes can be connected. Merely passing that check does not narrow down the morpheme candidates sufficiently, and the number of candidates can become enormous. This leaves much ambiguity, which lowers the processing speed of the subsequent morphological analysis and the reliability of its result.
(2) Because only the occurrence frequency of morphemes is used, no information about connections with the preceding and following morphemes is obtained, and erroneous results are likely.
(3) To prioritize the branches that arise during morphological analysis, or to assign a likelihood to the final output candidates, the analyzer must rely on general heuristics such as the longest-match method, which ignore the individual character of word connections. With such methods it is difficult to tune the system to improve its performance.

[0004] As described above, the conventional methods have many problems. To solve them, the present inventors proposed, in Japanese Patent Application No. 5-187907, a morphological analyzer that performs morphological analysis by creating part-of-speech connection frequency data and morpheme connection frequency data separately and combining them to calculate the connection likelihood of morphemes (hereinafter referred to as the conventional morphological analyzer).

[0005]

[Problems to Be Solved by the Invention] The proposed conventional morphological analyzer has the following problems.
(1) Owing to the influence of part-of-speech connection frequencies, the connection likelihood does not become exactly 0 even for connections that cannot actually occur among morphemes related to adjunct words, and ambiguity can arise. This makes errors in morphological analysis more likely. It also enlarges the search space and increases processing time.
(2) When new words are registered in the morpheme dictionary to enlarge the vocabulary handled, there is no connection frequency information for those morphemes, so the likelihood drops sharply at those points and a proper result may not be output. Coping with this requires work such as creating the connection frequency data again.
(3) Because the number of combinations of connected appearance forms of morphemes is generally large, the morpheme connection frequency data can occupy so much of the computer's storage that the morphological analysis itself cannot be carried out.

[0006] An object of the present invention is to solve the above problems and to provide a morphological analyzer that can eliminate ambiguity in the analysis of morphemes related to adjunct words and other independent words, and that can calculate a connection likelihood more accurately than before.

[0007]

[Means for Solving the Problems] The morphological analyzer according to claim 1 of the present invention comprises analysis means that, based on each morpheme of the morpheme string of an input natural language sentence consisting of a plurality of character strings, calculates and outputs the connection likelihood of the input morpheme string, using a morpheme dictionary that is stored in a first storage device and contains morpheme information indicating the part of speech and inflectional form of each morpheme, and frequency data that is stored in a second storage device and contains the connection frequency between a first morpheme among the morphemes and a second morpheme that follows it. The analyzer is characterized in that the frequency data consists of n-gram frequency data, which is the frequency at which a set of n connected morphemes (n being a natural number) appears, and includes connection frequency data for combinations of words and parts of speech for the first morpheme and the second morpheme.

[0008] The morphological analyzer according to claim 2 is the morphological analyzer according to claim 1, characterized in that the frequency data consists of monogram frequency data and bigram frequency data.

[0009] Further, the morphological analyzer according to claim 3 is the morphological analyzer according to claim 1 or 2, characterized in that the frequency data includes, for adjunct words among the morphemes, connection frequency data for combinations of at least the part of speech and the inflectional form of each morpheme in the morpheme information, and includes, for independent words, connection frequency data for combinations of at least the part of speech and the inflectional form in the morpheme information.

[0010] Still further, the morphological analyzer according to claim 4 is the morphological analyzer according to claim 1, 2, or 3, characterized in that the analysis means comprises: first control means that, based on each input morpheme, reads the morpheme information for each morpheme from the first storage device using the morpheme dictionary; and second control means that, based on the morpheme information read for each morpheme by the first control means, uses the frequency data to calculate the connection likelihood for each morpheme read, and then calculates and outputs the connection likelihood of the input morpheme string.

[0011]

[Operation] In the morphological analyzer according to claim 1, the frequency data consists of n-gram frequency data, which is the frequency at which a set of n connected morphemes (n being a natural number) appears, and includes connection frequency data for combinations of words and parts of speech for the first morpheme and the second morpheme. The analysis means then calculates and outputs the connection likelihood of the input morpheme string, based on each morpheme of the morpheme string of the input natural language sentence consisting of a plurality of character strings, using the morpheme dictionary and the frequency data.

[0012] In the morphological analyzer according to claim 2, the frequency data preferably consists of monogram frequency data and bigram frequency data. In the morphological analyzer according to claim 3, the frequency data preferably includes, for adjunct words among the morphemes, connection frequency data for combinations of at least the part of speech and the inflectional form of each morpheme in the morpheme information, and includes, for independent words, connection frequency data for combinations of at least the part of speech and the inflectional form in the morpheme information. In the morphological analyzer according to claim 4, the first control means reads the morpheme information for each morpheme from the first storage device, based on each input morpheme and using the morpheme dictionary. The second control means then uses the frequency data, based on the morpheme information read for each morpheme by the first control means, to calculate the connection likelihood for each morpheme read, and afterwards calculates and outputs the connection likelihood of the input morpheme string.

[0013]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a morphological analyzer according to one embodiment of the present invention. As shown in FIG. 1, the morphological analyzer of this embodiment includes a morpheme analysis unit 10 that, based on the morpheme string of an input natural language sentence consisting of a plurality of character strings, uses a processing memory 11 together with the morpheme dictionary stored in a morpheme dictionary memory 12 and the frequency data stored in a frequency data memory 13 to calculate the connection likelihood of the input morpheme string, and outputs the morpheme string with that connection likelihood. In this embodiment, the frequency data consists of the monogram frequency data shown in Table 2 and the bigram frequency data shown in Table 3 and, in contrast to the conventional morphological analyzer, is characterized by containing precomputed connection frequency data for combinations of words and parts of speech. In the frequency data, adjunct words are covered by precomputed connection frequency data for combinations of the appearance form, standard form, part of speech, and inflectional form in the morpheme information, whereas independent words are covered by precomputed connection frequency data for combinations of the part of speech and inflectional form.

[0014] The morpheme string of an input natural language sentence consisting of a plurality of character strings, entered for example with an input device such as a keyboard, is input to the morpheme analysis unit 10. Connected to the morpheme analysis unit 10 are the processing memory 11, which is used as a working area for executing the morphological analysis process described in detail later with reference to FIGS. 2 and 3; the morpheme dictionary memory 12, which stores the morpheme dictionary shown in Table 1; and the frequency data memory 13, which stores the monogram frequency data shown in Table 2 and the bigram frequency data shown in Table 3. Based on the input morpheme string, the morpheme analysis unit 10 executes the morphological analysis process of FIGS. 2 and 3, using the processing memory 11, the morpheme dictionary stored in the morpheme dictionary memory 12, and the frequency data stored in the frequency data memory 13. It thereby calculates the connection likelihood of the input morpheme string and outputs the morpheme string with that connection likelihood. The output can be used, for example, when a kana-kanji conversion device in a word processor performs kana-kanji conversion, or when a machine translation device performs machine translation, to judge the connection between a first morpheme and the second morpheme that follows it, that is, to judge whether the text can be divided into morphemes in that way and whether those morphemes connect in sequence.

[0015] Table 1 shows an example of the morpheme dictionary stored in the morpheme dictionary memory 12. The morpheme dictionary is used to look up, from the appearance form of a morpheme (that is, the character string of the input morpheme), the other morpheme information for that morpheme: its standard form, part of speech, and inflectional form. Here, "access key" denotes information such as the input character string; the same applies below. The "standard form" is the character string that the morphological analyzer uses as the standard representation during analysis. An entry of "none" in the "inflectional form" column indicates that the morpheme has no inflectional form. In this embodiment, the morpheme analysis unit 10 refers to the morpheme dictionary stored in the morpheme dictionary memory 12 and, for each morpheme in the input morpheme string, retrieves the corresponding standard form, part of speech, and inflectional form.

[0016]

[Table 1]
──────────────────────────────────────────────────────────────
Access key
(appearance form)   Standard form   Part of speech      Inflectional form
──────────────────────────────────────────────────────────────
こちら              此方            pronoun             none
事務局              事務局          common noun         none
す                  為る            main verb           terminal form
は                  は              binding particle    none
で                  で              case particle       none
です                です            auxiliary verb      terminal form
…                   …               …                   …
──────────────────────────────────────────────────────────────
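The dictionary lookup described for Table 1 can be sketched as a mapping from appearance form to morpheme information. This is only an illustrative sketch: the names `MorphemeInfo`, `MORPHEME_DICTIONARY`, and `lookup_prefixes` are hypothetical, the English part-of-speech labels are this example's renderings of the Japanese labels, and enumerating every prefix of the remaining text is an assumed lookup strategy (the actual search procedure is the one described later with reference to FIG. 3).

```python
from collections import namedtuple

# One dictionary record holding the four fields of Table 1.
MorphemeInfo = namedtuple("MorphemeInfo", "appearance standard pos inflection")

# Entries transcribed from Table 1; None stands for "none" (no inflectional form).
MORPHEME_DICTIONARY = {
    "こちら": [MorphemeInfo("こちら", "此方", "pronoun", None)],
    "事務局": [MorphemeInfo("事務局", "事務局", "common noun", None)],
    "す": [MorphemeInfo("す", "為る", "main verb", "terminal")],
    "は": [MorphemeInfo("は", "は", "binding particle", None)],
    "で": [MorphemeInfo("で", "で", "case particle", None)],
    "です": [MorphemeInfo("です", "です", "auxiliary verb", "terminal")],
}


def lookup_prefixes(text, dictionary=MORPHEME_DICTIONARY):
    """Return every dictionary record whose appearance form is a prefix of text."""
    hits = []
    for end in range(1, len(text) + 1):
        hits.extend(dictionary.get(text[:end], []))
    return hits


# Both "で" (case particle) and "です" (auxiliary verb) are candidates here,
# which is exactly the kind of ambiguity the connection likelihood must resolve.
candidates = lookup_prefixes("です")
```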

[0017] Tables 2 and 3 show examples of the frequency data calculated in advance and stored in the frequency data memory 13: Table 2 is frequency data giving the occurrence frequency of monograms, and Table 3 is frequency data giving the connection frequency of bigrams. The frequency data is the occurrence frequency of sets of n connected items, each item being an adjunct-word morpheme or an independent-word part of speech (called the n-gram frequency, where the connection count n is a natural number). When n = 1, it is equivalent to the occurrence frequency of an adjunct-word morpheme or an independent-word part of speech. In this embodiment, two kinds of frequency are used for all adjunct words and for independent-word parts of speech with inflectional form information: the occurrence frequency for n = 1 (hereinafter, the monogram frequency) and the connection frequency for n = 2 (hereinafter, the bigram frequency). The frequencies in Tables 2 and 3 were calculated in advance from a dialogue text database for research owned by the present applicant. This is a database of text from Japanese and English conversations that assume a specific situation, such as making a reservation for an international conference, with basic linguistic analysis information added.

[0018]

[Table 2] Monogram frequency
──────────────────────────────────────────────────────────────────────────
Access key
Appearance form   Standard form   Part of speech      Inflectional form   Frequency
──────────────────────────────────────────────────────────────────────────
の                の              case particle       none                 5714
は                は              binding particle    none                 3214
に                に              case particle       none                 3751
です              です            auxiliary verb      terminal             4410
あり              ある            subsidiary verb     continuative           64
*                 *               pronoun             none                 3086
*                 *               common noun         none                22114
…                 …               …                   …                   …
──────────────────────────────────────────────────────────────────────────
(Note) *: indefinite (matches any word).

[0019]

[Table 3] Bigram frequency
──────────────────────────────────────────────────────────────────────────────────────────────────────────
Access key
Appearance  Standard  Part of speech 1   Inflectional  Appearance  Standard  Part of speech 2   Inflectional  Frequency
form 1      form 1                       form 1        form 2      form 2                       form 2
──────────────────────────────────────────────────────────────────────────────────────────────────────────
*           *         pronoun            none          *           *         common noun        none             98
*           *         pronoun            none          の          の        case particle      none            702
*           *         pronoun            none          に          に        case particle      none            176
*           *         pronoun            none          は          は        binding particle   none            343
*           *         common noun        none          で          で        case particle      none           1108
*           *         common noun        none          です        です      auxiliary verb     terminal        872
は          は        binding particle   none          *           *         common noun        none            566
は          は        binding particle   none          あり        ある      subsidiary verb    continuative     13
…           …         …                  …             …           …         …                  …               …
──────────────────────────────────────────────────────────────────────────────────────────────────────────
(Note) *: indefinite (matches any word).

[0020] In the conventional morphological analyzer, the frequency data contains monogram frequencies and bigram frequencies calculated in advance separately for words and for parts of speech. In contrast, this embodiment of the present invention is characterized by containing precomputed connection frequency data for combinations of words and parts of speech. In that connection frequency data, adjunct words are covered by precomputed connection frequency data for combinations of the appearance form, standard form, part of speech, and inflectional form in the morpheme information, whereas independent words are covered by precomputed connection frequency data for combinations of the part of speech and inflectional form.

[0021] In Table 2, rows 1 to 5 of the monogram frequency data describe adjunct-word morphemes in word monogram format, with no indefinite fields, whereas rows 6 and 7 describe independent-word morphemes in part-of-speech monogram format, with indefinite fields.
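The two row formats of Table 2 suggest a simple way to build a lookup key: an adjunct word keeps all four fields, while an independent word has its appearance form and standard form replaced by the indefinite marker. The sketch below illustrates this under stated assumptions: the names `ADJUNCT_POS`, `frequency_key`, and `monogram_frequency` are hypothetical, the English part-of-speech labels are this example's renderings, and the set of parts of speech treated as adjunct words is inferred only from rows 1 to 5 of Table 2.

```python
WILDCARD = "*"  # the "indefinite" marker of Table 2

# Assumption: parts of speech treated as adjunct words, inferred from rows 1-5 of Table 2.
ADJUNCT_POS = {"case particle", "binding particle", "auxiliary verb", "subsidiary verb"}

# Monogram frequencies transcribed from Table 2.
# Key: (appearance form, standard form, part of speech, inflectional form).
MONOGRAM_FREQ = {
    ("の", "の", "case particle", None): 5714,
    ("は", "は", "binding particle", None): 3214,
    ("に", "に", "case particle", None): 3751,
    ("です", "です", "auxiliary verb", "terminal"): 4410,
    ("あり", "ある", "subsidiary verb", "continuative"): 64,
    (WILDCARD, WILDCARD, "pronoun", None): 3086,
    (WILDCARD, WILDCARD, "common noun", None): 22114,
}


def frequency_key(appearance, standard, pos, inflection):
    """Adjunct words keep the word itself; independent words fall back to the wildcard."""
    if pos in ADJUNCT_POS:
        return (appearance, standard, pos, inflection)
    return (WILDCARD, WILDCARD, pos, inflection)


def monogram_frequency(appearance, standard, pos, inflection):
    """Look up the monogram frequency; entries missing from the table count as 0."""
    return MONOGRAM_FREQ.get(frequency_key(appearance, standard, pos, inflection), 0)


noun_count = monogram_frequency("事務局", "事務局", "common noun", None)
particle_count = monogram_frequency("は", "は", "binding particle", None)
```

With this keying, any common noun shares the single part-of-speech row, so a noun newly added to the dictionary still finds a frequency entry without the frequency data being rebuilt.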

[0022] Table 3 contains connection frequency data between a first morpheme and the second morpheme that follows it. In Table 3, appearance form 1, standard form 1, part of speech 1, and inflectional form 1 relate to the first morpheme, and appearance form 2, standard form 2, part of speech 2, and inflectional form 2 relate to the second morpheme. Row 1 of the data shows the connection frequency between an indefinite first morpheme and an indefinite second morpheme; rows 2 to 5 show the connection frequency between an indefinite first morpheme and an adjunct-word second morpheme; row 6 shows the connection frequency between an indefinite first morpheme and an independent-word second morpheme; row 7 shows the connection frequency between an adjunct-word first morpheme and an indefinite second morpheme; and row 8 shows the connection frequency between an adjunct-word first morpheme and an independent-word second morpheme.
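As an illustration of how the bigram rows of Table 3 might be turned into a value between 0 and 1, the sketch below divides a bigram frequency by the monogram frequency of the first item. This particular estimate is an assumption made for the example, not a formula stated in this part of the description; the actual likelihood calculation is the one described later with reference to FIG. 3 and may differ. The constant names and `relative_connection_frequency` are hypothetical.

```python
WILDCARD = "*"

# Keys in the style of Tables 2 and 3: (appearance form, standard form, part of speech, inflectional form).
PRONOUN = (WILDCARD, WILDCARD, "pronoun", None)
COMMON_NOUN = (WILDCARD, WILDCARD, "common noun", None)
NO = ("の", "の", "case particle", None)
NI = ("に", "に", "case particle", None)
WA = ("は", "は", "binding particle", None)
DE = ("で", "で", "case particle", None)
DESU = ("です", "です", "auxiliary verb", "terminal")
ARI = ("あり", "ある", "subsidiary verb", "continuative")

# Bigram frequencies transcribed from Table 3.
BIGRAM_FREQ = {
    (PRONOUN, COMMON_NOUN): 98,
    (PRONOUN, NO): 702,
    (PRONOUN, NI): 176,
    (PRONOUN, WA): 343,
    (COMMON_NOUN, DE): 1108,
    (COMMON_NOUN, DESU): 872,
    (WA, COMMON_NOUN): 566,
    (WA, ARI): 13,
}

# Monogram frequencies of the first items, transcribed from Table 2.
MONOGRAM_FREQ = {PRONOUN: 3086, COMMON_NOUN: 22114, WA: 3214}


def relative_connection_frequency(first, second):
    """Assumed estimate: bigram count divided by the monogram count of the first item."""
    total = MONOGRAM_FREQ.get(first, 0)
    if total == 0:
        return 0.0
    return BIGRAM_FREQ.get((first, second), 0) / total


pronoun_then_no = relative_connection_frequency(PRONOUN, NO)  # 702 / 3086
pronoun_then_de = relative_connection_frequency(PRONOUN, DE)  # pair absent from the excerpt
```

Under this estimate a pair that does not occur in the frequency data receives exactly 0, which matches the stated goal of ruling out connections involving adjunct words that cannot actually occur. Table 3 is only an excerpt, so a zero in this toy example does not mean the pair is absent from the full data.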

[0023] Because the frequency data of the conventional morphological analyzer contains monogram and bigram frequencies calculated in advance separately for words and for parts of speech, separate occurrence frequencies exist for words and for parts of speech; this increases the ambiguity of morphological analysis, and its accuracy is relatively low. In this embodiment, by contrast, the connection likelihood is calculated using monogram frequency data and bigram frequency data for combinations of words and parts of speech, so the ambiguity of morphological analysis is smaller than in the conventional morphological analyzer and its accuracy can be made higher.

[0024] For each morpheme to be analyzed that has been read out using the morpheme dictionary, the morpheme analysis unit 10 searches the frequency data memory 13 and reads out the frequency from the frequency data shown in Tables 2 and 3. It then calculates the connection likelihood for the input morpheme string and outputs the morpheme string with that connection likelihood.

[0025] First, an outline of the morphological analysis process using the frequency data is given. The working area W used during the morphological analysis process is set up in the processing memory 11 and, as shown in FIG. 4, a collection of one or more candidate sets Cn (n = 1, 2, 3, ...) is stored in it. Each candidate set Cn consists of an unprocessed character string S, a morpheme string candidate M, and a cumulative morpheme connection likelihood La. The unprocessed character string S is the part of the input character string that has not yet been analyzed at that point in the process, and the morpheme string candidate M is the morpheme string that has already been analyzed. The cumulative morpheme connection likelihood La is the morpheme connection likelihood accumulated over the morpheme string candidate M; it takes a value between 0 and 1, and the closer it is to 1, the more plausible the candidate. Because of ambiguity in how morphemes are delimited and in their standard forms, parts of speech, and inflectional forms, multiple combinations of unprocessed character string S, morpheme string candidate M, and cumulative morpheme connection likelihood La arise. As needed during processing, these are stored dynamically as new candidate sets Cn in the working area W in the processing memory 11.

[0026] The morphological analysis process is applied in turn to the unprocessed character string in each candidate set Cn, and the result is stored as the morpheme string candidate M. The plausibility of the result is calculated from the frequency data, and the value so calculated is accumulated by multiplying it into the cumulative morpheme connection likelihood La, which is thereby updated. When the termination condition of the morphological analysis is satisfied, the analysis results are taken from the candidate sets Cn and output as pairs of a morpheme string carrying morpheme information and its analyzed connection likelihood. In general, there are multiple such output pairs.
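The candidate set and its multiplicative update can be sketched as a small data structure. This is a minimal sketch: the class name `CandidateSet`, the function `extend`, and the step likelihoods 0.05, 0.04, and 0.5 are hypothetical values chosen for illustration and are not taken from the frequency tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSet:
    unprocessed: str                                    # unprocessed character string S
    morphemes: List[str] = field(default_factory=list)  # morpheme string candidate M
    likelihood: float = 1.0                             # cumulative connection likelihood La


def extend(candidate, appearance, step_likelihood):
    """Consume one morpheme from S, append it to M, and multiply its likelihood into La."""
    if not candidate.unprocessed.startswith(appearance):
        raise ValueError("morpheme does not match the start of the unprocessed string")
    return CandidateSet(
        unprocessed=candidate.unprocessed[len(appearance):],
        morphemes=candidate.morphemes + [appearance],
        likelihood=candidate.likelihood * step_likelihood,
    )


# Segmentation ambiguity produces several new candidate sets from one.
start = CandidateSet(unprocessed="です")
branch_a = extend(start, "で", 0.05)      # leaves "す" still to be analyzed
branch_b = extend(start, "です", 0.04)    # consumes the whole string
branch_a2 = extend(branch_a, "す", 0.5)   # La becomes 0.05 * 0.5
```

Returning a new object instead of modifying the old one keeps the parent candidate intact, so each branch can be stored in the working area as its own candidate set.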

[0027] FIG. 2 is a flowchart of the morphological analysis process executed by the morpheme analysis unit 10. In FIG. 2, initialization is first performed in step S1. Here, one empty candidate set Cn is prepared in the working area W in the processing memory 11, the input sentence is set as its unprocessed character string S, a dummy sentence-head morpheme is set as its morpheme string candidate M, and the cumulative morpheme connection likelihood La is initialized to 1. The dummy sentence-head morpheme is a virtual morpheme that does not appear in the output result but is used to calculate the connection likelihood between the first morpheme and the head of the sentence.

[0028] Next, in step S2, dictionary search and likelihood calculation are executed, as described in detail later with reference to FIG. 3. In this process, the morpheme dictionary is searched and the morpheme connection likelihood is calculated for each candidate set Cn in the working area W of the processing memory 11, and the contents of the working area W are updated. Then, in step S3, the number of candidates is limited. The candidate sets Cn in the working area W are ranked in descending order of cumulative morpheme connection likelihood La, and low-ranked candidate sets Cn are deleted from the working area W so that the number of candidates stays within a suitable limit. This limit is set in advance as a parameter. In the following steps S4, S5, and S6, the termination conditions are tested; steps S2 and S3 are repeated on the unprocessed character strings of the candidate sets Cn remaining in the working area W until one of the following termination conditions is satisfied.
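The candidate-limiting step S3 amounts to keeping only the highest-ranked candidates. A minimal sketch, assuming each candidate set is represented as a tuple (S, M, La); the function name `limit_candidates` and the parameter name `max_candidates` are hypothetical:

```python
from typing import List, Tuple

Candidate = Tuple[str, Tuple[str, ...], float]  # (S, M, La)


def limit_candidates(working_area: List[Candidate], max_candidates: int) -> List[Candidate]:
    """Rank candidate sets by La (highest first) and drop the low-ranked ones."""
    ranked = sorted(working_area, key=lambda c: c[2], reverse=True)
    return ranked[:max_candidates]


# Toy values, not taken from the source.
kept = limit_candidates([("a", (), 0.1), ("b", (), 0.5), ("c", (), 0.3)], 2)
```

How ties in La are broken is not stated in the source; this sketch simply keeps the original order among equal values, because Python's sort is stable.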

[0029] In step S4, it is determined whether termination condition 1 is satisfied: "no unprocessed character string remains in any candidate set Cn in the working area W, so that processing has ended completely." If it is satisfied, the process proceeds to step S7; otherwise it proceeds to step S5. In step S5, it is determined whether termination condition 2 is satisfied: "the number of candidate sets Cn in the working area W whose unprocessed character string is exhausted has reached a fixed number, so that processing has ended partially." If it is satisfied, the process proceeds to step S7; otherwise it proceeds to step S6. The fixed number in termination condition 2 is set in advance as a parameter. In step S6, it is determined whether termination condition 3 is satisfied: "every candidate set Cn has failed to connect; in other words, either the next morpheme cannot be found in the dictionary or the morpheme connection likelihood Lm described later is 0, so that no candidate set Cn needed to continue the analysis remains in the working area W of the processing memory 11." If it is satisfied, the process proceeds to step S8; otherwise it returns to step S2 and repeats the processing from step S2.
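The three-way branch of steps S4 to S6 can be sketched as a small decision function. This is only an illustration under stated assumptions: candidate sets are tuples (S, M, La), the name `next_step` and the parameter `finished_quota` (the "fixed number" of condition 2, assumed to be at least 1) are hypothetical, and an empty working area is treated as the failure case of condition 3.

```python
from typing import List, Tuple

Candidate = Tuple[str, Tuple[str, ...], float]  # (S, M, La)


def next_step(working_area: List[Candidate], finished_quota: int) -> str:
    """Return the label of the step to run next, following S4 -> S5 -> S6."""
    finished = sum(1 for s, _, _ in working_area if s == "")
    if working_area and finished == len(working_area):
        return "S7"   # condition 1: every candidate has consumed its input
    if finished >= finished_quota:
        return "S7"   # condition 2: enough candidates have finished
    if not working_area:
        return "S8"   # condition 3: nothing left to continue with
    return "S2"       # otherwise run dictionary search and pruning again
```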

[0030] When termination condition 1 or termination condition 2 is satisfied, in step S7 the analysis result is extracted from each candidate set Cn whose unprocessed character string is exhausted. That is, the morpheme string with its morpheme information is taken from the morpheme string candidate M, the value of the cumulative morpheme connection likelihood La is taken as the connection likelihood of the analysis result, and the pair of morpheme string and connection likelihood is output, whereupon processing ends. In general there are multiple such results. When termination condition 3 is satisfied, a message reporting that analysis is impossible is output in step S8 and processing ends.
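The output step S7 can be sketched as follows, assuming candidate sets are tuples (S, M, La) and that the dummy sentence-head morpheme is stored as the first element of M under the placeholder symbol `"<BOS>"`. Both the symbol and the function name `extract_results` are assumptions of this sketch, and sorting the output by La is a convenience added here, not something the source prescribes for S7.

```python
from typing import List, Tuple

Candidate = Tuple[str, Tuple[str, ...], float]  # (S, M, La)
Result = Tuple[Tuple[str, ...], float]          # (morpheme string, likelihood)


def extract_results(working_area: List[Candidate], head: str = "<BOS>") -> List[Result]:
    """Collect (morphemes, La) from finished candidates, hiding the dummy head."""
    results: List[Result] = []
    for s, m, la in working_area:
        if s != "":
            continue                      # still has unprocessed input: skip
        visible = m[1:] if m and m[0] == head else m
        results.append((visible, la))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy working area, not taken from the source.
out = extract_results([
    ("", ("<BOS>", "a", "b"), 0.02),
    ("c", ("<BOS>", "a"), 0.30),
    ("", ("<BOS>", "ab"), 0.05),
])
```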

[0031] FIG. 3 is a flowchart of the subroutine for morpheme dictionary search and likelihood calculation in step S2. In FIG. 3, first, in step S11, an empty provisional working area W' is prepared in the processing memory 11; the processing of steps S14 through S18 is then executed for each of the candidate sets Cn (n = 1, 2, 3, ...) in the working area W. In step S12, the parameter n is set to 0, and in step S13 n is incremented by 1.

[0032] In step S14, the morpheme dictionary is searched for the candidate set Cn. In this process, the morpheme dictionary in the morpheme dictionary memory 12 is searched for morpheme candidates whose appearance form (surface form) matches a substring starting at the beginning of the unprocessed character string S. All such candidates are retrieved, regardless of the number of characters in the appearance form. Next, in step S15, it is determined whether the number of morpheme candidates found in step S14 exceeds 0. If not even one morpheme was found (NO in step S15), morpheme analysis of the candidate set Cn is judged to have failed, and the process proceeds to step S19 without executing steps S16 through S18.
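The search in step S14 is a lookup of every prefix of S, whatever its length. A minimal sketch, assuming the dictionary can be treated as a set of appearance forms; a real morpheme dictionary would also return standard form, part of speech, and inflectional form. The name `lookup_prefixes` is hypothetical.

```python
from typing import List, Set


def lookup_prefixes(unprocessed: str, dictionary: Set[str]) -> List[str]:
    """Return every dictionary entry that matches a prefix of S, shortest first."""
    return [
        unprocessed[:end]
        for end in range(1, len(unprocessed) + 1)
        if unprocessed[:end] in dictionary
    ]


candidates = lookup_prefixes("です", {"で", "です", "す"})
```

An empty result corresponds to the NO branch of step S15.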

[0033] If the number of morpheme candidates exceeds 0 in step S15, the candidate set Cn is copied in step S16. That is, when k morphemes have been found, k copies of the candidate set Cn are prepared in the working area W', and k pairs are formed, each consisting of one copy of Cn and one of the k morphemes found. Steps S17 and S18 below are executed for each of these k pairs. In step S17, the morpheme connection likelihood is calculated. Let c1 be the last morpheme of the morpheme string candidate M and c2 the morpheme found in step S14. Using the monogram frequency Fm(c1) of morpheme c1 and the bigram frequency Fb(c1, c2) between morphemes c1 and c2, both stored in the frequency data memory 13, the morpheme connection likelihood Lm(c1, c2) is calculated by Equation 1 below.

[0034]

[Equation 1]
$$
L_m(c_1, c_2) =
\begin{cases}
F_b(c_1, c_2) / F_m(c_1) & \text{(a) if } F_m(c_1) \neq 0 \\
0 & \text{(b) if } F_m(c_1) = 0
\end{cases}
$$
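Equation 1 translates directly into code. The sketch below assumes the monogram and bigram frequencies are held in plain dictionaries keyed by morpheme and by morpheme pair; the function and parameter names are hypothetical, and the counts in the usage lines are toy values.

```python
from typing import Dict, Tuple


def connection_likelihood(
    c1: str,
    c2: str,
    monogram: Dict[str, int],
    bigram: Dict[Tuple[str, str], int],
) -> float:
    """Lm(c1, c2) = Fb(c1, c2) / Fm(c1), or 0 when Fm(c1) is 0."""
    fm = monogram.get(c1, 0)
    if fm == 0:
        return 0.0                      # case (b): c1 never observed
    return bigram.get((c1, c2), 0) / fm  # case (a)


# Toy counts for illustration only.
lm_seen = connection_likelihood("a", "b", {"a": 4}, {("a", "b"): 1})
lm_unseen_pair = connection_likelihood("a", "c", {"a": 4}, {("a", "b"): 1})
lm_unseen_c1 = connection_likelihood("z", "b", {"a": 4}, {("a", "b"): 1})
```

Treating a missing bigram as frequency 0 is an assumption of this sketch; it matches the behavior described for pairs with no frequency data, which yield a likelihood of 0.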

[0035] Next, in step S18, the copy of the candidate set Cn is updated. When the morpheme connection likelihood Lm(c1, c2) is 0, morphemes c1 and c2 do not connect, so the attempt is judged a failure: morpheme c2 is rejected, and the copy of the candidate set Cn is deleted from the working area W' in the processing memory 11. When Lm(c1, c2) > 0, morphemes c1 and c2 can connect, so morpheme c2 is appended to the morpheme string candidate M in the copy of the candidate set Cn, the character string corresponding to c2 is removed from its unprocessed character string S, and Lm(c1, c2) is multiplied into the cumulative morpheme connection likelihood La, which is thereby updated cumulatively. That is, letting La' denote the new value of the cumulative morpheme connection likelihood, the value La' obtained by Equation 2 below is assigned to La in the candidate set Cn.

[0036]

[Equation 2]
$$
L_a' = L_m \times L_a
$$

[0037] In step S19, it is determined whether a next candidate set Cn+1 exists. If it does (YES in step S19), the process returns to step S13 and the above processing is repeated. If it does not (NO in step S19), in step S20 the working area W in the processing memory 11 is replaced with the processed working area W', and control returns to the main routine.
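Putting steps S11 through S20 together, one pass of the FIG. 3 subroutine might look like the following sketch. It makes several simplifying assumptions that are not in the source: candidate sets are tuples (S, M, La), the dictionary is a set of appearance forms, frequencies are keyed by appearance form alone rather than by word and part-of-speech combinations, and candidates that have already consumed all input are carried over unchanged. All names, the `"<BOS>"` symbol, and the monogram count for で are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Candidate = Tuple[str, Tuple[str, ...], float]  # (S, M, La)


def expand_working_area(
    working_area: List[Candidate],
    dictionary: Set[str],
    monogram: Dict[str, int],
    bigram: Dict[Tuple[str, str], int],
) -> List[Candidate]:
    """One pass over W, returning the provisional working area W'."""
    new_area: List[Candidate] = []            # S11: empty W'
    for s, m, la in working_area:             # S12, S13, S19: loop over Cn
        if not s:                             # assumption: keep finished ones
            new_area.append((s, m, la))
            continue
        c1 = m[-1]                            # last morpheme of M
        for end in range(1, len(s) + 1):      # S14: every matching prefix
            c2 = s[:end]
            if c2 not in dictionary:
                continue
            fm = monogram.get(c1, 0)          # S17: Equation 1
            lm = bigram.get((c1, c2), 0) / fm if fm else 0.0
            if lm > 0:                        # S18: drop copies that fail
                new_area.append((s[end:], m + (c2,), la * lm))
    return new_area                           # S20: caller replaces W by W'


dictionary = {"で", "です", "す"}
monogram = {"事務局": 22114, "で": 1000}       # the count for "で" is made up
bigram = {("事務局", "で"): 1108, ("事務局", "です"): 872}
w0 = [("です", ("<BOS>", "こちら", "は", "事務局"), 0.06)]
w1 = expand_working_area(w0, dictionary, monogram, bigram)
w2 = expand_working_area(w1, dictionary, monogram, bigram)
```

After the first pass both readings (で and です) survive; after the second pass only the です reading remains, because no bigram count exists for で followed by す.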

[0038] A concrete example of the morpheme analysis process is now described with reference to FIG. 4. Suppose the input is the sentence 「こちらは事務局です」 ("This is the secretariat"). Suppose further that the morpheme analysis process of FIG. 2 has advanced to an intermediate state (hereinafter, state SS1) in which the morpheme string candidate M and the corresponding unprocessed character string S and cumulative morpheme connection likelihood La are as follows.

<Candidate set Cm in state SS1>
(a) Unprocessed character string S = "です"
(b) Morpheme string candidate M = (こちらは事務局)
(c) Cumulative morpheme connection likelihood La = 0.06

[0039] When the morpheme dictionary search of step S14 is executed on the unprocessed character string S in state SS1, searching the morpheme dictionary stored in the morpheme dictionary memory 12 yields the following morpheme candidates.

<Morpheme candidate 1> で (Lm1 = 0.050)
<Morpheme candidate 2> です (Lm2 = 0.039)

The values in parentheses are the morpheme connection likelihoods Lm1 and Lm2 with the preceding morpheme "事務局". They are calculated in step S17 from the following frequency data.

(a) Monogram frequency data in the 7th row of Table 2: Fm = 22114
(b) Bigram frequency data in the 5th row of Table 3: Fb1 = 1108
(c) Bigram frequency data in the 6th row of Table 3: Fb2 = 872

The morpheme connection likelihoods Lm1 and Lm2 are then calculated as follows.

[0040]

[Equation 3]
$$
L_{m1}(\text{事務局}, \text{で}) = F_{b1} / F_m = 1108 / 22114 = 0.050
$$

[Equation 4]
$$
L_{m2}(\text{事務局}, \text{です}) = F_{b2} / F_m = 872 / 22114 = 0.039
$$

[0041] At this point, multiple combinations of morpheme string candidate M and unprocessed character string S arise, as follows, and they are stored as candidate sets in the working area W' in the processing memory 11.

<Candidate set Cm1 (first combination)>
(a) Unprocessed character string S = "す"
(b) Morpheme string candidate M = (こちらは事務局で)
(c) Cumulative morpheme connection likelihood La = 0.060 × 0.050 = 0.003

<Candidate set Cm2 (second combination)>
(a) Unprocessed character string S = "" → processing finished
(b) Morpheme string candidate M = (こちらは事務局です)
(c) Cumulative morpheme connection likelihood La = 0.060 × 0.039 = 0.002
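The arithmetic of this example can be checked with a few lines of Python. The frequencies and the starting La are the ones quoted in the example; the variable names are introduced here, and rounding to three decimal places mirrors the precision used in the text.

```python
FM = 22114      # monogram frequency of the preceding morpheme 事務局
FB1 = 1108      # bigram frequency (事務局, で)
FB2 = 872       # bigram frequency (事務局, です)
LA_SS1 = 0.060  # cumulative likelihood in state SS1

lm1 = round(FB1 / FM, 3)      # Equation 1 for candidate で
lm2 = round(FB2 / FM, 3)      # Equation 1 for candidate です
la1 = round(LA_SS1 * lm1, 3)  # Equation 2 for candidate set Cm1
la2 = round(LA_SS1 * lm2, 3)  # Equation 2 for candidate set Cm2
```

These reproduce the quoted values 0.050, 0.039, 0.003, and 0.002. Note that Cm1 has the higher cumulative likelihood at this stage even though it has not yet consumed the whole input.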

[0042] For the candidate set Cm2 of the second combination, no unprocessed character string remains, so morpheme analysis of that candidate set ends.

[0043] For the candidate set of the first combination, the morpheme dictionary search (step S14) on its unprocessed character string S yields the following morpheme candidate.

<Morpheme candidate 3> す (Lm3 = 0.000)

The value in parentheses is the connection likelihood with the preceding morpheme "で". Here the preceding morpheme "で" is an adjunct word, and no frequency data existed for its connection with a following main verb in the terminal form, so the connection likelihood is 0.000. The unprocessed character string S, morpheme string candidate M, and cumulative morpheme connection likelihood La of the resulting candidate set Cm3 are therefore as follows.

<Candidate set Cm3>
(a) Unprocessed character string S = "" → failure
(b) Morpheme string candidate M = (こちらは事務局です), ending in the separate morphemes で and す
(c) Cumulative morpheme connection likelihood La = 0.003 × 0.000 = 0.000

[0044] Morpheme analysis of candidate set Cm3 fails because its cumulative morpheme connection likelihood La has become 0.000. The remaining candidate set is therefore Cm2 above, with the following contents.

(a) Unprocessed character string S = "" → processing finished
(b) Morpheme string candidate M = (こちらは事務局です)
(c) Cumulative morpheme connection likelihood La = 0.002

At this point all processing is complete, and the contents of candidate set Cm2 are obtained as the analysis result. In this example the analysis yields a single morpheme string candidate M, but in general multiple candidates are obtained. In that case, the candidate set with the largest cumulative morpheme connection likelihood can be selected as the best candidate.

[0045] As described above, this embodiment is characterized in that connection frequency data for combinations of word and part of speech are calculated in advance and stored in the frequency data memory 13. In the frequency data, for adjunct words, connection frequency data are calculated in advance and included for combinations of the appearance form, standard form, part of speech, and inflectional form among the morpheme information; for independent words, connection frequency data are calculated in advance and included for combinations of the part of speech and inflectional form only. Therefore, by using bigram frequency data that take into account the appearance form, standard form, part of speech, and inflectional form for connections involving adjunct words, a stricter connection check can be performed than with bigram frequency data that consider only the part of speech. This has the particular effect of improving the accuracy of connection likelihood analysis for adjunct words. Because independent words are evaluated only by part of speech and inflectional form, the tolerance of the morpheme analysis for them is correspondingly wide. This gives the flexibility to handle independent words newly registered in the morpheme dictionary without updating the connection data.
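One way to picture this asymmetric treatment is as a key function for the frequency tables: adjunct words are keyed by all four attributes, independent words by part of speech and inflectional form only. The sketch below is an assumption about how such keys could be built, not the layout used in the source; the type `Morpheme`, the function `frequency_key`, and the part-of-speech and inflection labels are all hypothetical.

```python
from typing import NamedTuple, Tuple


class Morpheme(NamedTuple):
    appearance: str     # appearance (surface) form
    standard: str       # standard form
    pos: str            # part of speech
    inflection: str     # inflectional form
    is_adjunct: bool    # True for adjunct words, False for independent words


def frequency_key(m: Morpheme) -> Tuple[str, ...]:
    """Key under which a morpheme's monogram or bigram counts are stored."""
    if m.is_adjunct:
        return (m.appearance, m.standard, m.pos, m.inflection)
    return (m.pos, m.inflection)


office = Morpheme("事務局", "事務局", "noun", "none", False)
new_word = Morpheme("研究所", "研究所", "noun", "none", False)
copula = Morpheme("です", "です", "auxiliary verb", "terminal", True)
```

Because `office` and `new_word` map to the same key, a newly registered independent word can reuse existing counts, which is the flexibility described above, while `copula` keeps its own word-specific counts.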

[0046] Furthermore, because the frequency data include precalculated connection frequency data for combinations of the appearance form, standard form, part of speech, and inflectional form in the case of adjunct words, and for combinations of the part of speech and inflectional form in the case of independent words, the amount of frequency data, comprising appearance frequencies and connection frequencies, can be reduced compared with a conventional morpheme analysis device that holds word connection frequency data and part-of-speech connection frequency data separately, and the storage capacity required of the frequency data memory 13 can be reduced. This in turn can greatly shorten the processing time for calculating the connection likelihood.

[0047] In the frequency data, preferably, for adjunct words among the morphemes, connection frequency data are included for combinations of at least the part of speech and inflectional form of each morpheme among the morpheme information, while for independent words, connection frequency data are included for combinations of at least the part of speech and inflectional form among the morpheme information.

[0048] In the embodiment above, the frequency data in the frequency data memory 13 consist of monogram frequency data and bigram frequency data. The present invention is not limited to this, however; the frequency data in the frequency data memory 13 may consist of n-gram frequency data, that is, the frequency with which a set of n connected morphemes appears, where n is a natural number. In other words, the frequency data memory 13 may store, in addition to monogram and bigram frequency data, n-gram frequency data for n = 3 or more, and the connection likelihood of the input morpheme string may be calculated and output on the basis of the stored frequency data.

[0049]

[Effects of the Invention] As described in detail above, the present invention provides a morpheme analysis device comprising analysis means that, based on each morpheme of the morpheme string of an input natural language sentence consisting of a plurality of character strings, calculates and outputs the connection likelihood of the input morpheme string using a morpheme dictionary, which is stored in a first storage device and contains morpheme information indicating the part of speech and inflectional form of each morpheme, and frequency data, which are stored in a second storage device and contain the connection frequency between a first morpheme among the morphemes and a second morpheme following it. The frequency data consist of n-gram frequency data, that is, the frequency with which a set of n connected morphemes appears (n being a natural number), and include connection frequency data for combinations of word and part of speech for the first morpheme and the second morpheme. Therefore, compared with a conventional morpheme analysis device that uses bigram connection frequency data considering only the part of speech, a stricter connection check can be performed, which improves the accuracy of connection likelihood analysis. In addition, compared with a conventional morpheme analysis device that holds word connection frequency data and part-of-speech connection frequency data separately, the amount of frequency data can be reduced, so the capacity of the storage device that stores the frequency data can be reduced, and the processing time of morpheme analysis can thereby be greatly shortened.

[Brief description of drawings]

FIG. 1 is a block diagram of a morpheme analysis device according to an embodiment of the present invention.

FIG. 2 is a flowchart of the morpheme analysis process executed by the morpheme analysis unit 10 of the morpheme analysis device of FIG. 1.

FIG. 3 is a flowchart of the subroutine for morpheme dictionary search and likelihood calculation in FIG. 2.

FIG. 4 is a diagram showing the relationship between the candidate sets processed in the morpheme dictionary search and likelihood calculation of FIG. 3 and the working area W of the processing memory 11.

[Explanation of symbols]

10 ... Morpheme analysis unit, 11 ... Processing memory, 12 ... Morpheme dictionary memory, 13 ... Frequency data memory.

Continuation of the front page: (72) Inventor Kura Furuse, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto. (72) Inventor Eiichiro Sumida, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto.

Claims

[Claims]

1. A morpheme analysis device comprising analysis means for calculating and outputting, based on each morpheme of a morpheme string of an input natural language sentence consisting of a plurality of character strings, a connection likelihood of the input morpheme string, using a morpheme dictionary that is stored in a first storage device and includes morpheme information indicating a part of speech and an inflectional form of each morpheme, and frequency data that are stored in a second storage device and include a connection frequency between a first morpheme among the morphemes and a second morpheme following it, wherein the frequency data consist of n-gram frequency data, which is the frequency with which a set of n connected morphemes appears, n being a natural number, and include connection frequency data on combinations of a word and a part of speech for the first morpheme and the second morpheme.

2. The morphological analyzer according to claim 1, wherein the frequency data comprises monogram frequency data and bigram frequency data.

3. The morpheme analysis device according to claim 1 or 2, wherein, in the frequency data, for adjunct words among the morphemes, connection frequency data are included on combinations of at least the part of speech and the inflectional form of each morpheme among the morpheme information, while for independent words, connection frequency data are included on combinations of at least the part of speech and the inflectional form among the morpheme information.

4. The morpheme analysis device according to claim 1, 2, or 3, wherein the analysis means comprises: first control means for reading morpheme information on each morpheme from the first storage device, using the morpheme dictionary, based on each morpheme of the input morpheme string; and second control means for calculating, using the frequency data and based on the morpheme information on each morpheme read by the first control means, a connection likelihood for each of the read morphemes, and then calculating and outputting the connection likelihood of the input morpheme string.