JPH11167574A

JPH11167574A - Natural language processor

Info

Publication number: JPH11167574A
Application number: JP9333988A
Authority: JP
Inventors: Toshiyuki Sugio; 俊之杉尾
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-12-04
Filing date: 1997-12-04
Publication date: 1999-06-22
Anticipated expiration: 2017-12-04
Also published as: JP3938234B2

Abstract

PROBLEM TO BE SOLVED: To provide a natural language processor that can properly process an input text even when it contains an unknown word. SOLUTION: The processor adds extended information, including at least word delimiter information, to each character of an input text to form extended characters, and uses these extended characters to generate the extended character strings for the character string of the input text. It determines a chain probability for each extended character string from the paths of all partial extended character strings covering the input text from beginning to end and from the partial chain probabilities of partial extended character strings prepared in advance, and it selects the extended character string that gives the optimal chain probability. If a partial extended character string of an extended character string has not been prepared in advance, an extended character estimation unit 5 estimates its partial chain probability from the partial chain probabilities of other partial extended character strings that share some of its extended characters.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a natural language processing apparatus. In particular, it can be applied to an apparatus that uses a probabilistic method to process digitized natural language text (for example, by morphological analysis) without using a dictionary.

[0002]

[Prior Art] With the growing opportunities to create text on word processors and the spread of Internet-capable devices, large amounts of digitized natural language text have become readily available. For the various application systems that apply natural language processing to large amounts of text, such as character recognition, machine translation, information retrieval, and information extraction systems, morphological analysis is an extremely important process. It is performed in common before each application carries out its own specialized processing, and it determines the semantic units in a sentence, such as words and phrases, that is, the morphemes.

[0003] If the morphological analysis positioned at the first stage of these application systems makes an error, the error propagates to the subsequent recognition, translation, retrieval, or extraction processing and greatly affects its accuracy. In general, the subsequent processing assumes that morphological analysis was performed correctly, so repairing such errors is very difficult. Moreover, even if repair were possible, the repair processing would become complicated, and a large amount of natural language text could no longer be processed within the expected time.

[0004] Thus, morphological analysis requires both high accuracy in word segmentation (morpheme segmentation) and a processing speed high enough to handle large amounts of natural language text.

[0005] For languages such as English, in which words are separated by word delimiters (spaces) and it suffices to assign tags such as parts of speech to the words, morphological analysis methods are well established: a probability model of parts of speech and of tag sequences (their arrangement) is estimated from a large amount of text, and example-based error correction is added.

[0006] For languages such as Japanese, in which words are not separated by spaces, several approaches have been proposed that adapt the probability-model methods developed for English. One example of morphological analysis using a probability model is the method disclosed in the following document.

[0007] Reference: Mikio Yamamoto and Masakazu Masuyama, "Japanese Morphological Analysis Using Chain Probabilities of Extended Characters Including Part-of-Speech and Delimiter Information," Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing, March 1997.

This document describes the following problems, which arise when a morphological analysis method using a probability model is applied to Japanese, a language with no explicit word delimiters.

[0008] (Problem A) In English, unknown words do not affect word segmentation. In Japanese they do, so their impact on accuracy is more serious.

[0009] (Problem B) In Japanese, word boundaries are ambiguous, so an English probability model, which presupposes a fixed number of word segments, cannot be applied as is.

[0010] To address these problems, the above document proposes a character-based morphological analysis method. About 3,000 distinct characters are in common use in Japanese and the average word length is only about two characters, so a single Japanese character carries information close to that of a word. Based on this property, the document proposes a chain probability model over extended characters, in which morphological analysis information is attached to each Japanese character. Because the method is character-based, it needs no word dictionary in which multi-character strings are registered as words. Without a word dictionary, the very concept of an unknown word disappears, which solves (Problem A). In addition, the length of a character is always 1, so the number of characters per sentence, which corresponds to the number of word segments in English, is fixed for a given sentence. An English probability model can therefore be applied, which solves (Problem B).

[0011] In the method disclosed in the above document, morphological analysis is performed on a character basis: when a natural language text is given as an input sentence, the most probable word sequence is output from among all combinations of whether or not a word boundary falls immediately after each character. To realize this, the method uses a chain probability model of extended character strings built from the extended character e_i defined by equation (1) and the chain probability p(W, T) of extended characters (hereinafter also called the partial chain probability) defined by equation (2). Unlike an ordinary character such as 「私」 or 「は」, an extended character e_i is a character to which extended information, including at least word delimiter (morpheme delimiter) information, has been added.

[0012]

[Equation 1]

Here, c_i is the character at position i of the input character string (input text string), and d_i is the delimiter information after (or before) the character c_i.
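
The image for equation (1) is not reproduced in this text. Judging from the definitions of c_i and d_i given here and from the <c_i, d_i> notation that the description introduces later for the extended character table, it presumably has the following form. This is a reconstruction under that assumption, not the original figure:

```latex
e_i = \langle c_i,\ d_i \rangle
```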

[0013]

[Equation 2]

Here, n is the length of the input character string; N is the N of N-gram, that is, the length of the character tuple referred to in finding the optimal solution (the number of characters in the tuple); and e_i is the extended character determined from the information of the morpheme sequence W and the tags T.
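
The image for equation (2) is likewise missing. The description of step S603 states that the chain probability of an extended character string is the product of its partial chain probabilities, and the extended character table is described as holding p(e_i | e_{i-1}, ..., e_{i-N+1}). A form consistent with both statements is sketched below. The index range is an assumption: it is chosen so that the product runs over the N-1 dummy extended characters appended after position n, as in the score table layout, and may differ from the original figure:

```latex
p(W, T) \;=\; \prod_{i=1}^{n+N-1} p\left(e_i \mid e_{i-1},\, e_{i-2},\, \ldots,\, e_{i-N+1}\right)
```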

[0014] An apparatus that implements the morphological analysis method described in the above document is as follows (see, for example, the specification and drawings of Japanese Patent Application No. Hei 9-68300).

[0015] That is, the apparatus comprises the following components: (a) an extended character string generation unit that reads a text as an input sentence, forms extended characters by adding extended information including at least word delimiter information to each character of the input character string of the input sentence, and uses the extended characters to generate all extended character strings for the input character string as candidates; (b) a chain probability calculation unit that calculates the chain probabilities of all generated extended character strings as candidates; and (c) an optimal path search unit that finds the maximum chain probability among the candidates, selects the extended character string giving this maximum as the optimal extended character string, and outputs an analysis result including the word sequence corresponding to the optimal extended character string as the morphological analysis result.

[0016] These components perform morphological analysis through the following operations.

[0017] (S1) The extended character string generation unit reads a text as an input sentence, forms extended characters by adding extended information including at least word delimiter information to each character of the input character string, and uses the extended characters to generate, as candidates, the paths of all extended character strings from the beginning to the end of the input sentence, storing them in a score table.

[0018] (S2) Next, the chain probability calculation unit calculates the chain probability p(W, T) of the extended character string corresponding to each path, based on the partial chain probabilities of the fixed-length partial extended character strings stored in an extended character table created in advance by training (learning), and stores the result in the score table.

[0019] (S3) The optimal path search unit then refers to the candidate chain probabilities in the score table, finds the maximum chain probability among them, selects the extended character string giving this maximum as the optimal extended character string, and outputs an analysis result including the word sequence corresponding to the optimal extended character string as the morphological analysis result.
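
Steps (S1) to (S3) can be sketched as a brute-force search. This is a minimal illustration rather than the implementation of the cited application: the function names, the toy table with made-up probabilities, and the `floor` value used for N-grams missing from the table are all assumptions introduced here.

```python
from itertools import product
from math import prod

PAD = ("#", 1)  # dummy boundary extended character <#, 1>


def paths(text):
    """S1: every assignment of d_i in {0, 1} to the characters of text."""
    for ds in product((0, 1), repeat=len(text)):
        yield tuple(zip(text, ds))


def chain_probability(path, table, n=3, floor=1e-6):
    """S2: product of the N-gram partial chain probabilities along one path.

    `floor` is a placeholder for N-grams missing from the table (an assumption).
    """
    padded = (PAD,) * (n - 1) + path + (PAD,) * (n - 1)
    grams = (padded[i:i + n] for i in range(len(padded) - n + 1))
    return prod(table.get(g, floor) for g in grams)


def best_path(text, table, n=3):
    """S3: the path with the maximum chain probability."""
    return max(paths(text), key=lambda p: chain_probability(p, table, n))


# Toy table with made-up probabilities, for illustration only.
S, K, C = ("南", 0), ("京", 1), ("市", 1)
toy_table = {
    (PAD, PAD, S): 0.5, (PAD, S, K): 0.5, (S, K, C): 0.5,
    (K, C, PAD): 0.5, (C, PAD, PAD): 0.5,
}
print(best_path("南京市", toy_table))
```

Enumerating every path costs 2^n evaluations for an n-character input, so the sketch is only practical for very short texts.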

[0020]

[Problems to Be Solved by the Invention] Because of the performance limits of input devices such as character recognition devices, it is common in ordinary use for a character in the input text to be replaced by another, unexpected character (hereinafter called garbling). Likewise, even when the input text is already digitized, operator input errors during digitization often introduce unexpected spellings (hereinafter called mistypes) into the input text.

[0021] In general, a character string portion of the input text that contains this kind of erroneous character is treated as an unknown word. However, such unknown words, which cannot exist in the natural language in the first place, should be distinguished from words that do exist in the natural language but are simply not known to the morphological analyzer.

[0022] However, when garbling or a mistype causes a character string that was not originally an unknown word to be treated as one, a conventional morphological analyzer merely lumps it into the single category of unknown words. It has neither the concept of restoring the unknown word portion to the correct word nor any means of doing so. In other words, conventional morphological analysis methods and apparatuses do not anticipate the analysis of text whose quality falls below a prescribed level because of the performance limits of the input device or deficiencies in the input means, that is, text containing words that cannot exist in the natural language. Consequently, in such cases the unknown word portion cannot be restored to the correct word, and the desired morphological analysis result cannot be obtained.

[0023] There is therefore a need for a natural language processing apparatus that can detect unknown word portions caused by garbling, mistypes, and the like and restore them to the correct character string (in other words, estimate the correct character string), and for a natural language processing apparatus that can carry out the prescribed natural language processing even when the correct character string cannot be estimated.

[0024]

[Means for Solving the Problems] The natural language processing apparatus of the first invention comprises: (1) an extended character string generation unit that forms extended characters by adding extended information including at least word delimiter information to each character of the character string of a read input text, and uses the extended characters to generate all combinations of extended character strings for the character string of the input text; (2) an extended character storage unit that stores partial extended character strings of a fixed number of characters together with partial chain probability information for each of them; (3) a chain probability calculation unit that obtains chain probability information for each of the extended character strings generated by the extended character string generation unit, based on the paths of all partial extended character strings from the beginning to the end of the input text and on the partial chain probabilities stored in the extended character storage unit; (4) a score storage unit that stores the obtained chain probability information; (5) an optimal path search unit that selects, from the obtained chain probability information, the extended character string giving the optimal chain probability, and outputs an analysis result including the word sequence corresponding to that extended character string as the morphological analysis result; and (6) an extended character estimation unit that, when a partial extended character string of an extended character string generated by the extended character string generation unit does not exist in the extended character storage unit, estimates the partial chain probability information of that partial extended character string from the partial chain probability information of other partial extended character strings, stored in the extended character storage unit, that share extended characters with part of it.
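
The role of the extended character estimation unit in item (6) can be illustrated with a small sketch. This passage does not give the estimation formula, so the rule used here is purely an assumption for illustration: average the probabilities of stored N-grams that agree with the missing one in at least N-1 positions. The function name, the `min_shared` parameter, and all table values other than 0.12 are hypothetical.

```python
from statistics import mean


def estimate_partial_probability(gram, table, min_shared=None):
    """Estimate p for an N-gram missing from `table` from stored N-grams
    that share extended characters with it at the same positions.

    Averaging is an assumed rule, not the patent's stated formula.
    """
    n = len(gram)
    need = n - 1 if min_shared is None else min_shared
    shared = [
        p for other, p in table.items()
        if len(other) == n
        and sum(a == b for a, b in zip(gram, other)) >= need
    ]
    return mean(shared) if shared else None


table = {
    (("東", 0), ("京", 1), ("都", 0)): 0.12,
    (("東", 0), ("京", 1), ("市", 0)): 0.08,  # made-up value
    (("南", 0), ("京", 1), ("市", 1)): 0.30,  # made-up value
}
unseen = (("東", 0), ("京", 1), ("■", 0))  # contains a garbled character
print(estimate_partial_probability(unseen, table))
```

Only the first two records share two positions with the unseen N-gram, so only they contribute to the estimate. When no stored N-gram is close enough, the function returns None and the caller must fall back to some other policy.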

[0025] The natural language processing apparatus of the second invention comprises: (1) a partial character string storage unit that stores partial character strings of a fixed number of characters; (2) a non-target character pattern storage unit that stores patterns of non-target characters registered in advance as constituting unknown words; and (3) an unknown word detection unit that detects unknown word portions in the read input text based on the contents of the non-target character pattern storage unit, and estimates the character string considered correct for each unknown word portion based on the contents of the partial character string storage unit.
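
One way to picture the three units of the second invention is the sketch below. Everything in it is an assumption made for illustration: the non-target patterns are modeled as a set of single characters, the stored partial character strings as a set of 3-character strings, and the matching rule (compare a window with each stored string while ignoring the suspect position) is not taken from the source.

```python
NON_TARGET = {"■", "〓"}                   # hypothetical non-target characters
STORED = {"東京都", "南京市", "京都市"}    # hypothetical stored 3-character strings


def detect_unknown(text, non_target=NON_TARGET):
    """Positions of characters registered as non-target patterns."""
    return [i for i, ch in enumerate(text) if ch in non_target]


def candidates(text, i, stored=STORED, n=3):
    """Characters that stored n-character strings suggest for position i."""
    found = set()
    # Slide every n-character window that covers position i.
    for start in range(max(0, i - n + 1), min(i, len(text) - n) + 1):
        window = text[start:start + n]
        k = i - start  # index of the suspect character inside the window
        for s in stored:
            if all(a == b for j, (a, b) in enumerate(zip(window, s)) if j != k):
                found.add(s[k])
    return sorted(found)


text = "東■都"
for pos in detect_unknown(text):
    print(pos, candidates(text, pos))
```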

[0026]

[Embodiments of the Invention] (A) First Embodiment

A first embodiment, in which the natural language processing apparatus according to the present invention is applied to a morphological analyzer, is described in detail below with reference to the drawings.

[0027] (A-1) Configuration of the First Embodiment

FIG. 1 is a functional block diagram showing the configuration of the morphological analyzer of the first embodiment. In practice, the morphological analyzer of the first embodiment is realized on an information processing apparatus such as a workstation or personal computer that has input/output devices, a processing unit, storage devices (and communication devices), and so on. Functionally, it has the configuration shown in FIG. 1.

[0028] In FIG. 1, the morphological analyzer of the first embodiment has an input device 1, an extended character table 2, a score table 3, an extended character string generation unit 4, an extended character estimation unit 5, a chain probability calculation unit 6, an optimal path search unit 7, and an output device 8.

[0029] The input device 1 may be any means for inputting natural language text to the morphological analyzer, such as a keyboard, a mouse, an OCR (optical character recognition) device, or a speech recognition device, or it may be a means for receiving communication signals from outside via a communication medium such as a network.

[0030] The extended character table 2 is a storage device that stores extended character strings and their chain probabilities (partial chain probabilities). It is created in advance by learning from training text (a corpus).

[0031] The score table 3 is a storage device that stores the chain probability p(W, T) of the extended character string corresponding to each path. This probability is obtained from the paths of all extended character strings (N-grams) from the beginning to the end of the input text and from the partial chain probabilities stored in the extended character table 2.

[0032] The extended character string generation unit 4 refers to the extended character table 2 created in advance by training, generates the extended character strings of the input text, and stores the paths of those extended character strings.

[0033] The extended character estimation unit 5 operates while the extended character string generation unit 4 creates extended character string paths from the input text: if the input text contains a character that is not stored in the extended character table 2, it estimates the partial chain probability of the partial extended character string containing that unknown character.

[0034] The chain probability calculation unit 6 calculates the chain probability for each extended character string path stored in the score table 3, based on the partial chain probabilities stored in the extended character table 2.

[0035] The optimal path search unit 7 selects, as the optimal extended character string, the extended character string that satisfies an optimality condition (for example, having the maximum chain probability) among the chain probabilities calculated by the chain probability calculation unit 6.

[0036] The output device 8, as in an ordinary information processing apparatus, outputs the morphological analysis result obtained by the morphological analyzer to various external display means, communication means, and the like.

[0037] FIG. 2 is an explanatory diagram showing a configuration example of the extended character table 2. In FIG. 2, the extended character table 2 lists, in order, pairs of the character c_i and the extended information (here, delimiter information) d_i corresponding to the extended character e_i defined by equation (1). Specifically, each of the extended characters e_{i-N+1}, e_{i-N+2}, ..., e_i that make up one record of the extended character table 2 is written as a pair of the corresponding character c_{i-N+1}, c_{i-N+2}, ..., c_i and the extended information d_{i-N+1}, d_{i-N+2}, ..., d_i. The right-hand column of the record holds the partial chain probability p(e_i | e_{i-1}, e_{i-2}, ..., e_{i-N+1}) corresponding to this extended character string. Hereinafter, the extended character e_i is written <c_i, d_i>. The delimiter information, which is the extended information, takes one of two values: d_i = 1 if a morpheme boundary falls immediately after character position i, and d_i = 0 if it does not.

[0038] FIG. 3 shows a specific example of the extended character table 2. In this example the tuple length N is 3; that is, each record stores the partial chain probability p(e_i | e_{i-1}, e_{i-2}) corresponding to the partial extended character string e_{i-2}, e_{i-1}, e_i, which is an N-gram with N = 3.

[0039] For example, record L309 indicates that the partial chain probability corresponding to the partial extended character string consisting of <東, 0>, <京, 1>, <都, 0> is 0.12. The notation <#, 1>, which appears in record L301 and elsewhere, is a special (dummy) extended character inserted for convenience so that partial chain probabilities at the beginning or end of the input text can be calculated in the same way as for other partial extended character strings.
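
The table layout described for FIG. 2 and FIG. 3 maps naturally onto a dictionary keyed by N-grams of <c_i, d_i> pairs. The sketch below is one possible representation, not the patent's storage format. The type aliases, the `partial_probability` helper, and the boundary record with its value are assumptions; only the 0.12 record comes from the text.

```python
from typing import Dict, Optional, Sequence, Tuple

ExtChar = Tuple[str, int]        # extended character <c_i, d_i>
NGram = Tuple[ExtChar, ...]      # e_{i-N+1}, ..., e_i

BOUNDARY: ExtChar = ("#", 1)     # dummy extended character <#, 1>

table: Dict[NGram, float] = {
    (("東", 0), ("京", 1), ("都", 0)): 0.12,  # record L309 of FIG. 3
    (BOUNDARY, BOUNDARY, ("東", 0)): 0.05,    # hypothetical record, made-up value
}


def partial_probability(
    context: Sequence[ExtChar], target: ExtChar, table: Dict[NGram, float]
) -> Optional[float]:
    """Return p(target | context), or None when no record exists."""
    return table.get(tuple(context) + (target,))


print(partial_probability([("東", 0), ("京", 1)], ("都", 0), table))  # 0.12
```

A missing record returns None, which is the situation the extended character estimation unit 5 is meant to handle.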

[0040] FIG. 4 is an explanatory diagram showing a configuration example of the score table 3. In FIG. 4, the extended characters e_{-(N-1)+1}, e_{-(N-1)+2}, ..., e_1, e_2, ..., e_n, ..., e_{n+(N-1)} are each recorded as a pair of the corresponding character c_{-(N-1)+1}, c_{-(N-1)+2}, ..., c_1, c_2, ..., c_n, ..., c_{n+(N-1)} and the extended information (here, delimiter information) d_{-(N-1)+1}, d_{-(N-1)+2}, ..., d_1, d_2, ..., d_n, ..., d_{n+(N-1)}, in order, for every combination of extended character strings (each combination is also called a path). The right-hand column of each record in the score table 3 holds the chain probability p(W, T) for the extended character string path stored in that record.

[0041] FIG. 5 shows a specific example of the score table 3 after the storing process has finished. In this example, the paths and chain probabilities are stored for all extended character strings corresponding to the input text 「南京市」. Since 「南京市」 has three characters and the extended information is binary (split or not split), there are 2^3 = 8 paths.
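
The eight records of the score table for 「南京市」 can be generated as follows. This is a sketch under assumptions: the score table is modeled as a dictionary from path to chain probability, with the probability column left empty (None) until it is computed, and the boundary dummy characters are omitted for brevity. The names `build_score_table` and `show` are hypothetical.

```python
from itertools import product


def build_score_table(text):
    """One record per path; the probability column is filled in later."""
    return {
        tuple(zip(text, ds)): None
        for ds in product((0, 1), repeat=len(text))
    }


def show(path):
    """Render a path in the <c, d> notation used in the description."""
    return " ".join(f"<{c},{d}>" for c, d in path)


score_table = build_score_table("南京市")
for path in score_table:
    print(show(path))
```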

[0042] (A-2) Operation of the First Embodiment

The operation of the morphological analyzer of the first embodiment (the morphological analysis method) is described below with reference to the drawings.

[0043] First, the overall operation of the morphological analyzer of the first embodiment is described with reference to the flowchart shown in FIG. 6.

[0044] When no record of the N-gram partial extended character string corresponding to a partial character string of the input text exists in the extended character table 2, that is, when the partial character string is an unknown word, the partial chain probability of that partial character string cannot be obtained, and the accuracy of morphological analysis drops markedly. The morphological analyzer of the first embodiment resolves this through the following operations, shown in FIG. 6.

[0045] (Step S601) The input text is read into the morphological analyzer via the input device 1.

[0046] (Step S602) Extended characters are generated from each character of the read text, and the extended character string paths running from the beginning to the end of the input text are determined and stored in the score table 3. In step S602, if the input text contains a character string of the fixed length (an N-gram character string) that does not exist in the extended character table 2, the record of the corresponding partial extended character string, that is, the partial extended character string corresponding to the unknown word, is estimated.

[0047] (Step S603) The chain probability is obtained for every generated extended character string path. The chain probability of an extended character string is obtained by looking up, in the extended character table 2, the partial chain probability of each partial extended character string that makes up the extended character string, and taking the product of those partial chain probabilities (see equation (2) above). The chain probability obtained is stored in the score table 3 in the record of the corresponding extended character string path.
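
The product in step S603 multiplies many values below 1, so for long inputs a floating-point product can underflow to zero. A common numerical workaround, which is not stated in the source and is offered here only as an implementation option, is to compare sums of logarithms instead; the ranking of paths is unchanged because the logarithm is monotonic. The function name is hypothetical.

```python
from math import exp, log


def log_chain_probability(partials):
    """Sum of the logs of the partial chain probabilities.

    All values must be strictly positive; a zero probability would need
    separate handling (for example, smoothing), which is not shown here.
    """
    return sum(log(p) for p in partials)


a = log_chain_probability([0.5, 0.25, 0.12])
b = log_chain_probability([0.5, 0.25, 0.08])
print(a > b)  # the path with the larger product also has the larger log sum
```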

[0048] (Step S604) Referring to the score table 3, the extended character string that satisfies an optimum condition (for example, giving the maximum chain probability) is selected from among the obtained chain probabilities as the optimum extended character string.

[0049] (Step S605) A morphological analysis result containing the word sequence determined by the optimum extended character string is output via the output device 8.

[0050] FIG. 7 is a flowchart explaining in detail the extended character string generation operation of step S602 described above.

[0051] Text is input to the extended character string generation unit 4 via the input device 1, and extended characters are formed by adding extension information (for example, delimiter information) to each character of the input text (step S701). For example, for the three-character input text "南京市" (Nanjing City), six extended characters are generated: <南, 0>, <南, 1>, <京, 0>, <京, 1>, <市, 0>, and <市, 1>.
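As an illustration of step S701, the following minimal Python sketch pairs each input character with every delimiter value. The function name `make_extended_chars` and the tuple representation `(character, delimiter)` are assumptions made for this example and do not appear in the source.

```python
def make_extended_chars(text, delimiter_values=(0, 1)):
    """Pair each character of `text` with every possible delimiter flag.

    An extended character is modelled here as a (character, delimiter)
    tuple; this representation is an assumption of the sketch.
    """
    return [(ch, d) for ch in text for d in delimiter_values]


extended = make_extended_chars("南京市")
for e in extended:
    print(e)
```

For a three-character text and binary delimiter information this yields six extended characters, matching the count given in the paragraph above.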

[0052] Next, one path of an extended character string corresponding to the input text is created from the generated extended characters and stored in the score table 3 (step S702). At this point, the chain probability for the whole path is not yet stored (see FIG. 8). Because step S702 is repeated as described later, all paths determined by the combinations of extended character strings have been stored in the score table 3 by the time the series of processes shown in FIG. 7 ends. In the first embodiment, binary delimiter information is added as the extension information, so 2^M paths are created from an input text of M characters. Taking into account the special extended characters at the beginning and end of the text, one extended character string consists of M + 2 * (N - 1) characters.

[0053] For example, if the N-gram is a 3-gram, then for the three-character input text "南京市", eight extended character strings of seven extended characters each are generated, as shown in FIG. 5: <#, 1>-<#, 1>-<南, 0>-<京, 0>-<市, 0>-<#, 1>-<#, 1>; <#, 1>-<#, 1>-<南, 0>-<京, 0>-<市, 1>-<#, 1>-<#, 1>; ...; <#, 1>-<#, 1>-<南, 1>-<京, 1>-<市, 1>-<#, 1>-<#, 1>.
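The path enumeration described above can be sketched as follows. This is only an illustrative sketch: the names `PAD` and `enumerate_paths` are hypothetical, and the brute-force enumeration simply mirrors the text's statement that 2^M paths of M + 2 * (N - 1) extended characters are created.

```python
from itertools import product

PAD = ("#", 1)  # special extended character for the start and end of the text


def enumerate_paths(text, n=3):
    """Return every extended character string (path) for `text`.

    Each path carries n - 1 special extended characters on both sides,
    so its length is len(text) + 2 * (n - 1).
    """
    pad = [PAD] * (n - 1)
    paths = []
    for flags in product((0, 1), repeat=len(text)):
        body = list(zip(text, flags))  # one delimiter flag per character
        paths.append(pad + body + pad)
    return paths


paths = enumerate_paths("南京市", n=3)
print(len(paths), len(paths[0]))
```

With M = 3 and N = 3 the sketch produces eight paths of seven extended characters each. The number of paths doubles with every additional input character, so this exhaustive form is practical only for short inputs.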

[0054] When storage of a path in the score table 3 is complete, the extended character estimation unit 5 initializes the counter i representing the character position (i = 1) (step S703).

[0055] Subsequently, the extended character estimation unit 5 extracts, from the extended character string (path) created by the extended character string generation unit 4 in the immediately preceding step S702, the N stored characters at the position determined by the counter i, that is, the N-gram extended character string e_{i-(N-1)}, ..., e_i (step S704). This embodiment assumes N = 3, so the extracted N-gram extended character string is e_{i-2}, e_{i-1}, e_i. For example, for the extended character string (path) <#, 1>-<#, 1>-<南, 0>-<京, 0>-<市, 0>-<#, 1>-<#, 1>, when i = 1, e_{-1}, e_0, e_1, that is, <#, 1>, <#, 1>, <南, 0>, is extracted as the N-gram extended character string.
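The indexing in step S704 can be made concrete with a short sketch. The helper name `ngram_at` is hypothetical, and the path is assumed to be stored as a Python list that already includes the N - 1 special extended characters on each side.

```python
def ngram_at(path, i, n=3):
    """Return (e_{i-(n-1)}, ..., e_i) for a 1-based counter i.

    Because the list starts with n - 1 padding entries, e_{i-(n-1)}
    sits at list index i - 1.
    """
    start = i - 1
    return tuple(path[start:start + n])


path = [("#", 1), ("#", 1), ("南", 0), ("京", 0), ("市", 0), ("#", 1), ("#", 1)]
M, N = 3, 3
# The counter runs from 1 to M + N - 1, giving one window per position.
windows = [ngram_at(path, i, N) for i in range(1, M + N)]
for w in windows:
    print(w)
```

For i = 1 the window is <#, 1>, <#, 1>, <南, 0>, and the loop visits M + N - 1 = 5 windows in total.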

[0056] Next, it is checked whether the pattern of the extracted N-gram extended character string exists in the extended character table 2 (step S705).

[0057] If it does not exist, the chain probability of the N-gram extended character string is estimated (step S706). The estimation is performed, for example, by regarding any extended character of the N-gram extended character string that does not exist in the extended character table 2 (excluding special extended characters) as a general-purpose character that matches every character, and taking the average of the chain probabilities of the matching records in the extended character table 2 as the chain probability of the N-gram extended character string.

[0058] Next, the extended character estimation unit 5 adds the N-gram extended character string, together with its estimated partial chain probability, to the extended character table 2 (step S707).

[0059] If the pattern of the extracted N-gram extended character string exists in the extended character table 2 (negative result in step S705), or once the N-gram extended character string has been added to the extended character table 2 (step S707), the counter i representing the character position is incremented by 1 (step S708). The value of the counter i is then compared with the value M + N - 1 (M is the number of characters in the input text, and N is the number of characters in an N-gram character string) to determine whether the check against the extended character table 2 has been completed for all N-gram extended character string portions of the extended character string (path) currently being processed (step S709).

[0060] If the check against the extended character table 2 has not been completed for all N-gram extended character string portions of the extended character string (path) currently being processed, the process returns to step S704 described above.

[0061] On the other hand, when all the partial extended character strings forming the path of one extended character string have been processed (affirmative result in step S709), the extended character estimation unit 5 completes its operation. The extended character string generation unit 4 then checks whether any unprocessed extended character string paths remain (step S710); if so, steps S702 to S710 described above are repeated. When all paths have been processed, the extended character string generation unit 4 ends its operation.

[0062] The operations of steps S704 to S707 described above are now explained concretely using an example. Here, the initial state of the extended character table 2, which stores N-gram extended character strings and their chain probabilities, is assumed to be the state in which records L301 to L334 of FIG. 3 are stored. The input text is assumed to be "南京市".

[0063] Consider one of the extended character strings (paths) for the input text "南京市": <#, 1>-<#, 1>-<南, 0>-<京, 0>-<市, 0>-<#, 1>-<#, 1>. When i = 1, <#, 1>, <#, 1>, <南, 0> is extracted as the N-gram extended character string e_{-1}, e_0, e_1 (step S704). This pattern does not exist among records L301 to L334 of the extended character table 2 shown in FIG. 3 (step S705). The extended character <南, 0>, which does not exist in the extended character table 2, is therefore regarded as a general-purpose character (N-gram extended character strings having the first two extended characters of this pattern do exist in the extended character table 2), and matching records are searched for in the extended character table 2. This retrieves <#, 1>, <#, 1>, <東, 0> of record L301 and <#, 1>, <#, 1>, <北, 0> of record L321. As a result, the average of the chain probabilities of records L301 and L321, (0.06 + 0.06) / 2 = 0.06, is estimated as the chain probability of the N-gram extended character string <#, 1>, <#, 1>, <南, 0> (step S706). The N-gram extended character string <#, 1>, <#, 1>, <南, 0> and its chain probability 0.06 are then added to the extended character table 2 (step S707). Through this operation, record L351 of FIG. 3 is added to the extended character table 2.
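The worked example of steps S705 to S707 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the table is modelled as a dictionary keyed by tuples of extended characters, the function name `estimate_chain_probability` is hypothetical, and the wildcard is assumed to replace only the character while keeping the delimiter flag. The third table record is an invented entry added only to show a non-matching case.

```python
from statistics import mean

SPECIAL = ("#", 1)  # special extended character marking text start/end


def estimate_chain_probability(ngram, table):
    """Average the probabilities of records matching `ngram` once its
    unseen extended characters are treated as general-purpose characters."""
    known = {e for record in table for e in record}
    wild = {k for k, e in enumerate(ngram) if e != SPECIAL and e not in known}

    def matches(record):
        for k, (r, e) in enumerate(zip(record, ngram)):
            if k in wild:
                # Assumption: the wildcard stands for the character only;
                # the delimiter flag must agree and specials never match.
                if r == SPECIAL or r[1] != e[1]:
                    return False
            elif r != e:
                return False
        return True

    hits = [p for record, p in table.items() if matches(record)]
    return mean(hits) if hits else None


table = {
    (SPECIAL, SPECIAL, ("東", 0)): 0.06,    # record L301 in the text
    (SPECIAL, SPECIAL, ("北", 0)): 0.06,    # record L321 in the text
    (SPECIAL, ("東", 0), ("京", 1)): 0.02,  # hypothetical extra record
}
target = (SPECIAL, SPECIAL, ("南", 0))
estimate = estimate_chain_probability(target, table)
if estimate is not None:
    table[target] = estimate  # step S707: store the estimated record
print(estimate)
```

Storing the estimate back into the table corresponds to adding record L351, so a later lookup of the same N-gram extended character string needs no further estimation.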

[0064] Thereafter, steps S704 to S708 are executed repeatedly while the value of the counter i changes, and records L352 to L364 are newly added to the extended character table 2 in the same manner as described above.

[0065] FIG. 8 is a flowchart explaining in detail the operation of calculating the chain probability of an extended character string (path) in step S603 described above.

[0066] The chain probability calculation unit 6 first takes out one extended character string record stored in the score table 3 (step S801). It then initializes the counter i representing the character position (i = 1) (step S802).

[0067] Then, the N-character partial extended character string from e_{i-(N-1)} to e_i determined by the value of the counter i, that is, the N-gram extended character string, is extracted from the record, and the chain probability p(e_i) of the matching record in the extended character table 2 is retrieved (step S803).

[0068] If the character position counter i is 1 (that is, the N-gram extended character string is the one at the beginning of the text) (affirmative result in step S804), the partial chain probability p(e_i) is stored as the chain probability p(W, T) of the extended character string record (step S805). If the N-gram extended character string is not at the beginning of the input text (negative result in step S804), the chain probability p(W, T) of the extended character string record is multiplied by the partial chain probability p(e_i) to give the new chain probability p(W, T) (step S806).

[0069] Next, the character position counter i is incremented by 1 (step S807), and the value of the counter i is compared with the value M + N - 1 (M is the number of characters in the input text, and N is the number of characters in an N-gram character string) to determine whether the retrieval of the partial chain probability p(e_i), and the update of the chain probability p(W, T) reflecting it, have been completed for all N-gram extended character string portions of the extended character string (path) currently being processed (step S808).

[0070] If not completed, the process returns to step S803 described above. If completed, that is, when the partial chain probabilities of all the partial extended character strings making up the path of one extended character string have been processed, the chain probability p(W, T) of the extended character string is stored at the corresponding position in the score table 3 (step S809).

[0071] The calculation of the chain probability p(W, T) (steps S801 to S809) is performed for every record stored in the score table 3. When all records have been processed (step S810), the chain probability calculation unit 6 ends its operation.

[0072] The chain probability calculation is explained below using a concrete example. The input text is assumed to be "南京市". The extended character table 2 is assumed to store the partial extended character strings and partial chain probabilities in the state shown in FIG. 3. The score table 3 is assumed to store the extended character string paths corresponding to "南京市", as shown in FIG. 5. However, the chain probability field of each extended character string record shown in FIG. 5 is assumed to be blank in the initial state.

[0073] First, one record is taken from the score table 3, for example record L501 of FIG. 5. Record L501 is <#, 1>, <#, 1>, <南, 0>, <京, 0>, <市, 0>, <#, 1>, <#, 1>. The chain probability of the N-gram extended character string for i = 1, <#, 1>, <#, 1>, <南, 0>, is first looked up in the extended character table 2. Record L351 of FIG. 3 matches, and 0.06 is obtained as the partial chain probability p(e_1) (step S803). Since i = 1, p(e_i) is stored as the chain probability p(W, T) of the extended character string, giving p(W, T) = 0.06.

[0074] Next, i is incremented by 1 so that i = 2 (step S807). Since i (= 2) < M + N - 1 (= 5) (step S808), the process returns to step S803, and the chain probability of the next N-gram extended character string <#, 1>, <南, 0>, <京, 0> is looked up in the extended character table 2. Record L353 of FIG. 3 matches, and 0.01 is obtained as the partial chain probability p(e_2) (step S803). Since i = 2, the chain probability p(W, T) of the extended character string becomes the previous p(W, T) (= 0.06) multiplied by p(e_2) (= 0.01). That is, the new chain probability is p(W, T) = p(e_1) × p(e_2). The same processing is repeated until i > M + N - 1 (= 5).

[0075] As a result, the product of the five partial chain probabilities for <#, 1>, <#, 1>, <南, 0>; <#, 1>, <南, 0>, <京, 0>; <南, 0>, <京, 0>, <市, 0>; <京, 0>, <市, 0>, <#, 1>; and <市, 0>, <#, 1>, <#, 1> finally becomes the chain probability p(W, T) of the extended character string (record L501). This chain probability is stored in the chain probability field of record L501 in the score table 3 (step S809).
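The accumulation in steps S803 to S809 amounts to multiplying the partial chain probabilities of all N-gram windows of a path. The sketch below illustrates this for record L501. The function name `chain_probability` and the dictionary layout are assumptions; only the values 0.06 and 0.01 come from the text, and the remaining three probabilities are placeholders chosen for the example.

```python
def chain_probability(path, table, n=3):
    """Multiply the partial chain probabilities of every n-gram window."""
    p = None
    for k in range(len(path) - n + 1):  # M + N - 1 windows in total
        partial = table[tuple(path[k:k + n])]
        # First window: store p(e_1) (S805); afterwards: multiply (S806).
        p = partial if k == 0 else p * partial
    return p


S = ("#", 1)
path = [S, S, ("南", 0), ("京", 0), ("市", 0), S, S]  # record L501
table = {
    (S, S, ("南", 0)): 0.06,                 # record L351 in the text
    (S, ("南", 0), ("京", 0)): 0.01,          # record L353 in the text
    (("南", 0), ("京", 0), ("市", 0)): 0.5,   # hypothetical value
    (("京", 0), ("市", 0), S): 0.5,           # hypothetical value
    (("市", 0), S, S): 0.5,                   # hypothetical value
}
p = chain_probability(path, table)
print(p)
```

For long inputs a product of many small probabilities can underflow, so a practical variant might sum logarithms instead; that is a design choice outside what the text describes.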

[0076] The above operation is performed for all records in the score table (step S810).

[0077] FIG. 9 is a flowchart explaining in detail the operation of selecting the optimum extended character string in step S604 described above.

[0078] First, the optimum path search unit 7 determines the selection condition for extended character strings (step S901). Any selection condition can be set, for example selecting the extended character string record with the maximum chain probability in the score table 3. In what follows, for convenience, the selection condition is taken to be selecting the record with the maximum chain probability.

[0079] Next, the score table 3 is referenced and the record with the maximum chain probability is taken out (step S902). For example, in the score table of FIG. 5, record L504, <#, 1>, <#, 1>, <南, 0>, <京, 1>, <市, 1>, <#, 1>, <#, 1>, which has the maximum chain probability (= 0.459 × 10^-3), is taken out.

[0080] Next, the counter i indicating the character position is initialized (step S903), and the character c_i of the extended character e_i = <c_i, d_i> is output (step S904). If the extension information (delimiter information) d_i of the extended character is 1 (step S905), a word delimiter symbol (for example, "/") is then output (step S906); if d_i is 0, the process proceeds directly to step S907.

[0081] Next, the counter i is incremented by 1 (step S907), and its value is compared with the value M + N - 1. If output processing has not been completed for all extended characters, the process returns to step S904 and moves on to the output processing for the next extended character (step S908).

[0082] When output processing has been completed for all extended characters of the extended character string through this repetition, a text delimiter symbol (for example, a line feed code) is output (step S909). As a result, for the extended character string of record L504, for example, "#/#/南京/市/#/#/" is output, which means that "南京" and "市" have been extracted as morphemes.
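The output loop of steps S903 to S909 can be sketched as follows. The function name `render_path` is hypothetical; the sketch walks over every extended character of the path, emits the character, and appends the word delimiter whenever the delimiter information is 1.

```python
def render_path(path, word_sep="/"):
    """Emit each character, followed by `word_sep` when its delimiter flag is 1."""
    out = []
    for ch, d in path:
        out.append(ch)           # step S904: output the character c_i
        if d == 1:
            out.append(word_sep)  # step S906: output the word delimiter
    return "".join(out)


record_l504 = [("#", 1), ("#", 1), ("南", 0), ("京", 1), ("市", 1), ("#", 1), ("#", 1)]
rendered = render_path(record_l504)
print(rendered)  # prints "#/#/南京/市/#/#/"
```

The printed string matches the output quoted for record L504, with "南京" and "市" delimited as separate morphemes.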

[0083] Finally, it is checked whether any extended character strings (paths; records) meeting the selection condition of step S901 remain (step S910). When all records meeting the selection condition have been processed, the series of operations for selecting the optimum extended character string ends.

[0084] (A-3) Effects of the First Embodiment
The morphological analyzer of the first embodiment described above provides the following effects.

[0085] Even if the input text contains an unknown N-gram character string that does not exist in the extended character table, the extended character estimation unit estimates the unknown partial extended character string and its chain probability from the extended character table. A character string that would conventionally have been treated as an unknown word can therefore be estimated without impairing the accuracy of the morphological analysis.

[0086] In addition, even if the input text contains an unknown N-gram character string that does not exist in the extended character table, the extended character estimation unit stores the partial extended character string of the estimated extended character string and its chain probability in the extended character table. From the next morphological analysis onward, the extended characters no longer need to be estimated, so the morphological analysis can be carried out efficiently.

[0087] (A-4) Modifications of the First Embodiment
In the first embodiment, the selection condition for the extended character string selected from the score table 3 was the extended character string with the maximum chain probability. If the selection condition is instead set to extended character strings whose chain probability is at or above an arbitrary threshold, multiple candidate morphological analysis results can be output.
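The two selection conditions, maximum chain probability and threshold, can be contrasted in a small sketch. The score table is modelled as a dictionary from record label to chain probability. The value for L504 is the one quoted in the text; the other values and the threshold are hypothetical, and the function names are assumptions of this example.

```python
scores = {
    "L501": 0.120e-3,  # hypothetical value
    "L502": 0.030e-3,  # hypothetical value
    "L504": 0.459e-3,  # maximum chain probability quoted for FIG. 5
}


def select_max(scores):
    """First embodiment: the single record with the largest chain probability."""
    return max(scores, key=scores.get)


def select_above(scores, threshold):
    """Modification: every record at or above the threshold, best first."""
    hits = [label for label, p in scores.items() if p >= threshold]
    return sorted(hits, key=scores.get, reverse=True)


best = select_max(scores)
candidates = select_above(scores, threshold=0.1e-3)
print(best, candidates)
```

With the threshold chosen here, two candidate analyses are returned instead of one, which is the behavior the modification describes.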

[0088] In the first embodiment, when the pattern of the extracted N-gram extended character string does not exist in the extended character table 2, the extended character estimation unit 5 regards one of its extended characters (excluding special extended characters) as a general-purpose character that matches every character, retrieves the matching records from the extended character table 2, and uses the average (arithmetic mean) of their chain probabilities as the chain probability of the N-gram extended character string. A geometric mean may be used instead.
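To show how the choice of mean affects the estimate, the following sketch computes both the arithmetic and the geometric mean of the chain probabilities of the matched records using Python's standard `statistics` module. The two probabilities are hypothetical values chosen for illustration.

```python
from statistics import geometric_mean, mean

# Chain probabilities of the records matched by the general-purpose
# character (hypothetical values for this sketch).
matched = [0.02, 0.08]

arithmetic = mean(matched)           # (0.02 + 0.08) / 2
geometric = geometric_mean(matched)  # sqrt(0.02 * 0.08)
print(arithmetic, geometric)
```

The geometric mean is never larger than the arithmetic mean, so it gives a more conservative estimate when the matched probabilities differ widely.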

[0089] Furthermore, when N of the N-gram extended character string is large, additional processing may be performed, such as taking a weighted average of the average obtained by regarding one extended character (excluding special extended characters) as a general-purpose character that matches every character and the average obtained by regarding two extended characters (excluding special extended characters) as such general-purpose characters.

[0090] Furthermore, when the pattern of the extracted N-gram extended character string does not exist in the extended character table 2, other chain probabilities may be corrected according to the chain probability estimated for that N-gram extended character string. For example, when N-gram extended character strings are stored in the extended character table 2, the table is generally arranged so that the chain probabilities of all N-gram extended character strings sharing the same leading N - 1 extended characters sum to 1. The chain probabilities of the other N-gram extended character strings may be corrected, according to the chain probability estimated for the new N-gram extended character string, so that this condition remains satisfied.
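The text does not specify how the other chain probabilities are corrected. One possible scheme, shown purely as an assumption, is to scale the existing records that share the same leading N - 1 extended characters so that, together with the new record, they again sum to 1. The function name `add_and_renormalize` and all probability values below are hypothetical.

```python
def add_and_renormalize(table, new_gram, p_new):
    """Add `new_gram` with probability `p_new` and rescale its siblings.

    Siblings are records sharing the same leading N - 1 extended
    characters; they are scaled so the whole group sums to 1.
    """
    prefix = new_gram[:-1]
    siblings = [g for g in table if g[:-1] == prefix and g != new_gram]
    total = sum(table[g] for g in siblings)
    updated = dict(table)
    if total > 0:
        scale = (1.0 - p_new) / total
        for g in siblings:
            updated[g] = table[g] * scale
    updated[new_gram] = p_new
    return updated


S = ("#", 1)
table = {
    (S, S, ("東", 0)): 0.6,        # hypothetical value
    (S, S, ("北", 0)): 0.4,        # hypothetical value
    (S, ("東", 0), ("京", 1)): 1.0,  # different prefix, left untouched
}
updated = add_and_renormalize(table, (S, S, ("南", 0)), 0.2)
group_sum = sum(p for g, p in updated.items() if g[:-1] == (S, S))
print(group_sum)
```

Proportional scaling preserves the relative ranking of the existing records; other correction schemes would also satisfy the sum-to-1 condition.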

[0091] Incidentally, a method has also been proposed in which the extended character table 2 stores not only information on N-gram extended character strings but also information on (N-X)-gram and X-gram extended character strings, and in which, when the pattern of an N-gram extended character string extracted from an extended character string (a path) does not exist in the extended character table 2, the chain probability of that N-gram extended character string is obtained from the chain probabilities of the (N-X)-gram and X-gram extended character strings into which it is divided.

[0092] In the above embodiment, for example when only a small number of records are available for averaging, the process may switch to the method described above of obtaining the chain probability of an N-gram extended character string from the chain probabilities of the (N-X)-gram and X-gram extended character strings into which it is divided.

[0093] Note that the method of obtaining the chain probability of an N-gram extended character string from the chain probabilities of the (N-X)-gram and X-gram extended character strings into which it is divided requires, in practice, an enormous memory capacity and cannot be handled within the N-gram extended character string framework alone. It is therefore more difficult to apply to actual products than the chain probability estimation method of the embodiment described above.

[0094] In the first embodiment, the extended character consists of a character (character type) and delimiter information, but it may additionally include part-of-speech information (which may include the conjugated form). In that case, the chain probability estimation (averaging) performed by the extended character estimation unit 5 is carried out for each group of records having the same part-of-speech information, and a record is added to the extended character table 2 for every distinct part-of-speech value. For an N-gram extended character string of the input text that does not exist in the extended character table 2, the addition to the extended character table 2 may be skipped for part-of-speech information that has only a small number of matching records.

[0095] (B) Second Embodiment
A second embodiment, in which the natural language processing apparatus according to the present invention is applied to a morphological analyzer, is described in detail below with reference to the drawings.

[0096] (B-1) Configuration of the Second Embodiment
FIG. 10 is a functional block diagram showing the configuration of the morphological analyzer of the second embodiment. Parts identical or corresponding to those in FIG. 1 of the first embodiment are denoted by the same reference numerals.

[0097] In FIG. 10, the morphological analyzer of the second embodiment has, in addition to the same input device 1, extended character table 2 (see FIG. 2), score table 3 (see FIG. 4), extended character string generation unit 4, extended character estimation unit 5, chain probability calculation unit 6, optimum path search unit 7, and output device 8 as in the first embodiment, an unknown word detection unit 9, a non-target character pattern memory 10, an input control unit 11, an input buffer memory 12, an unknown word buffer memory 13, and an output synthesis unit 14.

[0098] The functions of the input device 1, extended character table 2, score table 3, extended character string generation unit 4, extended character estimation unit 5, chain probability calculation unit 6, optimum path search unit 7, and output device 8 are the same as in the first embodiment, so their description is omitted.

[0099] The unknown word detection unit 9, non-target character pattern memory 10, input control unit 11, input buffer memory 12, unknown word buffer memory 13, and output synthesis unit 14, which are newly provided in the second embodiment, serve to detect an unknown word character string in the input text, estimate that unknown word character string, and reflect the estimate in the morphological analysis result.

[0100] The unknown word detection unit 9 detects unknown word portions in the input text from the input device 1 on the basis of the information stored in the non-target character pattern memory 10 and, referring to the contents of the extended character table 2, estimates character strings that are likely to be correct for the detected unknown word character string.

【０１０１】非対象文字パターンメモリ１０は、未知語
検出部９が、未知語を検出する際に利用する、当該形態
素解析装置の形態素解析の対象となり得ない文字（非対
象文字）のパターンを格納しているものである。The non-target character pattern memory 10 stores the patterns of characters that cannot be subject to morphological analysis by the morphological analyzer (non-target characters); the unknown word detection unit 9 uses these patterns when detecting unknown words.

【０１０２】入力制御部１１は、未知語検出部９により
入力テキスト中に未知語が検出され、正しいと推定され
た入力テキストを制御するものである。The input control section 11 controls an input text that is estimated to be correct when an unknown word is detected in the input text by the unknown word detection section 9.

【０１０３】入力バッファメモリ１２は、未知語検出部
９や入力制御部１１が新たに作成した（推定した）１又
は複数の入力テキストを一時保存するものである。The input buffer memory 12 temporarily stores one or more input texts newly created (estimated) by the unknown word detection unit 9 and the input control unit 11.

【０１０４】未知語バッファメモリ１３は、入力制御部
１１の制御下で、未知語検出部９が検出した未知語部分
の文字列を一時退避保存しておくものである。Under the control of the input control unit 11, the unknown word buffer memory 13 temporarily saves the character string of the unknown word portion detected by the unknown word detection unit 9.

【０１０５】出力合成部１４は、入力制御部１１が制御
する複数の入力テキストについての最適経路探索部７か
らの形態素解析結果と、未知語バッファメモリ１３に退
避されている未知語部分の文字列をを合成し、所望の形
態素解析結果を得るための処理を行うものである。The output synthesizing unit 14 combines the morphological analysis results that the optimal route search unit 7 produces for the plural input texts controlled by the input control unit 11 with the unknown word character strings saved in the unknown word buffer memory 13, and thereby obtains the desired morphological analysis result.

【０１０６】図１１は、第２の実施形態における非対象
文字パターンテーブル１０の構成例を示す説明図であ
る。FIG. 11 is an explanatory diagram showing a configuration example of the non-target character pattern table 10 in the second embodiment.

【０１０７】非対象文字とは、当該形態素解析装置への
入力テキスト中に含まれるはずがないと考えられる文字
のことであり、例えば、文字化けやミスタイプ等によっ
て入力テキスト中に生じる可能性が高いものである。非
対象文字パターンメモリ１０には、非対象文字の集合が
予め設定格納されている。A non-target character is a character that is not expected to appear in the input text to the morphological analyzer; such characters are likely to arise in the input text through, for example, character corruption or typing errors. A set of non-target characters is set and stored in the non-target character pattern memory 10 in advance.

【０１０８】図１１において、例えば、レコードＬ１１
０１には「∬‰♪¶‡」という通常の文章では生じるこ
とが考えられない非対象文字の集合が登録されており、
入力テキスト中に現われる「∬‰♪¶‡」の各文字は、
非対象文字であることが示されている。また、レコード
Ｌ１１０２に示すように、非対象文字パターンとして、
［辧−咨］や［嵌−巍］のように、［開始文字コード−
終了文字コード］の表現による文字コード（テキスト）
の範囲で指定することもできる。すなわち、入力テキス
ト中に、このコード範囲のコードを有する文字がある場
合には、その文字は非対象文字であることが示されてい
る。現在の文章の多くは、第２水準の漢字を含むことは
ごく稀であり、含まれていてもその第２水準の漢字はあ
る程度限られたものとなり、第２水準の漢字の多くを非
対象文字として登録することは実際的である。In FIG. 11, for example, record L1101 registers "∬‰♪¶‡", a set of non-target characters that are not expected to occur in ordinary text; each of these characters appearing in the input text is thereby indicated to be a non-target character. As shown in record L1102, a non-target character pattern can also be specified as a range of character codes in the form [start character code - end character code], such as [辧-咨] or [嵌-巍]. That is, any character in the input text whose code falls within such a range is indicated to be a non-target character. Most present-day text only rarely contains second-level (JIS level 2) kanji, and those that do appear are limited to a fairly small set, so registering most second-level kanji as non-target characters is practical.
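The two kinds of record described for FIG. 11 can be sketched in Python as follows. This is only an illustration: the class name `NonTargetPatterns` is hypothetical, the range shown is an arbitrary example rather than one from FIG. 11, and Unicode code points stand in for the character codes used by the apparatus.

```python
class NonTargetPatterns:
    """Explicit non-target characters plus inclusive character-code ranges."""

    def __init__(self, chars="", ranges=()):
        self.chars = set(chars)
        # Each range is an inclusive (start, end) pair of single characters.
        self.ranges = [(ord(lo), ord(hi)) for lo, hi in ranges]

    def is_non_target(self, ch):
        code = ord(ch)
        return ch in self.chars or any(lo <= code <= hi for lo, hi in self.ranges)


patterns = NonTargetPatterns(
    chars="∬‰♪¶‡",                 # explicit set, in the manner of record L1101
    ranges=[("\ue000", "\uf8ff")],  # hypothetical range, in the manner of record L1102
)
```

Keeping the explicit set and the ranges separate mirrors the two record styles; a character is non-target if either one matches.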

【０１０９】図１２は、第２の実施形態における入力バ
ッファメモリ１２の格納例を示す説明図である。FIG. 12 is an explanatory diagram showing a storage example of the input buffer memory 12 in the second embodiment.

【０１１０】図１２（Ａ）は、未知語検出部９によっ
て、入力テキスト中の未知語が検出され、当該未知語候
補の推定が行われた後の入力テキスト（ここでは２種
類）が格納されている状態を示している。レコードＬ１
２０１の「この形態素の答における利点は」とレコード
Ｌ１２０２の「この形態素解析における利点は」との差
分である「の答」と「解析」の部分が、未知語検出部９
によって推定された未知語部分である。FIG. 12(A) shows the state in which the unknown word detection unit 9 has detected an unknown word in the input text, estimated candidates for it, and stored the resulting input texts (two in this example). Record L1201, "この形態素の答における利点は", and record L1202, "この形態素解析における利点は", differ in "の答" and "解析"; these are the unknown word portions estimated by the unknown word detection unit 9.

【０１１１】図１２（Ｂ）は、未知語検出部９をもって
しても推定できなかった未知語部分が存在した場合に、
未知語検出部９によって未知語部分にマークして格納さ
れている状態である。実際の入力テキストが「この形態
♪‰¶∬おける利点は」であった場合に、未知語検出部
９が検出した未知語「♪‰¶∬」の文字長（＝４）が、
当該形態素解析装置の未知語推定能力である３文字（Ｎ
＝３のＮ−ｇｒａｍ文字列を扱うようにしている）を超
えていると（後述する図１５参照）、未知語部分を推定
できないので、未知語部分の領域を規定するかっこ｛｝
でマークして格納される。FIG. 12(B) shows the state in which, because there was an unknown word portion that even the unknown word detection unit 9 could not estimate, the text has been stored with that portion marked by the unknown word detection unit 9. Suppose the actual input text is "この形態♪‰¶∬おける利点は". The character length (= 4) of the detected unknown word "♪‰¶∬" exceeds 3 characters, the unknown-word estimation capacity of the morphological analyzer (which handles N-gram character strings with N = 3; see FIG. 15, described later). The unknown word portion therefore cannot be estimated, and the text is stored with the portion marked by braces ｛｝ that delimit its extent.

【０１１２】なお、入力バッファメモリ１２は、例え
ば、ＦＩＦＯ（ＦｉｒｓｔＩｎＦｉｒｓｔＯｕ
ｔ）形式のバッファ、すなわち、先入れ先出し形式のバ
ッファとなっており、例えば、レコードＬ１２０１が取
り出されて処理されると、レコードＬ１２０２がレコー
ドＬ１２０１の位置にシフトしてレコードＬ１２０２の
今までの位置が空となるように、次々と上位レコードの
エリアへシフトする構成となっている。The input buffer memory 12 is, for example, a FIFO (First In First Out) buffer. When, for example, record L1201 is taken out and processed, record L1202 shifts into the position of record L1201 and its former position becomes empty; records thus shift one after another into the area of the record above.
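As a rough illustration of this first-in first-out behavior, the buffer can be modeled with a Python `collections.deque`. The variable name `input_buffer` is hypothetical, and the record contents are the two texts of FIG. 12(A).

```python
from collections import deque

# Model of the input buffer memory 12 as a FIFO queue.
input_buffer = deque()
input_buffer.append("この形態素の答における利点は")  # record L1201
input_buffer.append("この形態素解析における利点は")  # record L1202

# Taking out the oldest record moves the next one to the front.
first = input_buffer.popleft()
```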

【０１１３】図１３は、第２の実施形態における未知語
バッファメモリ１３の格納例を示す説明図である。FIG. 13 is an explanatory diagram showing a storage example of the unknown word buffer memory 13 in the second embodiment.

【０１１４】未知語バッファメモリ１３には、図１２
（Ｂ）について説明したような、未知語検出部９をもっ
てしても推定できなかった未知語部分が存在した場合
に、入力制御部１１によって当該未知語部分が取り出さ
れて格納されるものである。実際の入力テキストが「こ
の形態♪‰¶∬おける利点は」であった場合には、その
未知語部分「♪‰¶∬」が、未知語バッファメモリ１３
に格納される。When, as described for FIG. 12(B), there is an unknown word portion that even the unknown word detection unit 9 could not estimate, the input control unit 11 extracts that portion and stores it in the unknown word buffer memory 13. If the actual input text is "この形態♪‰¶∬おける利点は", the unknown word portion "♪‰¶∬" is stored in the unknown word buffer memory 13.

【０１１５】（Ｂ−２）第２の実施形態の動作以下、第２の実施形態の形態素解析装置の動作（形態素
解析方法）を図面を参照しながら説明する。(B-2) Operation of the Second Embodiment The operation (morphological analysis method) of the morphological analyzer of the second embodiment will be described below with reference to the drawings.

【０１１６】まず、第２の実施形態の形態素解析装置の
全体の動作を、図１４に示すフローチャートを参照しな
がら説明する。なお、図１４において、図６との同一、
対応ステップには同一符号を付して示している。First, the overall operation of the morphological analyzer according to the second embodiment will be described with reference to the flowchart shown in FIG. 14. In FIG. 14, steps identical or corresponding to those in FIG. 6 are denoted by the same reference numerals.

【０１１７】第２の実施形態においては、入力テキスト
の部分文字列に、自然言語の通常の文章には存在し得な
い文字列としての未知語が存在する場合に、当該未知語
部分を検出し、可能な限り復元することを以下の動作に
よって実施する。なお、（ステップＳ６０１）〜（ステ
ップＳ６０５）の各ステップは、第１の実施形態と同様
の動作である。In the second embodiment, when a partial character string of the input text contains an unknown word, that is, a character string that cannot occur in ordinary natural-language text, the unknown word portion is detected and restored as far as possible by the following operation. Steps S601 to S605 operate in the same way as in the first embodiment.

【０１１８】（ステップＳ６０１）入力装置１を介し
て入力テキストを本形態素解析装置に読み込む。(Step S601) The input text is read into the morphological analyzer via the input device 1.

【０１１９】（ステップＳ１４０１）入力テキストの
未知語部分を検出し、可能な限り未知語部分を復元した
テキスト（以下、推定テキストと称する）を生成して入
力バッファメモリ１２に格納する。また、未知語部分の
推定ができない場合には、当該未知語部分にマークを付
したしたテキスト（以下、マークテキストと称する）を
入力バッファメモリ１２に格納する。なお、入力テキス
トの未知語部分が検出できない場合には、当然に復元や
マーク付与処理は実行されない。また、次のステップＳ
１４０２の処理も省略される。(Step S1401) Unknown word portions of the input text are detected, and a text in which they are restored as far as possible (hereinafter, an estimated text) is generated and stored in the input buffer memory 12. When an unknown word portion cannot be estimated, a text in which that portion is marked (hereinafter, a mark text) is stored in the input buffer memory 12 instead. If no unknown word portion is detected in the input text, neither restoration nor marking is performed, and the processing of the next step S1402 is also omitted.

【０１２０】（ステップＳ１４０２）ステップＳ１４
０１にて生成された推定テキスト又はマークテキストが
格納されている入力バッファメモリ１２を制御し、以下
のステップにテキストを渡す。また、推定できない未知
語部分は、未知語バッファメモリ１３に格納する。(Step S1402) The input buffer memory 12, which holds the estimated texts or mark texts generated in step S1401, is controlled so that the texts are passed to the subsequent steps. Unknown word portions that could not be estimated are stored in the unknown word buffer memory 13.

【０１２１】（ステップＳ６０２）入力バッファメモ
リ１２から読み込んだテキストの各文字から拡張文字を
生成し、入力テキストの先頭から末尾に至る拡張文字列
の経路をもとめてスコアテーブル３に格納する。このス
テップでは、入力テキスト中に拡張文字テーブル２に存
在しない一定文字数を持つ文字列（Ｎ−ｇｒａｍ文字
列）が存在する場合に、対応する部分拡張文字列、すな
わち、未知語に対応した部分拡張文字列のレコードを推
定する。(Step S602) Extended characters are generated from each character of the text read from the input buffer memory 12, and the paths of extended character strings running from the beginning to the end of the text are obtained and stored in the score table 3. In this step, when the text contains a character string of the fixed length (an N-gram character string) that does not exist in the extended character table 2, the record of the corresponding partial extended character string, that is, the partial extended character string corresponding to the unknown word, is estimated.

【０１２２】（ステップＳ６０３）生成された全ての
拡張文字列の経路に対する連鎖確率を求める。拡張文字
列の連鎖確率は、当該拡張文字列を構成する部分拡張文
字列のそれぞれに対応する部分連鎖確率を拡張文字テー
ブル２を参照することにより求めて、それぞれの部分連
鎖確率の積として求める。求めた連鎖確率は、スコアテ
ーブル３の対応する拡張文字列の経路のレコードに格納
しておく。(Step S603) The chain probability is obtained for every generated extended-character-string path. The chain probability of an extended character string is the product of the partial chain probabilities of the partial extended character strings that make it up, each looked up in the extended character table 2. The result is stored in the score table 3 in the record for the corresponding path.

【０１２３】（ステップＳ６０４）スコアテーブル３
を参照し、得られた連鎖確率の中から最適な条件（例え
ば最大の値の連鎖確率を与えるなど）を満たす拡張文字
列を最適拡張文字列として選択する。(Step S604) With reference to the score table 3, the extended character string that satisfies the optimality condition (for example, the one giving the largest chain probability) is selected from among the obtained chain probabilities as the optimal extended character string.
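Steps S603 and S604 amount to a product over table lookups followed by a maximum. The sketch below illustrates this under stated assumptions: each partial extended character string is treated as an opaque dictionary key, the probabilities are invented for illustration, and the function names are hypothetical.

```python
import math


def chain_probability(path, table):
    """Product of the partial chain probabilities along one path (step S603)."""
    return math.prod(table[partial] for partial in path)


def select_best(paths, table):
    """Path with the largest chain probability (one possible criterion in step S604)."""
    return max(paths, key=lambda path: chain_probability(path, table))


# Hypothetical partial chain probabilities keyed by partial extended strings.
table = {"ab": 0.5, "bc": 0.4, "bd": 0.1}
paths = [["ab", "bc"], ["ab", "bd"]]
best = select_best(paths, table)
```

For long texts a practical implementation would more likely sum log probabilities to avoid underflow; the plain product is used here only because it matches the description directly.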

【０１２４】（ステップＳ１４０３）入力制御部１１
によって制御された複数のテキストの形態素解析結果を
出力として合成する。なお、入力テキストに未知語（非
対象文字）部分がない場合には、このステップは、最適
経路探索部７からの形態素解析結果をそのまま出力装置
８に引き渡す処理となる。(Step S1403) The morphological analysis results of the plural texts controlled by the input control unit 11 are combined into the output. If the input text contains no unknown word (non-target character) portion, this step simply passes the morphological analysis result from the optimal route search unit 7 to the output device 8 as it is.

【０１２５】（ステップＳ６０５）出力装置８を介し
て単語列の並びを含む形態素解析結果として出力する。(Step S605) The morphological analysis result including the word sequence is output via the output device 8.

【０１２６】図１５は、ステップＳ１４０１による未知
語（非対象文字）の検出動作を詳細に説明するフローチ
ャートである。FIG. 15 is a flowchart illustrating in detail the operation of detecting an unknown word (non-target character) in step S1401.

【０１２７】未知語検出部９は、非対象文字パターンメ
モリ１０を参照することにより、入力テキスト中の全て
の非対象文字連続部分と、各非対象文字連続部分の非対
象文字数Ｌを検出する（ステップＳ１５０１）。なお、
この処理により、非対象文字が１個も検出できない場合
には、分岐線の図示は省略しているが、一連の処理を終
了する。By referring to the non-target character pattern memory 10, the unknown word detection unit 9 detects every run of consecutive non-target characters in the input text and the number L of non-target characters in each run (step S1501). If this processing finds no non-target character at all, the series of processing ends; the corresponding branch is not shown in the figure.
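Step S1501 can be pictured as a single scan that records the start position and length L of each run. The following sketch assumes the non-target test is supplied as a predicate; the function name `find_non_target_runs` and the use of a plain set for the non-target characters are illustrative choices, not part of the described apparatus.

```python
def find_non_target_runs(text, is_non_target):
    """Return (start, length) for every run of consecutive non-target characters."""
    runs = []
    start = None
    for i, ch in enumerate(text):
        if is_non_target(ch):
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # the text ended inside a run
        runs.append((start, len(text) - start))
    return runs


NON_TARGET = set("∬‰♪¶‡")
runs = find_non_target_runs("この形態素‰¶における利点は", NON_TARGET.__contains__)
```

An empty result corresponds to the case in which no non-target character is found and the processing ends.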

【０１２８】次に、拡張文字テーブル２に格納されてい
るＮ−ｇｒａｍ拡張文字列の次数（文字数）Ｎと、ある
１個の非対象文字連続部分についての非対象文字数Ｌと
比較する（ステップＳ１５０２）。この比較は、現在処
理対象の非対象文字連続部分について、拡張文字テーブ
ル２の格納内容を利用して正しいと思われる文字列が推
定できるか否かの判定を意味する。Next, the order (number of characters) N of the N-gram extended character strings stored in the extended character table 2 is compared with the number L of non-target characters in one run (step S1502). This comparison determines whether a character string likely to be correct can be estimated for the run currently being processed by using the contents of the extended character table 2.

【０１２９】未知語検出部９は、Ｌ＜Ｎであるならば
（ステップＳ１５０２で肯定結果）、現在処理対象の非
対象文字連続部分の前又は及び後の非対象文字以外の文
字を含み、非対象文字部分がいずれの文字であっても良
い、Ｎ−ｇｒａｍ文字列（拡張情報は何れでも良い）に
合致するレコードが拡張文字テーブル２に存在するかを
検索し（ステップＳ１５０３）、照合したレコードで定
まる、非対象文字列部分に置き換え可能な文字列パター
ンが存在するか否かを判定する（ステップＳ１５０
４）。If L < N (affirmative result in step S1502), the unknown word detection unit 9 searches the extended character table 2 for records that match an N-gram character string containing the characters other than non-target characters before and/or after the run currently being processed, in which the non-target positions may be any character and the extension information may be anything (step S1503). It then determines whether the matched records yield a character string pattern that can replace the non-target character string portion (step S1504).

【０１３０】そして、非対象文字列部分に置き換え可能
な文字列パターンが存在するならば（ステップＳ１５０
４で肯定結果）、当該非対象文字列部分に代えて、その
文字列パターンを適用した推定テキストを生成して入力
バッファメモリ１２に格納する（ステップＳ１５０
５）。ここで、ステップＳ１５０３の検索において、非
対象文字列部分に置き換え可能な文字列パターンとして
複数のものが得られる場合も想定され、この場合には、
異なる文字列パターンを有する複数の推定テキストを生
成して入力バッファメモリ１２に格納する。If such a replacement character string pattern exists (affirmative result in step S1504), an estimated text in which the pattern is substituted for the non-target character string portion is generated and stored in the input buffer memory 12 (step S1505). The search in step S1503 may yield more than one replacement pattern; in that case, a separate estimated text is generated for each pattern and all of them are stored in the input buffer memory 12.

【０１３１】なお、入力バッファメモリ１２に格納し得
る推定テキストであるか否かを連鎖確率を利用して判定
するようにしても良い。このことについては、後述する
具体例を用いた処理で説明する。Note that whether or not the estimated text can be stored in the input buffer memory 12 may be determined by using the chain probability. This will be described in a process using a specific example described later.

【０１３２】一方、非対象文字数ＬがＮ−ｇｒａｍ文字
列の次数Ｎ以上である場合（ステップＳ１５０２で否定
結果）や、非対象文字列部分に置き換え可能な文字列パ
ターンが存在しない場合（ステップＳ１５０４で否定結
果）には、未知語（非対象文字列）の本来の文字列への
推定が不可能であるので、入力テキストの非対象文字列
に未知語マーカを付与してマークテキストを生成し、入
力バッファメモリ１２に格納する（ステップＳ１５０
６）。On the other hand, when the number L of non-target characters is equal to or greater than the order N of the N-gram character strings (negative result in step S1502), or when no replacement character string pattern exists (negative result in step S1504), the original character string of the unknown word (non-target character string) cannot be estimated. An unknown word marker is therefore attached to the non-target character string of the input text to generate a mark text, which is stored in the input buffer memory 12 (step S1506).

【０１３３】しかる後に、入力テキスト中の非対象文字
の全ての連続部分を処理したかどうかを判定し（ステッ
プＳ１５０７）、未処理の非対象文字の連続部分が存在
する場合には、上述したステップＳ１５０２〜Ｓ１５０
７の処理を他の非対象文字の連続部分に対して繰り返
し、入力テキスト中の非対象文字の全ての連続部分を処
理した場合には、未知語検出部９は一連の動作を終了す
る。It is then determined whether every run of non-target characters in the input text has been processed (step S1507). If an unprocessed run remains, steps S1502 to S1507 are repeated for it; once all runs have been processed, the unknown word detection unit 9 ends the series of operations.
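The decision made for one run in steps S1502 to S1506 can be sketched as below. The sketch rests on several simplifying assumptions: the table is a plain set of N-gram strings with extension information and probabilities ignored, only the characters immediately before the run are used as context, and the function name `estimate_or_mark` is hypothetical.

```python
OPEN, CLOSE = "｛", "｝"  # unknown word markers, as in FIG. 12(B)


def estimate_or_mark(text, start, length, table, n=3):
    """Return the estimated texts for one run, or a single marked text."""
    end = start + length
    candidates = []
    # S1502: estimation is attempted only when L < N (and context exists).
    if length < n and start >= n - length:
        prefix = text[start - (n - length):start]
        # S1503: the run positions act as wildcards after the known prefix.
        candidates = sorted(
            gram[len(prefix):]
            for gram in table
            if len(gram) == n and gram.startswith(prefix)
        )
    if candidates:  # S1504 affirmative, S1505
        return [text[:start] + c + text[end:] for c in candidates]
    # S1506: mark the run when it cannot be estimated.
    return [text[:start] + OPEN + text[start:end] + CLOSE + text[end:]]


table = {"素の答", "素解析", "形態素"}
estimated = estimate_or_mark("この形態素‰¶における利点は", 5, 2, table)
marked = estimate_or_mark("この形態♪‰¶∬おける利点は", 4, 4, table)
```

Returning a list in both cases keeps the caller uniform: every returned text is simply stored in the input buffer.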

【０１３４】例えば、「この形態素‰¶における利点
は」という入力テキストに対して、未知語検出部９が、
図１１に示す非対象文字パターンメモリ１０を参照する
と、レコードＬ１１０１より「‰」及び「¶」が非対象
文字であることが判り、前記入力テキストの「‰¶」が
未知語（非対象文字連続部分）であると検出し、その長
さＬが２であると検出する（ステップＳ１５０１）。For example, given the input text "この形態素‰¶における利点は", the unknown word detection unit 9 refers to the non-target character pattern memory 10 shown in FIG. 11, finds from record L1101 that "‰" and "¶" are non-target characters, detects "‰¶" in the input text as an unknown word (a run of non-target characters), and detects that its length L is 2 (step S1501).

【０１３５】ここで、Ｎ−ｇｒａｍ文字列の次数Ｎを３
とすると、Ｌ＜Ｎとなり（ステップＳ１５０２）、拡張
文字テーブル２を検索する（ステップＳ１５０３）。If the order N of the N-gram character strings is 3, then L < N (step S1502), and the extended character table 2 is searched (step S1503).

【０１３６】今、この検索が、非対象文字連続部分「‰
¶」の前側の非対象文字以外の文字「素」と、非対象文
字連続部分「‰¶」に対する２個の汎用文字とのＮ−ｇ
ｒａｍ文字列（３−ｇｒａｍ文字列）で行われたとす
る。また、拡張文字テーブル２には、例えば「＜素，１
＞＜の，０＞＜答，１＞」及び「＜素，１＞＜解，０＞
＜析，１＞」なるＮ−ｇｒａｍ文字列のレコードが存在
したとする。Suppose that this search is performed with the N-gram character string (3-gram character string) formed by "素", the character other than a non-target character immediately before the run "‰¶", followed by two general-purpose (wildcard) characters standing for "‰¶". Suppose also that the extended character table 2 contains, for example, records for the N-gram character strings "<素,1><の,0><答,1>" and "<素,1><解,0><析,1>".

【０１３７】この場合には、拡張文字テーブル２の検索
により、非対象文字連続部分「‰¶」に置き換えられる
文字列パターンとして、「の答」及び「解析」の存在が
確認され（ステップＳ１５０４）、推定テキスト「この
形態素の答における利点は」と「この形態素解析におけ
る利点は」が生成されてこれらが入力バッファメモリ１
２に格納される（ステップＳ１５０５）。この格納状態
での入力バッファメモリ１２は、上述した図１２（Ａ）
に示すようになる。In this case, the search of the extended character table 2 confirms that "の答" and "解析" exist as character string patterns that can replace the run "‰¶" (step S1504), so the estimated texts "この形態素の答における利点は" and "この形態素解析における利点は" are generated and stored in the input buffer memory 12 (step S1505). The input buffer memory 12 is then in the state shown in FIG. 12(A) described above.

【０１３８】なお、検索により発見した「＜素，１＞＜
の，０＞＜答，１＞」及び「＜素，１＞＜解，０＞＜
析，１＞」なるＮ−ｇｒａｍ文字列の連鎖確率を取り出
し、その連鎖確率を閾値と比較し、閾値を越えている場
合にのみ、推定テキストの生成を行うようにしても良
い。Alternatively, the chain probabilities of the N-gram character strings "<素,1><の,0><答,1>" and "<素,1><解,0><析,1>" found by the search may be retrieved and compared with a threshold, and an estimated text generated only for those whose chain probability exceeds the threshold.

【０１３９】また、非対象文字連続部分「‰¶」の前側
の非対象文字以外の文字「素」と、非対象文字連続部分
「‰¶」に対する２個の汎用文字とのＮ−ｇｒａｍ文字
列（３−ｇｒａｍ文字列）での検索で「＜素，１＞＜
の，０＞＜答，１＞」というＮ−ｇｒａｍ文字列が得ら
れた場合、非対象文字連続部分「‰¶」に置き換え可能
な検索文字列「＜の，０＞＜答，１＞」と、入力テキス
トにおける非対象文字連続部分「‰¶」の後側の非対象
文字以外の文字「に」とのＮ−ｇｒａｍ文字列（３−ｇ
ｒａｍ文字列）で再度拡張文字テーブル２を照合し、こ
のＮ−ｇｒａｍ文字列（３−ｇｒａｍ文字列）が検索で
きたことで、非対象文字連続部分「‰¶」が「の答」と
推定するようにしても良い。この場合においても、例え
ば、前側の非対象文字以外の文字「素」を含むＮ−ｇｒ
ａｍ文字列の検索で発見できたＮ−ｇｒａｍ文字列の連
鎖確率と、後側の非対象文字以外の文字「に」を含むＮ
−ｇｒａｍ文字列の検索で発見できたＮ−ｇｒａｍ文字
列の連鎖確率とを乗算した後、その乗算値を閾値と比較
し、閾値を越えている場合にのみ、推定テキストの生成
を行うようにしても良い。Further, when the search with the N-gram character string (3-gram character string) formed by the preceding character "素" and two general-purpose (wildcard) characters for the run "‰¶" yields the N-gram character string "<素,1><の,0><答,1>", the extended character table 2 may be searched again with the N-gram character string (3-gram character string) formed by the candidate replacement "<の,0><答,1>" and "に", the character other than a non-target character immediately after the run in the input text. The run "‰¶" is then estimated to be "の答" only if this second N-gram character string is also found. In this case too, the chain probability of the N-gram character string found with the preceding character "素" may, for example, be multiplied by the chain probability of the N-gram character string found with the following character "に", and the estimated text generated only when the product exceeds a threshold.
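The two-sided check with a threshold can be sketched as follows. This is a simplified illustration: extension information is dropped so that N-grams are plain strings, the probabilities and the threshold are invented values, and the function name `confirm_candidate` is hypothetical.

```python
def confirm_candidate(front_ngram, next_char, table, threshold=0.0):
    """Accept a candidate only if the N-gram overlapping the following character
    is also in the table and the product of both probabilities exceeds the threshold."""
    # Slide the window one character to the right to include the following character.
    back_ngram = front_ngram[1:] + next_char
    if front_ngram not in table or back_ngram not in table:
        return False
    return table[front_ngram] * table[back_ngram] > threshold


# Hypothetical partial chain probabilities.
table = {"素の答": 0.01, "素解析": 0.3, "解析に": 0.2}

accept_kaiseki = confirm_candidate("素解析", "に", table, threshold=0.05)
accept_nokotae = confirm_candidate("素の答", "に", table, threshold=0.05)
```

With these invented values the candidate "解析" is kept and "の答" is rejected, because "の答に" has no record in the table.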

【０１４０】一方、入力テキストが、例えば、「この形
態♪‰¶∬おける利点は」であった場合には、未知語検
出部９は、以下のように動作する。未知語検出部９が、
図１１に示す非対象文字パターンメモリ１０を参照する
と、レコードＬ１１０１より「♪」、「‰」、「¶」及
び「∬」が非対象文字であることが判り、前記入力テキ
ストの「♪‰¶∬」が未知語（非対象文字連続部分）で
あると検出し、その長さＬが４であると検出する（ステ
ップＳ１５０１）。なお、通信手段で受信したテキスト
が入力テキストの場合等では、バーストエラーが発生し
易く、多くの非対象文字が連続することも発生する恐れ
がある。On the other hand, when the input text is, for example, "この形態♪‰¶∬おける利点は", the unknown word detection unit 9 operates as follows. Referring to the non-target character pattern memory 10 shown in FIG. 11, it finds from record L1101 that "♪", "‰", "¶", and "∬" are non-target characters, detects "♪‰¶∬" in the input text as an unknown word (a run of non-target characters), and detects that its length L is 4 (step S1501). When the input text has been received over a communication channel, for instance, burst errors occur easily and long runs of non-target characters may result.

【０１４１】この場合にはＬ＞Ｎとなるので（ステップ
Ｓ１５０２）、未知語部分にマーカを付与したマーカテ
キスト「この形態｛♪‰¶∬｝おける利点は」が生成さ
れて入力バッファメモリ１２に格納される（ステップＳ
１５０６）。上述した図１２（Ｂ）は、この場合の入力
バッファメモリ１２の格納状態を示している。In this case L > N (step S1502), so the marker text "この形態｛♪‰¶∬｝おける利点は", in which a marker is attached to the unknown word portion, is generated and stored in the input buffer memory 12 (step S1506). FIG. 12(B) described above shows the state of the input buffer memory 12 in this case.

【０１４２】図１６は、上述したステップＳ１４０２の
入力制御の動作を詳細に説明するフローチャートであ
る。FIG. 16 is a flowchart for explaining in detail the operation of the input control in step S1402 described above.

【０１４３】入力制御部１１は、入力バッファメモリ１
２からレコードを１つ取り出す（ステップＳ１６０
２）。そして、取り出したレコード中に、未知語マーカ
が付与されているか否かを判定する。The input control unit 11 takes one record out of the input buffer memory 12 (step S1602). It then determines whether an unknown word marker is attached in the retrieved record.

【０１４４】未知語マーカが付与されている（すなわ
ち、取り出したレコードがマーカテキストである）なら
ば（ステップＳ１６０２で肯定結果）、当該レコードよ
り未知語マーカ部分を切り出し、マーカを除去して未知
語バッファメモリ１３に格納する（ステップＳ１６０
３；図１３参照）。この場合、マーカテキストは、未知
語部分が除去されて２つに分割される。If an unknown word marker is attached (that is, the retrieved record is a marker text) (affirmative result in step S1602), the marked unknown word portion is cut out of the record, the marker is removed, and the unknown word is stored in the unknown word buffer memory 13 (step S1603; see FIG. 13). The marker text is thereby split into two parts with the unknown word portion removed.

【０１４５】一方、ステップＳ１６０２にて未知語マー
カが検出されなかった場合（すなわち、取り出したレコ
ードが推定テキストである場合）には、未知語なし記号
（例えば、「＠」など）を未知語バッファメモリ１３に
格納する（ステップＳ１６０４）。On the other hand, if no unknown word marker is detected in step S1602 (that is, the retrieved record is an estimated text), a no-unknown-word symbol (for example, "＠") is stored in the unknown word buffer memory 13 (step S1604).

【０１４６】しかる後に、当該マーカテキストの未知語
部分より前半の部分又は当該推定テキストを拡張文字列
生成部４に渡す（ステップＳ１６０５）。ここで、後述
する出力合成部１４の動作と同期するために、入力制御
部１１は、未知語バッファメモリ１３が空になるのを監
視し（ステップＳ１６０６）、未知語バッファメモリ１
３が空になれば、当該レコードの全てを処理したかどう
かをチェックする（ステップＳ１６０７）。After that, the part of the marker text preceding the unknown word portion, or the estimated text, is passed to the extended character string generation unit 4 (step S1605). To synchronize with the operation of the output synthesizing unit 14 described later, the input control unit 11 then waits for the unknown word buffer memory 13 to become empty (step S1606); once it is empty, the input control unit 11 checks whether the whole of the record has been processed (step S1607).

【０１４７】当該レコードの全てを処理していなけれ
ば、すなわち、マーカテキストの未知語部分より後半の
部分が残されているならば（ステップＳ１６０７で否定
結果）、ステップＳ１６０２〜Ｓ１６０７を繰り返す。If the record has not been processed in full, that is, if the part of the marker text following the unknown word portion remains (negative result in step S1607), steps S1602 to S1607 are repeated.

【０１４８】また、当該レコードを全て処理したなら
ば、入力バッファメモリ１２中に、テキストが残されて
いないかチェックし（ステップＳ１６０８）、未処理の
レコードが入力バッファメモリ１２中に残されているな
らば、ステップＳ１６０１〜Ｓ１６０８を繰り返し、未
処理のレコードが入力バッファメモリ１２中に残されて
いないならば、最後に、入力終了記号（例えば、「＄」
など）を未知語バッファメモリ１３に格納し（ステップ
Ｓ１６０９）、入力制御部１１は一連の動作を終了す
る。If the record has been processed in full, the input control unit 11 checks whether any text remains in the input buffer memory 12 (step S1608). If unprocessed records remain, steps S1601 to S1608 are repeated; if none remain, an end-of-input symbol (for example, "＄") is finally stored in the unknown word buffer memory 13 (step S1609), and the input control unit 11 ends the series of operations.
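The way one record is broken up by the input control (steps S1602 to S1607) can be sketched as a generator that yields, for each pass, the text segment sent on for analysis together with the token placed in the unknown word buffer. The sketch runs sequentially and leaves out the buffer-based synchronization with the output synthesis; the function name `split_record` is hypothetical, and it assumes the marker braces never occur in ordinary text.

```python
OPEN, CLOSE = "｛", "｝"  # unknown word markers
NO_UNKNOWN = "＠"         # no-unknown-word symbol


def split_record(record):
    """Yield (segment_to_analyze, unknown_word_token) pairs for one record."""
    rest = record
    while rest:
        open_pos = rest.find(OPEN)
        if open_pos == -1:
            # S1604: no marker, so the whole remainder is analyzed as is.
            yield rest, NO_UNKNOWN
            return
        close_pos = rest.index(CLOSE, open_pos)
        # S1603: the unknown word is cut out with its markers removed.
        yield rest[:open_pos], rest[open_pos + 1:close_pos]
        rest = rest[close_pos + 1:]  # S1607: continue with the latter part


pairs = list(split_record("この形態｛♪‰¶∬｝おける利点は"))
```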

【０１４９】例えば、入力バッファメモリ１２が図１２
（Ａ）の状態であるならば、最初に、「この形態素の答
における利点は」（レコードＬ１２０１）が取り出され
（ステップＳ１６０１）、当該レコード中に未知語マー
カはないので（ステップＳ１６０２）、未知語なし記号
「＠」が未知語バッファメモリ１３に格納される（ステ
ップＳ１６０４）。そして、当該レコード「この形態素
の答における利点は」が拡張文字列生成部４に渡され
（ステップＳ１６０５）、以降は、第１の実施形態と同
様に形態素解析が行われる。一方、入力バッファメモリ
１２が図１２（Ｂ）の状態であるならば、「この形態
｛♪‰¶∬｝おける利点は」（レコードＬ１２０３）が
取り出され（ステップＳ１６０１）、未知語マーカが検
出され（ステップＳ１６０２）、当該未知語部分「♪‰
¶∬」が未知語バッファメモリ１３に格納される（ステ
ップＳ１６０３）。そして、当該マーカテキストの未知
語部分より前半の部分「この形態」が拡張文字列生成部
４に渡され（ステップＳ１６０５）、以降は、第１の実
施形態と同様に形態素解析が行われる。For example, if the input buffer memory 12 is in the state of FIG. 12(A), "この形態素の答における利点は" (record L1201) is taken out first (step S1601). The record contains no unknown word marker (step S1602), so the no-unknown-word symbol "＠" is stored in the unknown word buffer memory 13 (step S1604). The record is then passed to the extended character string generation unit 4 (step S1605), after which morphological analysis proceeds as in the first embodiment. If, on the other hand, the input buffer memory 12 is in the state of FIG. 12(B), "この形態｛♪‰¶∬｝おける利点は" (record L1203) is taken out (step S1601), the unknown word marker is detected (step S1602), and the unknown word portion "♪‰¶∬" is stored in the unknown word buffer memory 13 (step S1603). The part of the marker text preceding the unknown word portion, "この形態", is then passed to the extended character string generation unit 4 (step S1605), after which morphological analysis proceeds as in the first embodiment.

【０１５０】しかる後に、未知語バッファメモリ１３を
監視することにより、後述する出力合成の動作と同期を
図り（ステップＳ１６０６）、出力合成の動作が完了す
れば、次の推定テキスト（Ｌ１２０２「この形態素解析
における利点は」）又は、マーカテキストの未知語部分
より後半の部分（「おける利点は」）が同様に処理され
る。最後に、入力終了記号（例えば、「＄」など）が未
知語バッファメモリ１３に格納される（ステップＳ１６
０９）。After that, the input control unit 11 monitors the unknown word buffer memory 13 to synchronize with the output synthesis operation described later (step S1606). When output synthesis is complete, the next estimated text (record L1202, "この形態素解析における利点は") or the part of the marker text following the unknown word portion ("おける利点は") is processed in the same way. Finally, the end-of-input symbol (for example, "＄") is stored in the unknown word buffer memory 13 (step S1609).

【０１５１】図１７は、上述したステップＳ１４０３の
出力合成の動作を詳細に説明するフローチャートであ
る。FIG. 17 is a flowchart for explaining the output synthesizing operation in step S1403 described above in detail.

【０１５２】出力合成部１４は、未知語バッファメモリ
１３が空であるかどうかで、入力制御部１１と同期する
（ステップＳ１７０１）。すなわち、未知語バッファメ
モリ１３に未知語又は未知語なし記号「＠」が格納され
た時点で動作を開始し、未知語バッファメモリ１３から
未知語テキストを取り出す（ステップＳ１７０２）。そ
して、取り出した未知語テキストが、入力終了記号
「＄」か否かを判定する（ステップＳ１７０３）。入力
終了記号「＄」であれば一連の出力制御動作を終了し、
入力終了記号「＄」でなければ（ステップＳ１７０３で
否定結果）、最適経路探索部７から出力テキストを受け
取る（ステップＳ１７０４）。The output synthesizing unit 14 synchronizes with the input control unit 11 according to whether the unknown word buffer memory 13 is empty (step S1701). That is, it starts operating when an unknown word or the no-unknown-word symbol "＠" is stored in the unknown word buffer memory 13, and takes the unknown word text out of the unknown word buffer memory 13 (step S1702). It then determines whether the retrieved unknown word text is the end-of-input symbol "＄" (step S1703). If it is, the series of output control operations ends; if it is not (negative result in step S1703), the output text is received from the optimal route search unit 7 (step S1704).

【０１５３】その後、ステップＳ１７０２で取り出した
未知語テキストが未知語なし記号「＠」か否かを判定す
る（ステップＳ１７０５）。未知語なし記号「＠」でな
ければ、最適経路探索部７からの出力テキストに未知語
テキストを付加した後（ステップＳ１７０６）。未知語
なし記号「＠」であれば、最適経路探索部７からの出力
テキストをそのまま出力装置８へ渡す（ステップＳ１７
０７）。It is then determined whether the unknown word text taken out in step S1702 is the no-unknown-word symbol "＠" (step S1705). If it is not, the unknown word text is appended to the output text from the optimal route search unit 7 (step S1706) and the result is passed to the output device 8; if it is, the output text from the optimal route search unit 7 is passed to the output device 8 as it is (step S1707).

【０１５４】最後に、未知語バッファメモリ１３をクリ
アし（ステップＳ１７０８）、上述したステップＳ１７
０１へ戻る。未知語バッファメモリ１３がクリアされる
ことにより、入力制御部１１は次の動作を開始する。Finally, the unknown word buffer memory 13 is cleared (step S1708), and processing returns to step S1701 described above. Clearing the unknown word buffer memory 13 allows the input control unit 11 to start its next operation.

【０１５５】例えば、入力バッファメモリ１２が図１２
（Ｂ）の状態であるならば、未知語部分「♪‰¶∬」が
未知語バッファメモリ１３に格納される（ステップＳ１
６０３）。そして、当該マーカテキストの未知語部分よ
り前半の部分「この形態」が拡張文字列生成部４に渡さ
れ（ステップＳ１６０５）、以降は、第１の実施形態と
同様に形態素解析が行われる。出力合成部１４は、未知
語バッファメモリ１３に未知語部分「♪‰¶∬」が格納
されたことで動作を開始し（ステップＳ１７０１）、最
適経路探索部７から出力テキスト「／この／形態／」を
受け取り（ステップＳ１７０４）、当該テキストに未知
語を付加し（ステップＳ１７０６）、「／この／形態／
♪‰¶∬／」が出力される（ステップＳ１７０７）。For example, if the input buffer memory 12 is in the state of FIG. 12(B), the unknown word portion "♪‰¶∬" is stored in the unknown word buffer memory 13 (step S1603). The part of the marker text preceding the unknown word portion, "この形態", is passed to the extended character string generation unit 4 (step S1605), after which morphological analysis proceeds as in the first embodiment. The output synthesizing unit 14 starts operating when the unknown word portion "♪‰¶∬" is stored in the unknown word buffer memory 13 (step S1701), receives the output text "／この／形態／" from the optimal route search unit 7 (step S1704), appends the unknown word to it (step S1706), and outputs "／この／形態／♪‰¶∬／" (step S1707).

【０１５６】次に、未知語バッファメモリ１３がクリア
される（ステップＳ１７０８）ので、入力制御部１１
は、動作を再開し、マーカテキストの未知語部分より後
半の部分「おける利点は」が同様に処理され、最適経路
探索部７から出力テキスト「おける／利点／は／」を受
け取り（ステップＳ１７０４）、この場合、未知語テキ
ストは未知語なし記号「＠」であるので、そのまま、
「おける／利点／は／」が出力される（ステップＳ１７
０７）。Next, because the unknown word buffer memory 13 has been cleared (step S1708), the input control unit 11 resumes operation and the part of the marker text following the unknown word portion, "おける利点は", is processed in the same way. The output synthesizing unit 14 receives the output text "おける／利点／は／" from the optimal route search unit 7 (step S1704); since the unknown word text in this case is the no-unknown-word symbol "＠", "おける／利点／は／" is output as it is (step S1707).
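The output synthesis loop (steps S1702 to S1707) can be sketched sequentially as below, using the values from the example above. The function name `synthesize` is hypothetical, the pairing of each unknown word token with an analysis result replaces the buffer-based synchronization, and appending the delimiter "／" after the unknown word is inferred from the example output rather than stated as a rule.

```python
NO_UNKNOWN = "＠"    # no-unknown-word symbol
END_OF_INPUT = "＄"  # end-of-input symbol


def synthesize(pairs):
    """pairs: iterable of (unknown_word_token, analysis_result) in processing order."""
    outputs = []
    for unknown, analyzed in pairs:
        if unknown == END_OF_INPUT:  # S1703: nothing more to combine
            break
        if unknown != NO_UNKNOWN:    # S1705 / S1706: append the unknown word
            analyzed = analyzed + unknown + "／"
        outputs.append(analyzed)     # S1707: pass to the output device
    return outputs


outputs = synthesize([
    ("♪‰¶∬", "／この／形態／"),
    (NO_UNKNOWN, "おける／利点／は／"),
    (END_OF_INPUT, ""),
])
```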

【０１５７】 (B-3) Effects of the Second Embodiment
The second embodiment includes the same components as the first embodiment and therefore provides the same effects as the first embodiment.

【０１５８】 In addition, because the second embodiment includes the unknown word detection unit 9, the input control unit 11, the output synthesis unit 14, and related components, it provides the following effects.

【０１５９】 That is, according to the second embodiment, an unknown word portion (non-target character string) is detected and, if that portion is short, a character string likely to be its original text can be estimated.

【０１６０】 For example, for the input text "この形態素‰¶における利点は", the morphological analysis results "／この／形態素／の／答／に／おける／利点／は／" and "／この／形態素／解析／に／おける／利点／は／" can be obtained. That is, the unknown word "‰¶" is detected, and a character string that would conventionally have been handled as an unknown word can be replaced by a plausible estimate such as "の答" or "解析".
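One way to picture the estimation of a short unknown word is as a scan of an N-gram table for entries that agree with the known characters on either side of the gap. The sketch below is an assumption-laden illustration, not the disclosed procedure: it uses plain character N-grams rather than extended characters, the function name `estimate_gap` is hypothetical, and the table contents and probabilities are invented toy values.

```python
from typing import Dict, List, Tuple


def estimate_gap(left: str, gap_len: int, right: str,
                 table: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return candidate fillers for a gap of gap_len characters, best first.

    Only gaps shorter than the N-gram order n are handled, so that at least
    one known character of context fits inside a single N-gram window.
    """
    if gap_len < 1 or gap_len >= n:
        return []
    context = n - gap_len                   # known characters inside the window
    a = min(len(left), (context + 1) // 2)  # context taken from the left side
    b = context - a                         # context taken from the right side
    if len(right) < b:
        return []
    lc = left[len(left) - a:]
    rc = right[:b]
    candidates = [
        (key[a:a + gap_len], prob)
        for key, prob in table.items()
        if len(key) == n and key.startswith(lc) and key.endswith(rc)
    ]
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Toy 4-gram table with invented probabilities (illustration only).
TOY_TABLE = {"素解析に": 0.05, "素の答に": 0.02, "形態素解": 0.10, "素晴らし": 0.01}

# "この形態素‰¶における利点は": the two-character gap sits between "素" and "に".
candidates = estimate_gap("この形態素", 2, "における利点は", TOY_TABLE, n=4)
print(candidates)
```

With this toy table the two surviving candidates are "解析" and "の答", matching the two estimates named in the example above; which of them ranks first depends entirely on the invented probabilities.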

【０１６１】 Furthermore, according to the morphological analyzer of the second embodiment, an unknown word portion (non-target character string) is detected and, even when it contains many characters, the desired morphological analysis result can be obtained without degrading the accuracy of the analysis of the portions other than the unknown word.

【０１６２】 For example, for the input text "この形態♪‰¶∬おける利点は", the portions "この形態" and "おける利点は" are morphologically analyzed independently, and combining the results yields "／この／形態／♪‰¶∬／おける／利点／は／". Conventionally, the whole of "この形態♪‰¶∬おける利点は" was analyzed at once, so the unknown word portion "♪‰¶∬" degraded the accuracy of the analysis that followed it. According to the second embodiment, accurate morphological analysis can be performed without being affected by the unknown word.

【０１６３】 (B-4) Modifications of the Second Embodiment
In the second embodiment, when an unknown word cannot be estimated, the input text is divided, morphological analysis is performed on each part in turn, and the results are combined at the end. However, the morphological analysis of the divided texts may instead be performed in parallel.
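The parallel variant can be sketched with a thread pool, since the divided texts are analyzed independently of one another. This is only an illustration under stated assumptions: `analyze_segments_in_parallel` and `toy_analyze` are hypothetical names, and the toy analyzer stands in for the morphological analysis of the first embodiment. `ThreadPoolExecutor.map` returns results in input order, so the later recombination step is unaffected.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def analyze_segments_in_parallel(segments: List[str],
                                 analyze: Callable[[str], str],
                                 max_workers: int = 4) -> List[str]:
    """Run the analyzer on every divided text concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, segments))


# Hypothetical stand-in analyzer backed by toy, pre-segmented data.
TOY_ANALYSES = {"この形態": ["この", "形態"], "おける利点は": ["おける", "利点", "は"]}


def toy_analyze(segment: str) -> str:
    return "".join(word + "／" for word in TOY_ANALYSES[segment])


parallel_results = analyze_segments_in_parallel(["この形態", "おける利点は"], toy_analyze)
print(parallel_results)
```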

【０１６４】 In the second embodiment, the non-target characters used to detect an unknown word are registered one character at a time. In addition, combinations of two or more characters (in the manner of compound words) may also be registered in the non-target character pattern memory 10. In that case, for example, a compound that is not used in ordinary text can be replaced with an equivalent compound that is common in ordinary text before being submitted to morphological analysis.

【０１６５】 Furthermore, although the non-target character pattern memory 10 is fixed in the second embodiment, a non-target character pattern editing unit and an input device for it may be provided so that the user can register and delete patterns.

【０１６６】 Furthermore, in the second embodiment the unknown word detection unit 9 uses the extended character table 2, which is also used by the morphological analysis elements such as the extended character string generation unit 4, the extended character estimation unit 5, and the chain probability calculation unit 6. However, a separately constructed table may be used instead. For example, a table storing N-gram character strings of characters without extension information (preferably together with their chain probabilities) may be used.

【０１６７】 In the second embodiment, when the number of non-target characters L is equal to or greater than the order (number of characters) N of the N-gram extended character strings, no attempt is made to estimate a plausible character string. However, the estimation may still be performed when L is 2N−2 or less, at the cost of lower estimation accuracy. For example, the non-target character string may be divided into two roughly equal halves. The extended character table is scanned using the first half together with the target characters preceding it, and scanned again using the second half together with the target characters following it, and the two scan results are combined to estimate a plausible character string.
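The two-scan approach for longer gaps can be sketched as follows. When L ≤ 2N−2, each half contains at most N−1 characters, so every N-gram window still holds at least one known character. The sketch rests on assumptions not stated in the text: plain character N-grams are used instead of extended characters, the two scans are combined by multiplying their probabilities, the name `estimate_long_gap` is hypothetical, and the table values are invented.

```python
from typing import Dict, List, Tuple


def estimate_long_gap(left: str, gap_len: int, right: str,
                      table: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Estimate a gap with n <= gap_len <= 2n-2 by scanning each half separately."""
    if not n <= gap_len <= 2 * n - 2:
        return []
    h1 = (gap_len + 1) // 2   # characters in the first half
    h2 = gap_len - h1         # characters in the second half
    lc = left[len(left) - (n - h1):]   # known characters preceding the gap
    rc = right[:n - h2]                # known characters following the gap
    if len(lc) != n - h1 or len(rc) != n - h2:
        return []             # not enough surrounding context
    # First scan: N-grams that start with the preceding context.
    firsts = [(k[n - h1:], p) for k, p in table.items()
              if len(k) == n and k.startswith(lc)]
    # Second scan: N-grams that end with the following context.
    seconds = [(k[:h2], p) for k, p in table.items()
               if len(k) == n and k.endswith(rc)]
    # Combine both scans; the product of probabilities is an assumed scoring rule.
    combined = [(a + b, pa * pb) for a, pa in firsts for b, pb in seconds]
    return sorted(combined, key=lambda c: c[1], reverse=True)


# Toy 3-gram table with invented probabilities (illustration only).
TOY_TABLE = {"態素解": 0.5, "態学的": 0.25, "析にお": 0.5, "でもお": 0.125}

# A four-character gap (L = 2N-2 with N = 3) between "この形態" and "おける".
long_candidates = estimate_long_gap("この形態", 4, "おける", TOY_TABLE, n=3)
print(long_candidates[0])
```

Because the two halves are estimated from different context characters, nothing ties the end of the first half to the start of the second, which is one way to see why the text expects lower accuracy here.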

【０１６８】 In the second embodiment, the unknown word portion of the input text is replaced with an estimated, plausible character string, after which morphological analysis is performed by the method of the first embodiment. However, the morphological analysis may be performed by another method (for example, one using a word dictionary), and natural language processing other than morphological analysis may be applied to the input text after estimated replacement (the estimated text). The present invention can also be applied where only the estimated replacement is performed, with no subsequent natural language processing. For example, the features of the second embodiment can be applied solely as a means of restoring garbled characters (unknown words) in transmitted text to the original characters.

【０１６９】 (C) Other Embodiments
Various modifications were mentioned in the descriptions of the first and second embodiments above. The following modifications are also possible.

【０１７０】 In each of the above embodiments, frequency information for the N-gram extended character strings in the extended character table 2 is stored as chain probabilities, but the frequencies themselves may be stored instead. In that case, for example, the total frequency may also be stored for each group of N-gram extended character strings whose probabilities sum to 1, and the frequencies converted to probabilities when the score (evaluation value) of a path is calculated. Alternatively, the score of a path may be calculated from, for example, the sum of the frequencies of its N-gram extended character strings.
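The two scoring alternatives described above can be illustrated with a small sketch. Several points are assumptions made here for illustration: a "group" is taken to be the N-grams that share their first N−1 elements, extended characters are represented as plain strings inside tuples, the path score in the probability case is taken to be the product of the converted probabilities, and the names `group_totals`, `path_probability_score`, and `path_frequency_score` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def group_totals(freq: Dict[NGram, int]) -> Dict[NGram, int]:
    """Total frequency per group, where a group shares the first N-1 elements."""
    totals: Dict[NGram, int] = defaultdict(int)
    for ngram, count in freq.items():
        totals[ngram[:-1]] += count
    return dict(totals)


def path_probability_score(path: List[NGram], freq: Dict[NGram, int],
                           totals: Dict[NGram, int]) -> float:
    """Convert each stored frequency to a probability when scoring the path."""
    score = 1.0
    for ngram in path:
        score *= freq[ngram] / totals[ngram[:-1]]
    return score


def path_frequency_score(path: List[NGram], freq: Dict[NGram, int]) -> int:
    """Alternative score: the plain sum of the stored frequencies."""
    return sum(freq[ngram] for ngram in path)


# Toy bigram frequencies (invented values).
TOY_FREQ = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2}
TOY_TOTALS = group_totals(TOY_FREQ)
TOY_PATH = [("a", "b"), ("b", "c")]

prob_score = path_probability_score(TOY_PATH, TOY_FREQ, TOY_TOTALS)
freq_score = path_frequency_score(TOY_PATH, TOY_FREQ)
print(prob_score, freq_score)
```

Storing counts with group totals keeps the table easy to update as new text is observed, since only integers need to be incremented and the division is deferred to scoring time.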

【０１７１】 The extended character table 2, the score table 3, and the like in each of the above embodiments may be implemented with structures other than tables.

【０１７２】 Furthermore, although the target natural language in each of the above embodiments is Japanese, the present invention can also be applied to input text in other languages. These include not only languages in which word boundaries are not marked by spaces or the like, but also languages in which they are. For example, the features of the second embodiment that deal with garbled characters and the like are highly effective even in languages whose word boundaries are marked by spaces.

【０１７３】[0173]

[Effects of the Invention] As described above, according to the first aspect of the present invention, it is possible to realize a natural language processing device that obtains good morphological analysis results even for input text containing an unknown word portion caused by garbled characters, typing errors, or the like.

【０１７４】 According to the second aspect of the present invention, it is possible to realize a natural language processing device that detects an unknown word portion in input text containing such a portion caused by garbled characters, typing errors, or the like, and restores that portion to the correct character string.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the first embodiment.

FIG. 2 is an explanatory diagram showing the structure of the extended character table of the first embodiment.

FIG. 3 is an explanatory diagram showing a specific example of the extended character table of the first embodiment.

FIG. 4 is an explanatory diagram showing the structure of the score table of the first embodiment.

FIG. 5 is an explanatory diagram showing a specific example of the score table of the first embodiment.

FIG. 6 is a flowchart showing the overall operation of the first embodiment.

FIG. 7 is a flowchart showing the extended character string generation operation of the first embodiment.

FIG. 8 is a flowchart showing the chain probability calculation operation of the first embodiment.

FIG. 9 is a flowchart showing the optimal extended character string selection operation of the first embodiment.

FIG. 10 is a block diagram showing the configuration of the second embodiment.

FIG. 11 is an explanatory diagram showing the structure of the non-target character pattern memory of the second embodiment.

FIG. 12 is an explanatory diagram showing the structure of the input buffer memory of the second embodiment.

FIG. 13 is an explanatory diagram showing the structure of the unknown word buffer memory of the second embodiment.

FIG. 14 is a flowchart showing the overall operation of the second embodiment.

FIG. 15 is a flowchart showing the unknown word detection operation of the second embodiment.

FIG. 16 is a flowchart showing the input control operation of the second embodiment.

FIG. 17 is a flowchart showing the output synthesis operation of the second embodiment.

[Explanation of symbols]

2 ... extended character table, 3 ... score table, 4 ... extended character string generation unit, 5 ... extended character estimation unit, 6 ... chain probability calculation unit, 7 ... optimal path search unit, 9 ... unknown word detection unit, 10 ... non-target character pattern memory, 11 ... input control unit, 12 ... input buffer memory, 13 ... unknown word buffer memory, 14 ... output synthesis unit.

Claims

[Claims]

1. A natural language processing device comprising: an extended character string generation unit that forms extended characters by adding, to each character of the character string of a read input text, extension information including at least word delimiter information, and uses the extended characters to generate extended character strings for all combinations for the character string of the input text; an extended character storage unit that stores partial extended character strings having a fixed number of characters together with partial chain probability information for those partial extended character strings; a chain probability calculation unit that obtains chain probability information for each of the extended character strings generated by the extended character string generation unit, based on the paths of all partial extended character strings from the beginning to the end of the input text and on the partial chain probabilities stored in the extended character storage unit; a score storage unit that stores the obtained chain probability information; an optimal path search unit that selects, from the obtained chain probability information, the extended character string giving the optimal chain probability, and outputs, as the morphological analysis result, an analysis result including the sequence of words corresponding to that extended character string; and an extended character estimation unit that, when a partial extended character string of an extended character string generated by the extended character string generation unit does not exist in the extended character storage unit, estimates the partial chain probability information of that partial extended character string from the partial chain probability information of other partial extended character strings, stored in the extended character storage unit, that share extended characters with some of the extended characters of that partial extended character string.

2. The natural language processing device according to claim 1, wherein the input text is a Japanese text.

3. The natural language processing device according to claim 1 or 2, wherein the extended character estimation unit additionally stores the estimated chain probability information, together with the partial extended character string to which it relates, in the extended character storage unit.

4. The natural language processing device according to any one of claims 1 to 3, wherein the extended character estimation unit regards one or more extended characters of the partial extended character string that does not exist in the extended character storage unit as general-purpose characters that match any character, and estimates the partial chain probability information of that partial extended character string from the partial chain probability information of the one or more partial extended character strings, among the contents stored in the extended character storage unit, that match it.

5. The natural language processing device according to claim 1, wherein the extended character estimation unit uses, as the estimated partial chain probability information, the arithmetic mean of the partial chain probability information of the one or more partial extended character strings, among the contents stored in the extended character storage unit, that match.

6. The natural language processing device according to claim 1, wherein the optimal path search unit allows an optimal selection condition, used to select the optimal extended character string for the input text from the contents stored in the score storage unit, to be set externally.

7. The natural language processing device according to any one of claims 1 to 6, wherein the optimal path search unit provides, as a settable optimal selection condition, a condition that an extended character string whose chain probability information is equal to or greater than an arbitrary threshold is selected.

8. A natural language processing device comprising: a partial character string storage unit that stores partial character strings having a fixed number of characters; a non-target character pattern storage unit that stores patterns of pre-registered non-target characters that form unknown words; and an unknown word detection unit that detects an unknown word portion in a read input text based on the contents stored in the non-target character pattern storage unit, and estimates a character string considered correct for the unknown word portion based on the contents stored in the partial character string storage unit.

9. The natural language processing device according to claim 8, wherein a non-target pattern that enumerates non-target characters, and a non-target character range, are set in advance in the non-target character pattern storage unit.

10. The natural language processing device according to claim 8, further comprising morphological analysis means for performing morphological analysis on the estimated text, that is, the input text after the unknown word detection unit has performed the estimation operation.

11. The natural language processing device according to claim 10, wherein the morphological analysis means comprises: an extended character string generation unit that forms extended characters by adding, to each character of the character string of the input text, extension information including at least word delimiter information, and uses the extended characters to generate extended character strings for all combinations for the character string of the input text; an extended character storage unit that stores partial extended character strings having a fixed number of characters together with partial chain probability information for those partial extended character strings; a chain probability calculation unit that obtains chain probability information for each of the extended character strings generated by the extended character string generation unit, based on the paths of all partial extended character strings from the beginning to the end of the input text and on the partial chain probabilities stored in the extended character storage unit; a score storage unit that stores the obtained chain probability information; and an optimal path search unit that selects, from the obtained chain probability information, the extended character string giving the optimal chain probability, and outputs, as the morphological analysis result, an analysis result including the sequence of words corresponding to that extended character string.

12. The natural language processing device according to claim 11, wherein the morphological analysis means further comprises an extended character estimation unit that, when a partial extended character string of an extended character string generated by the extended character string generation unit does not exist in the extended character storage unit, estimates the partial chain probability information of that partial extended character string from the partial chain probability information of other partial extended character strings, stored in the extended character storage unit, that share extended characters with some of the extended characters of that partial extended character string.

13. The natural language processing device according to claim 11, wherein the unknown word detection unit uses the extended character storage unit as the partial character string storage unit.

14. The natural language processing device according to claim 10, wherein the unknown word detection unit, when the number of characters in the detected unknown word portion is equal to or greater than a predetermined number, generates marker text in which the unknown word portion is marked, without performing the operation of estimating a correct character string, the device further comprising a morphological analysis input/output control unit that separates, from the marker text, the character string portions other than the unknown word portion and supplies them to the morphological analysis means, and combines the analysis result for each character string portion from the morphological analysis means with the unknown word portion.

15. The natural language processing device according to claim 8, wherein the input text is a Japanese text.
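For illustration only, the estimation recited in claims 4 and 5 (treating some positions as general-purpose characters and taking the arithmetic mean of the matching entries) might be sketched as below. This sketch is not part of the claims and rests on assumptions: extended characters are represented as plain strings inside tuples, the caller chooses which positions are treated as general-purpose characters, and the name `estimate_partial_probability` and the table values are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

NGram = Tuple[str, ...]


def estimate_partial_probability(ngram: NGram,
                                 table: Dict[NGram, float],
                                 wildcard_positions: Set[int]) -> Optional[float]:
    """Average the probabilities of stored N-grams that match outside the wildcards."""
    matches = [
        prob for key, prob in table.items()
        if len(key) == len(ngram)
        and all(i in wildcard_positions or key[i] == ngram[i]
                for i in range(len(ngram)))
    ]
    if not matches:
        return None  # nothing stored shares the fixed positions
    return sum(matches) / len(matches)


# Toy table with invented partial chain probabilities.
TOY_TABLE = {
    ("こ", "の", "形"): 0.5,
    ("こ", "の", "解"): 0.25,
    ("そ", "の", "形"): 0.125,
}

# ("こ", "の", "謎") is not stored, so its last position is treated as a wildcard.
estimated = estimate_partial_probability(("こ", "の", "謎"), TOY_TABLE, {2})
print(estimated)
```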