JPH0519185B2

Info

Publication number: JPH0519185B2
Application number: JP63131630A
Authority: JP
Inventors: Hiroshi Sano
Original assignee: Agency of Industrial Science and Technology
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 1988-05-31
Filing date: 1988-05-31
Publication date: 1993-03-16
Also published as: JPH01302470A

Description

[Detailed Description of the Invention] [Object of the Invention] (Field of Industrial Application) The present invention relates to an unknown word processing device for the morphological analysis of sentences, which handles unknown words detected during the morphological analysis of natural language sentences.

(Prior Art) In text processing devices such as machine translation systems, accurate morphological analysis of natural language sentences is important. Conventionally, morphological analysis has been performed by checking the connections between parts of speech, using a word dictionary, the part-of-speech classification of its entries, and a part-of-speech connection table based on that classification.

However, consider the following three example sentences and their analyses:

「遅れて来た」: verb + auxiliary verb, verb
「遅れて来た」: adjective, verb
「ゆつくり来た」: adverb, verb

In each sentence, the word preceding the verb 「来る」 ("come") modifies the degree of the verb's action, and in this respect the three sentences are syntactically identical. For morphological analysis, it is therefore sufficient to know that each example has the form "adverbial modifier + predicate". Nevertheless, conventional morphological analysis relies exclusively on form-centered analysis, namely the two-dimensional connectivity given by a part-of-speech connection table. It therefore requires a fine-grained part-of-speech classification, which makes registering words in the dictionary cumbersome, and it has no means at all of handling unregistered words.

(Problems to Be Solved by the Invention) As described above, conventional morphological analysis is driven by a part-of-speech connection table, so dictionary registration is cumbersome and unregistered words cannot be handled.

An object of the present invention is to solve the conventional problems described above and to provide an unknown word processing device for the morphological analysis of sentences that makes dictionary registration for morphological analysis easy and that can assign a morpheme-level classification name even to an unknown word.

[Structure of the Invention] (Means for Solving the Problems) The unknown word processing device according to the present invention comprises: analysis-failure character string detection means for detecting a character string for which morphological analysis of an input sentence has failed; affix/auxiliary analysis means for analyzing the detected analysis-failure character string from the reverse direction to find the affixes and auxiliaries it contains; unknown word detection means for detecting the unknown word by removing the found affixes and auxiliaries from the analysis-failure character string; and analysis result output means for predicting the word base name of the unknown word from the connection relation information of the found affixes and auxiliaries and outputting the predicted word base name as the morphological name of the unknown word.

Here, a word base is what defines a morpheme, and it is defined as the "smallest meaningful linguistic form" (Ken Morioka, "語彙の形成" (Formation of Vocabulary), Meiji Shoin, 1987). "Meaningful" refers to the part of a character string that expresses a concept or idea. A word base is independent of how the character string functions syntactically: it is a unit considered, for the time being, separately from matters such as serving as a subject, inflecting, or modifying a verb.

In contrast, a stem is recognized only in so-called inflecting words and has an explicit syntactic function. A word base is more general than a stem.

For example, 「高−」 is regarded as the stem of an adjective. Under this classification, analyzing 「高−い」 and 「高−かつた」 is relatively simple, even though adjectives are treated as independent, inflecting words. However, consider the composition of words such as 「高まる」 (verb), 「高々」 (adverb), and 「高さ」 (noun). 「高々」 and 「高さ」 have no inflection, so the analysis of word composition in this case must assume that the stem 「高−」, originally part of an independent inflecting word, changes its function, and that the word further changes as vocabulary when a specific suffix such as 「−さ」 or 「−まる」 is attached. This makes the analysis procedure cumbersome for machine processing.

In contrast, under the word base approach 「高−」 is classified as an adjectival base. This classification depends on the kind of concept or idea expressed and is independent of inflection, part of speech, and so on. Attaching suffixes to a word base can therefore form (1) adjectives, (2) verbs, (3) adverbs, and (4) nouns, so the rules are more general than with stems. For example, 「高−い」 (adjective), 「高−まる」 (verb), and 「高−々」 (adverb) above can all be analyzed on the basis of the word base regardless of whether they inflect, which reduces the complexity of the analysis procedure in machine processing.

(Operation) The present invention abandons the conventional part-of-speech classification of words. Instead, morphological analysis is performed from the standpoint of word composition, based on word bases, which are the notional elements that make up words, and on affixes and auxiliaries, which are the function words that allow a word base to stand on its own in a sentence. If the analysis fails, the analysis-failure character string is analyzed from the reverse direction. A phrase (bunsetsu) is normally formed by attaching dependent words such as affixes and auxiliaries to the word base at its core. By analyzing a phrase for which morphological analysis has failed in the reverse direction, the affixes and auxiliaries can therefore be analyzed first. Because affixes and auxiliaries are dependent words, they are far fewer in kind than word bases, and registering all of them exhaustively is easy. These affixes and auxiliaries also supply predictive information for deciding the classification name of the unknown word. Thus, even for an analysis-failure character string, analyzing the affixes and auxiliaries it contains makes it possible to predict the classification name of the unknown word base from their connection relations. By outputting this predicted word base name as the morphological analysis result for the unknown word, a morphological analysis result can be obtained even when the unknown word is not registered in the dictionary.

Therefore, according to the present invention, dictionary registration is simple, and morphological analysis can be performed even on unknown words.

(Embodiment) An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 shows the configuration of a morphological analysis device equipped with an unknown word processing device according to an embodiment of the present invention.

This morphological analysis device comprises a morphological analysis unit 1, an analysis-failure character string detection unit 2, a character string conversion unit 3, an affix/auxiliary analysis unit 4, an analysis result comparison unit 5, a predicted word base registration unit 6, an unknown word replacement unit 7, and an analysis result output unit 8.

The morphological analysis unit 1 performs morphological analysis of an input sentence based on a word dictionary, a word base dictionary, and an affix/auxiliary dictionary (none shown). The word base dictionary is generated from the word dictionary and stores word bases (in general, stems), the notional elements at the core of words. Words come in kinds such as predicates, adverbial words, adnominal words, and independent words, and they are the elements that function in syntactic analysis. A sentence is treated as a chain of such words. According to its nature, each word base is given a part-of-speech-like characterization such as kinship base, personal base, honorific base, verbal base, adjectival base, or nominal base. The affix/auxiliary dictionary is likewise generated from the word dictionary. Its entries act as function words that combine with a word base to derive a word, inflect it, or make it independent. According to their nature, affixes and auxiliaries are also given part-of-speech-like characterizations such as case affix/auxiliary, honorific affix/auxiliary, verb-type inflecting affix/auxiliary, and adjective-type inflecting affix/auxiliary.

The analysis-failure character string detection unit 2 detects a character string, specifically a phrase, for which analysis in the morphological analysis unit 1 has failed. The character string conversion unit 3 reverses the order of the failed character string detected by the analysis-failure character string detection unit 2. The affix/auxiliary analysis unit 4 analyzes the reversed analysis-failure character string from its beginning, thereby analyzing the dependent-word portion of the failed character string. The analysis result comparison unit 5 compares the result of this affix/auxiliary analysis with the reversed character string from the character string conversion unit 3 to detect the unknown word. The predicted word base registration unit 6 registers, together with the detected unknown word, the word base of that unknown word as predicted from the analysis result of the affix/auxiliary analysis unit 4. The unknown word replacement unit 7 restores the unknown word to its original order and outputs it, together with the predicted word base, to the analysis result output unit 8. The analysis result output unit 8 adds the predicted word base of the unknown word to the analysis result from the morphological analysis unit 1 and outputs the combined result as the analysis result.

Next, the operation of the device configured as described above is explained with reference to the flowchart of FIG. 2.

First, the input character string is morphologically analyzed by the morphological analysis unit 1 (S1). For example, suppose the following sentence is input:

「今日、太郎さんが花子さんとインド料理を食べましたが、その時に辛いのが苦手な花子さんは大変困りました。」 ("Today Taro ate Indian food with Hanako, and Hanako, who cannot handle spicy food, was very troubled.")

In analyzing 「太郎さんが」, the nominal base (太郎), the affix (さん), and the auxiliary (が) are extracted as shown in FIG. 3. What is written to the right of each is its connection information. In this connection information, "derivation" indicates that morphemes go on connecting, and the value in parentheses indicates the nature of the suffix that follows. "Inflection" indicates that the morphological connection ends there and that the element connects to the form given by the value in parentheses to make up one character string unit. The labels in parentheses are thus used to check the connections within the character string, while derivation and inflection indicate the path by which the word is built. The path marked in FIG. 3 is the analysis path. That is, 「太郎さんが」 is analyzed as

−太郎−derivation (honorific)−さん−inflection (case)−が−

First, 「太郎」 derives to the honorific affix 「さん」, forming the new derived base 「太郎さん」. The case auxiliary 「が」 is then attached, as with an ordinary nominal, and the phrase inflects.
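
The connection check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the patent's implementation: the `LEXICON` table, its `label` and `next` fields, and the `analyze` function are all hypothetical, and the labels only approximate the connection information of FIG. 3, which is not reproduced in the text.

```python
# Hypothetical lexicon: each entry carries its own label and the set of
# labels that may follow it (a stand-in for the connection information).
LEXICON = {
    "太郎": {"label": "nominal base", "next": {"honorific", "case"}},
    "さん": {"label": "honorific", "next": {"case"}},
    "が": {"label": "case", "next": set()},  # inflection: the phrase ends here
}


def analyze(phrase, lexicon=LEXICON):
    """Return [(surface, label), ...], or None if the analysis fails."""
    path = []
    pos = 0
    allowed = None  # None means any entry may start the phrase
    while pos < len(phrase):
        # Try the longest candidate first.
        for length in range(len(phrase) - pos, 0, -1):
            surface = phrase[pos:pos + length]
            entry = lexicon.get(surface)
            if entry and (allowed is None or entry["label"] in allowed):
                path.append((surface, entry["label"]))
                pos += length
                allowed = entry["next"]
                break
        else:
            return None  # no entry fits at this position: analysis failure
    return path
```

With 「太郎」 in the lexicon, `analyze("太郎さんが")` returns the three-step path. With 「太郎」 removed, no entry matches at the start of the phrase and the function returns `None`, which corresponds to the failure case discussed next.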

This is the case where 「太郎」 is a registered word. If 「太郎」 were not registered in the dictionary, morphological analysis of the phrase 「太郎さんが」 would fail in the morphological analysis unit 1.

The analysis-failure character string detection unit 2 detects such an analysis-failure character string (S2). In this example, the phrase 「太郎さんが」 is detected as the analysis-failure character string and sent to the character string conversion unit 3.

When a failed character string has been detected by the analysis-failure character string detection unit 2 (S3), the character string conversion unit 3 reverses the analysis-failure character string 「太郎さんが」 into 「がんさ郎太」 (S4) and outputs it to the affix/auxiliary analysis unit 4 and the analysis result comparison unit 5.

The affix/auxiliary analysis unit 4 is provided with rules derived from those prepared for the morphological analysis unit 1: as shown for example in FIG. 4, the analysis rules for affixes and auxiliaries only, with word bases excluded, are reversed for every character string and every connection relation. Because these analysis rules consist only of dependent words, they form a small rule set that can be exhaustive. The affix/auxiliary analysis unit 4 applies these rules to the reversed analysis-failure character string to perform affix/auxiliary analysis (S5). The analysis proceeds along the path marked in the figure,

−が−inflection (case)−んさ−derivation (honorific)−

and yields the affix/auxiliary sequence analysis result 「がんさ（derivation [honorific]）／Ｘ」.
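
As a rough illustration of step S5, the sketch below consumes dependent words from the front of the reversed phrase. It is only a sketch under assumptions: `REVERSED_RULES` and `analyze_adjuncts` are hypothetical names, the table holds just the two entries needed for this example, and the connection check between successive dependent words is omitted for brevity.

```python
# Hypothetical reversed rules: reversed surface -> (kind, label).
REVERSED_RULES = {
    "が": ("inflection", "case"),
    "んさ": ("derivation", "honorific"),  # reversed form of さん
}


def analyze_adjuncts(reversed_phrase, rules=REVERSED_RULES):
    """Match dependent words at the front of the reversed phrase.

    Returns (matched_prefix, connection), where connection is the
    (kind, label) of the last dependent word matched, i.e. the one
    adjacent to the unknown word, or None if nothing matched.
    """
    pos = 0
    connection = None
    while True:
        for length in range(len(reversed_phrase) - pos, 0, -1):
            piece = reversed_phrase[pos:pos + length]
            if piece in rules:
                connection = rules[piece]
                pos += length
                break
        else:
            break  # nothing matches at this position: stop
    return reversed_phrase[:pos], connection
```

For the reversed phrase 「がんさ郎太」 this yields the prefix 「がんさ」 together with the connection `("derivation", "honorific")`, mirroring the result 「がんさ（derivation [honorific]）／Ｘ」 in the text.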

Next, the analysis result comparison unit 5 compares the reversed analysis-failure character string output by the character string conversion unit 3 with what remains after the affix/auxiliary sequence analyzed by the affix/auxiliary analysis unit 4 is removed from it, and thereby detects the unknown word (S6). In the above example, 「郎太」 is detected as the character string of the unknown word.
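
Step S6 amounts to removing the analyzed dependent-word sequence from the front of the reversed phrase. A minimal sketch, assuming the sequence is available as a plain string (the function name `detect_unknown` is hypothetical):

```python
def detect_unknown(reversed_phrase, adjunct_sequence):
    """Return the part of the reversed phrase left after the adjunct sequence."""
    if not reversed_phrase.startswith(adjunct_sequence):
        raise ValueError("adjunct sequence is not a prefix of the reversed phrase")
    return reversed_phrase[len(adjunct_sequence):]
```

Applied to 「がんさ郎太」 and 「がんさ」, this returns 「郎太」, which is still in reversed order at this stage.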

The predicted word base registration unit 6 predicts the word base of the unknown word X from the affix/auxiliary analysis result (S7). In the above example, the analysis result for the affix/auxiliary sequence gives "derivation (honorific)". A word base that can derive and take an "honorific" affix after it is a nominal base, so 「郎太」 can be predicted to be a nominal base. This predicted word base is registered together with the unknown word.
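
The prediction in step S7 can be pictured as a table lookup keyed on the connection information. The sketch below is an assumption-laden illustration: `BASE_FOR_CONNECTION` and `predict_and_register` are hypothetical names, and the table contains only the one mapping stated in the text.

```python
# Hypothetical lookup: connection found next to the unknown word -> base class.
BASE_FOR_CONNECTION = {
    ("derivation", "honorific"): "nominal base",
}


def predict_and_register(reversed_unknown, connection, registry,
                         table=BASE_FOR_CONNECTION):
    """Predict the base class and record it with the unknown word.

    Returns the predicted class, or None when the connection is not in
    the table (in which case nothing is registered).
    """
    predicted = table.get(connection)
    if predicted is not None:
        registry[reversed_unknown] = predicted
    return predicted
```

Passing the registry explicitly keeps the function free of hidden state. Whether the real device stores the unknown word in reversed or restored order at this point is not specified beyond the description above, so the sketch simply stores what it is given.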

The unknown word replacement unit 7 restores the unknown word 「郎太」 to 「太郎」 and sends it, together with the predicted word base, to the analysis result output unit 8. The analysis result output unit 8 adds the analysis result for the unknown word to the analysis result from the morphological analysis unit 1 and outputs the combined result as the morphological analysis result (S8).

Next, consider 「辛いのが」 in the example above. From it, the adjectival base (辛), the i-type inflectional auxiliaries (い) and (かつた), the functional auxiliary (の), and the case auxiliary (が) shown in FIG. 5 are extracted. As in FIG. 3, what is written to the right of each is its connection information. Here too, morphological analysis proceeds along the path marked in FIG. 5.

In this example, if 「辛い」 were not registered in the dictionary, morphological analysis of the phrase 「辛いのが」 would fail. In that case, as in the processing described above, the affix/auxiliary analysis unit 4 obtains the analysis result 「がのい（derivation [イ系ア]）」. A word base whose backward connection label is derivation [イ系ア] is an adjectival base, so the classification name of the unknown character string 「辛」 can be predicted to be adjectival base.

Similarly, if analysis fails on 「太郎のが」, the unknown word analysis result is 「がの（derivation [neutral]）」, which shows that 「太郎」, detected as the unknown word, is a nominal base. The same holds for 「太郎さんのが」: affix/auxiliary analysis yields 「がのんさ（derivation [neutral]）」, and 「太郎」, detected as the unknown word character string, is classified as a nominal base.
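
The three examples above can be run end to end with a small sketch. It makes a deliberate simplification that should be read as an assumption: instead of reproducing the connection labels of FIG. 4 (which the text does not give in full), the hypothetical table `REVERSED_ADJUNCTS` maps each reversed dependent word directly to the base class predicted when that word is the one adjacent to the unknown word. The function name `process_unknown` is likewise hypothetical.

```python
# Hypothetical table: reversed dependent word -> base class predicted when
# this dependent word sits directly next to the unknown word.
REVERSED_ADJUNCTS = {
    "が": "nominal base",
    "の": "nominal base",
    "んさ": "nominal base",    # reversed form of さん
    "い": "adjectival base",
}


def process_unknown(phrase, rules=REVERSED_ADJUNCTS):
    """Reverse the phrase, strip dependent words, restore the remainder.

    Returns (unknown_word, predicted_base_class).
    """
    reversed_phrase = phrase[::-1]
    pos = 0
    predicted = None
    matched = True
    while matched:
        matched = False
        for length in range(len(reversed_phrase) - pos, 0, -1):
            piece = reversed_phrase[pos:pos + length]
            if piece in rules:
                predicted = rules[piece]
                pos += length
                matched = True
                break
    unknown = reversed_phrase[pos:][::-1]  # back to the original order
    return unknown, predicted
```

Under these assumptions, 「辛いのが」 gives 「辛」 as an adjectival base, while 「太郎のが」 and 「太郎さんのが」 both give 「太郎」 as a nominal base, matching the conclusions in the text. A table this small would misfire on unknown words that themselves end in one of the listed kana, so a fuller rule set with connection checks would be needed in practice.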

Such prediction is possible because the underlying morphological analysis is based on word composition. Compared with checking connections between parts of speech, the word composition approach of this device has the advantage of yielding predictive information for identifying unknown words, and this information allows the classification name of an unknown word to be predicted accurately.

Thus, with this device, even when 「太郎」, 「辛」, and the like are unknown words, the classification name of the unknown word can be predicted from the information in the affixes and auxiliaries attached to it and output as the analysis result.

The present invention is not limited to the embodiment described above. For example, the classification name analysis result for an unknown word may be stored so that, when the same unknown word is input again, the stored information is used to predict its classification name.
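
One way to picture this variation is a small cache wrapped around whatever prediction routine is in use. The sketch below is hypothetical throughout: `make_cached_predictor` and the shape of the `predict` callback are assumptions, and the text does not say how the stored information is keyed, so the unknown word itself is used as the key here.

```python
def make_cached_predictor(predict):
    """Wrap a predictor so each unknown word is analyzed only once.

    `predict` is any callable taking (unknown_word, connection) and
    returning a base class name. Returns the wrapped function and the
    dictionary used as storage.
    """
    cache = {}

    def cached(unknown_word, connection):
        if unknown_word not in cache:
            cache[unknown_word] = predict(unknown_word, connection)
        return cache[unknown_word]

    return cached, cache
```

On a second occurrence of the same unknown word, the stored classification is returned without calling the underlying predictor again.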

The rules of the affix/auxiliary analysis unit 4 can be obtained from the rules provided in the morphological analysis unit 1, for example as shown in FIG. 6. The rules stored in the morphological analysis rule storage unit 9 are given to the conversion device 10. As shown in FIG. 7, the conversion device 10 extracts the dependent-word rules (S11), converts the surface character strings (S12), and converts the connection relations (S13). The result, in which every character string of the closed vocabulary is reversed and the forward and backward connection relations are swapped, serves as the affix/auxiliary rules for unknown word analysis and is stored in the affix/auxiliary rule storage unit 11. Affix/auxiliary rules can be obtained easily by this processing.
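
The three conversion steps S11 to S13 can be sketched as a single pass over the forward rules. The rule format below (`surface`, `dependent`, `before`, `after`) is a hypothetical representation chosen for illustration; the patent does not specify how rules are stored.

```python
# Hypothetical forward rules. "before" is the label required of the
# preceding morpheme, "after" the label offered to the following one.
FORWARD_RULES = [
    {"surface": "太郎", "dependent": False, "before": None, "after": "honorific"},
    {"surface": "さん", "dependent": True, "before": "honorific", "after": "case"},
    {"surface": "が", "dependent": True, "before": "case", "after": None},
]


def convert_rules(forward_rules):
    """Derive reversed dependent-word rules from the forward rules."""
    converted = []
    for rule in forward_rules:
        if not rule["dependent"]:
            continue  # S11: keep only dependent-word rules
        converted.append({
            "surface": rule["surface"][::-1],  # S12: reverse the surface string
            "before": rule["after"],           # S13: swap forward and backward
            "after": rule["before"],           #      connection relations
        })
    return converted
```

Because only the closed class of dependent words is converted, the output stays small regardless of how many word bases the forward rule set contains.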

[Effects of the Invention] As described above, according to the present invention, unlike conventional classification by part of speech, analysis is performed according to the function of each morpheme, and for an analysis-failure character string the classification name of the unknown word is predicted by analyzing the dependent-word portion from the reverse direction. A rigorous dictionary is therefore unnecessary, dictionary registration is simple, and morphological analysis can be performed even on unknown words. In particular, analysis efficiency improves for words conventionally regarded as belonging to multiple parts of speech.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a morphological analysis device according to an embodiment of the present invention. FIG. 2 is a flowchart showing the morphological analysis procedure in the device. FIG. 3 shows an example of the morphological analysis rules in the device. FIG. 4 shows an example of the affix/auxiliary analysis rules in the device. FIG. 5 shows another example of the morphological analysis rules in the device. FIG. 6 is a block diagram showing the conversion means for the affix/auxiliary analysis rules in the device. FIG. 7 is a flowchart showing the processing procedure of the conversion means.

1: morphological analysis unit, 2: analysis-failure character string detection unit, 3: character string conversion unit, 4: affix/auxiliary analysis unit, 5: analysis result comparison unit, 6: predicted word base registration unit, 7: unknown word replacement unit, 8: analysis result output unit.

Claims

[Claims]

1. An unknown word processing device for the morphological analysis of sentences, comprising: analysis-failure character string detection means for detecting a character string for which morphological analysis of an input sentence has failed; affix/auxiliary analysis means for analyzing the detected analysis-failure character string from the reverse direction to find the affixes and auxiliaries contained in the analysis-failure character string; unknown word detection means for detecting an unknown word by removing the found affixes and auxiliaries from the analysis-failure character string; and analysis result output means for predicting the word base name of the unknown word and outputting the predicted word base name as the morphological name of the unknown word.