JPH0619960A

JPH0619960A - Morpheme analyzing processing method

Info

Publication number: JPH0619960A
Application number: JP4172177A
Authority: JP
Inventors: Tsuyoshi Kitani; 強木谷
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1992-06-30
Filing date: 1992-06-30
Publication date: 1994-01-28

Abstract

PURPOSE: To achieve highly accurate morpheme segmentation and the like by dividing a character string based on character-type change information and, when the connection between morphemes is broken, identifying an unregistered word from part-of-speech categories, character types, and character counts. CONSTITUTION: A dictionary search string generation processing part 2 exhaustively extracts consecutive characters from an input character string to form partial character strings for dictionary lookup, and a dictionary search processing part 3 looks up each partial character string in an independent word dictionary and an adjunct word dictionary and determines its part of speech. An inter-morpheme connection check processing part 4 then checks the connection conditions of adjacent morphemes against a morpheme connection dictionary, and a word table registration processing part 5 registers the connectable morphemes. A character-type change point determination processing part 6 divides the input character string at character-type change positions. When an unregistered word exists, an unregistered word range determination processing part 7 determines the range of the character string that contains the unregistered word, and an unregistered word range assembly processing part 9 identifies the unregistered word based on the part-of-speech category of each morpheme, the number of characters in the morpheme, and so on.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a morphological analysis processing method that divides text into morphemes and assigns parts of speech. In particular, it relates to a morphological analysis processing method that analyzes Japanese text as the first stage of processing it and whose result is used as input to a syntactic/semantic parser.

【０００２】[0002]

[Prior Art] Conventionally, unregistered words (words not in the dictionary) have been identified as follows. Within the range of a provisional phrase, every combination of consecutive characters is looked up in the dictionary exhaustively. The longest-match method is then applied from the front of the provisional phrase, preferentially adopting the longest morpheme from the head, and any part that is not registered in the dictionary, or that is registered but whose connection to neighboring morphemes is not permitted, is treated as an unregistered word. Here, a provisional phrase (仮文節) is a character string delimited by break positions, where the break positions are, for example, a change from hiragana to another character type, the positions before and after a symbol character, a change from a non-hiragana string to a digit string, and a change from a digit string to a non-hiragana string. However, this conventional method does not identify unregistered words with high accuracy. Consider, for example, the provisional phrase 『鼎立し』, which contains an unregistered word. Suppose that exhaustive search-string generation and dictionary lookup show 『立』 and 『し』 to be registered words and 『鼎』, 『鼎立』, and 『鼎立し』 to be unregistered. Although 『鼎立』 should be analyzed as the unregistered word, the conventional method treats 『鼎』 as the unregistered word. As another example, consider the provisional phrase 『御簾納会長』 and suppose that 『御』, 『簾』, 『納』, 『納会』, 『会』, 『会長』, and 『長』 are registered words. The output should be 『御簾納／会長』 (with 『御簾納』 as an unregistered word), but because of the longest-match method from the front it is actually 『御／簾／納会／長』. Thus the conventional method not only fails to identify the unregistered word but also adversely affects the adjacent morphemes.
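The failure described above can be reproduced with a minimal sketch of front-to-back longest matching. The function name and the two tiny dictionaries are illustrative assumptions built from the examples in the text; this is not the conventional system's actual implementation.

```python
def longest_match_forward(text, dictionary):
    """Greedy longest match from the front.

    Characters that start no dictionary entry are emitted one at a time
    and flagged as unregistered (second tuple element is False).
    """
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                out.append((text[i:j], True))
                i = j
                break
        else:
            out.append((text[i], False))
            i += 1
    return out


# Registered words assumed in the two examples of the description.
print("/".join(w for w, _ in longest_match_forward("鼎立し", {"立", "し"})))
# 鼎/立/し  (only 鼎 is flagged as unregistered, not 鼎立)

registered = {"御", "簾", "納", "納会", "会", "会長", "長"}
print("/".join(w for w, _ in longest_match_forward("御簾納会長", registered)))
# 御/簾/納会/長
```

Because the greedy pass commits to 『納会』 as soon as it matches, 『会長』 is never considered, which is how the error spreads to the neighboring morphemes.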

【０００３】[0003]

Next, consider how to reorder multiple candidates produced by morphological analysis into order of likelihood. When morphological analysis yields multiple morpheme segmentation candidates for the same character string, or multiple part-of-speech candidates for the same morpheme, one way to sort them into a likely order is to assign priorities using morpheme usage frequency, which is statistical information. However, this usage frequency is not determined with the connection relations to the preceding and following morphemes taken into account; it is the frequency of the word treated in isolation, so the accuracy of the prioritization has been low. In the conventional processing method, the priority of grammatical connection relations based on the part-of-speech categories of neighboring morphemes was checked in syntactic analysis, the processing stage that follows morphological analysis.

【０００４】[0004]

[Problems to Be Solved by the Invention] As described above, in morphological analysis that divides Japanese text into morphemes and assigns parts of speech, the conventional method not only identifies unregistered words with low accuracy when they are present, but can also disturb the segmentation of adjacent morphemes. In addition, when morphological analysis yields multiple morpheme segmentation candidates for the same character string, or multiple part-of-speech candidates for the same morpheme, the accuracy of sorting them into a likely order is low. An object of the present invention is to solve these problems and to provide a morphological analysis processing method that, when a word not registered in the dictionary is present, correctly identifies it and treats it as a single morpheme, and that, when there are multiple morpheme segmentation candidates for the same character string or multiple part-of-speech candidates for the same morpheme, can sort them into a likely order with high accuracy.

【０００５】[0005]

[Means for Solving the Problems] To achieve the above object, the morphological analysis processing method of the present invention is (a) a morphological analysis processing method that uses an independent word dictionary in which independent words such as nouns, verbs, adjectives, and adjectival verbs are registered, an adjunct word dictionary in which particles, auxiliary verbs, inflectional endings, and the like are registered, and a morpheme connection dictionary that determines whether adjacent morphemes may connect based on their part-of-speech categories and character counts; that looks up an input character string in the independent word dictionary and the adjunct word dictionary to assign parts of speech; and that registers in a word registration table those morphemes from the lookup results whose connection is permitted by the morpheme connection dictionary. The method is characterized by a first step of dividing the character string based on character-type change information and determining the divided character strings; a second step of determining, when a word registered in neither the independent word dictionary nor the adjunct word dictionary breaks the connection of morphemes in the word registration table, the divided character string that contains the break position; and a third step of identifying the unregistered word within the range of that divided character string based on the part-of-speech categories of morphemes, character types, and character counts. The method is further characterized by (b) a fourth step of sorting morpheme segmentation candidates and part-of-speech candidates into a likely order, based on connection priority rules for the part-of-speech categories of adjacent morphemes, when the first step produces multiple morpheme segmentation candidates for the same character string or when lookup in the independent word dictionary and the adjunct word dictionary produces multiple part-of-speech candidates for the same morpheme.

【０００６】[0006]

[Operation] In the present invention, correctly identifying the range of an unregistered word allows the unregistered word to be treated as a single morpheme. As a result, its effect on the surrounding morphemes can be reduced. In addition, because morpheme segmentation candidates and part-of-speech candidates are sorted into a likely order during morphological analysis, the syntactic parser in the next stage can process only the likely candidates, which prevents a large number of parse candidates from being generated. In general, it is impossible to register in a dictionary every low-frequency morpheme, such as proper nouns and technical terms from various fields. Consequently, it is unavoidable that some words appearing in a text are not registered in the dictionary. If the range of an unregistered word can be correctly identified, the unregistered word can be treated as a single morpheme, and its effect on the segmentation of the surrounding registered morphemes can also be avoided. Incorrect word segmentation makes it impossible to assign parts of speech correctly, so correct segmentation also improves the accuracy of part-of-speech assignment. Moreover, because most unregistered words are nouns, the part of speech can also be estimated once an unregistered word has been identified. On receiving the result of morphological analysis, the syntactic parser in the next stage examines the grammatical connection relations between part-of-speech categories phrase by phrase or sentence by sentence and builds a parse tree. However, when multiple morpheme segmentations, multiple part-of-speech categories, and multiple grammatical connections are possible, multiple parse trees are often generated. The present invention sorts the multiple morpheme segmentation candidates and part-of-speech candidates into a likely order with high accuracy, so if the syntactic parser processes only the likely candidates, a proliferation of syntactic interpretations is avoided. Because morphological analysis is an indispensable first stage of Japanese text analysis, achieving highly accurate morpheme segmentation and part-of-speech assignment allows its results to be used for Japanese language processing in a variety of fields.

【０００７】[0007]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a functional block diagram of a morphological analysis processing system showing one embodiment of the present invention. As shown in FIG. 1, the morphological analysis processing system comprises an input processing unit 1, a dictionary search string generation processing unit 2, a dictionary search processing unit 3, an inter-morpheme connection check processing unit 4, a word table registration processing unit 5, a character-type change point determination processing unit 6, an unregistered word range determination processing unit 7, an analysis range assembly processing unit 8, an unregistered word range assembly processing unit 9, a morpheme/part-of-speech reordering processing unit 10, and an output processing unit 11. Each of these processing units is a program module executed by a processor, and they are invoked in the order indicated by the solid lines. The solid lines show the connection sequence between the processing units and thus the overall processing flow. First, the input processing unit 1 reads Japanese text from an input device. The dictionary search string generation processing unit 2 then exhaustively extracts consecutive characters from the input character string to generate partial character strings for dictionary lookup. These partial character strings need only be generated within the range of a provisional phrase. Setting their maximum length to the maximum length of the morphemes stored in the dictionary avoids unnecessary dictionary lookups. Next, the dictionary search processing unit 3 looks up the generated character strings in the independent word dictionary and the adjunct word dictionary and determines their part-of-speech types. The inter-morpheme connection check processing unit 4 then determines a part-of-speech category from the part-of-speech type and the number of characters in the morpheme, and decides whether adjacent morphemes may connect by referring to the inter-morpheme connection dictionary, which defines, with the part-of-speech categories of adjacent morphemes as parameters, whether a preceding and a following morpheme may connect. In other words, it checks the connection conditions of the adjacent morphemes found by lookup against the morpheme connection dictionary. Next, the word table registration processing unit 5 registers the connectable morphemes in the word registration table. Specifically, using the minimum-phrase-count method (文節数最小法), it selects the connectable morphemes that minimize the cumulative number of phrases from the front and registers them in the word registration table. The character-type change point determination processing unit 6 then divides the input character string at the positions where the character type changes.
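As a rough illustration of the character-type split performed by processing unit 6, the following sketch classifies each character by Unicode range and cuts wherever the class changes. The class names and code-point ranges are assumptions for illustration only; the patent does not say how character types are detected.

```python
from itertools import groupby


def char_type(ch):
    """Coarse character class based on Unicode code point (assumed ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "symbol"


def split_by_char_type(text):
    """Split text into maximal runs of characters sharing one class."""
    return ["".join(run) for _, run in groupby(text, key=char_type)]


pieces = split_by_char_type("御簾納候補の票が伸び悩み、")
print("|".join(pieces))
# 御簾納候補|の|票|が|伸|び|悩|み|、
```

For the sentence used in the worked example of paragraph [0015], this produces nine pieces, which agrees with the count given there.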

【０００８】[0008]

When the processing from the dictionary search string generation processing unit 2 through the word table registration processing unit 5 has finished for the entire input character string, the character-type change point determination processing unit 6 is invoked. The unregistered word range determination processing unit 7 determines, when the input character string contains an unregistered word, the range of the character string that contains it. That is, the character string is divided at the positions where the character type changes, and the divided character string is determined. If no unregistered word exists (step 15), the analysis range assembly processing unit 8 determines the combinations of output morphemes for the range in which no unregistered word exists. If an unregistered word exists (step 15), the current processing position is compared with the start position of the unregistered word range to determine whether they are equal or the latter is greater (step 16). If the latter is greater, the analysis range assembly processing unit 8 determines the combinations of output morphemes for the range in which no unregistered word exists. After that, or if the two positions are equal, the unregistered word range assembly processing unit 9 identifies the unregistered word in the range where one exists and determines the combination of morphemes to output. Next, it is determined whether processing has reached the end of the input character string (step 17). If it has, the morpheme/part-of-speech reordering processing unit 10 sorts the multiple morpheme segmentation candidates for the same character string, or the multiple part-of-speech candidates for the same morpheme, into a likely order according to the morpheme/part-of-speech reordering rules (see FIG. 2, described below). The output processing unit 11 then outputs the determined result to an output device.

【０００９】[0009]

FIG. 2 shows an example of the morpheme/part-of-speech reordering rules. FIG. 2(a) indicates that when, for the same character string, one morpheme segmentation candidate consists of a prefix counter (前置助数詞) and a number and there are other segmentation candidates, the candidate consisting of the prefix counter and the number is given priority. That is, seg1, which contains pos1 (a prefix counter) and pos2 (a number), is ranked above seg2, which contains some other pos3. FIG. 2(b) indicates that when, for the same character string, the morpheme immediately following a noun morpheme may be either a noun or a suffix, the suffix is given priority. That is, in seg2, between pos2, which takes the following morpheme as a noun, and pos3, which takes it as a suffix, pos3 is ranked higher. The detailed flowcharts of the processing units in FIG. 1 are described below with reference to FIGS. 4 to 7.
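The two rules of FIG. 2 can be sketched as a stable sort that favors candidates containing a preferred pair of adjacent part-of-speech categories. The category labels, the scoring function, and the sample word are hypothetical; the patent gives the rules only in the form shown in FIG. 2.

```python
# Adjacent (left, right) category pairs that the rules of FIG. 2 prefer.
# Labels are illustrative assumptions, not the patent's own category names.
PREFERRED_PAIRS = {
    ("prefix_counter", "number"),  # rule (a)
    ("noun", "suffix"),            # rule (b)
}


def rule_hits(candidate):
    """Count adjacent morpheme pairs in a candidate that match a rule."""
    pairs = zip(candidate, candidate[1:])
    return sum(
        1 for (_, left), (_, right) in pairs if (left, right) in PREFERRED_PAIRS
    )


def reorder(candidates):
    """Put candidates with more rule hits first; ties keep their input order."""
    return sorted(candidates, key=rule_hits, reverse=True)


# Hypothetical ambiguity: the morpheme after a noun is either noun or suffix.
as_noun = [("研究", "noun"), ("者", "noun")]
as_suffix = [("研究", "noun"), ("者", "suffix")]
ranked = reorder([as_noun, as_suffix])
print(ranked[0])
# [('研究', 'noun'), ('者', 'suffix')]
```

Counting rule hits is only one way to apply such rules; an ordered rule list with first-match semantics would serve equally well for the two cases shown.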

【００１０】[0010]

FIG. 5 is a detailed operation flowchart of the unregistered word range determination processing unit in FIG. 1. As shown in FIG. 5, the unregistered word range determination processing unit 7 first checks whether the morphemes registered in the word registration table connect from the current processing position of the input character string through to its final character position (step 22). If they connect all the way (step 23), there is no unregistered word (step 28), and processing ends. If the connection breaks partway, an unregistered word exists, so the divided character string that contains the character position where the connection broke is taken as the minimum character string range containing the unregistered word (the unregistered word range) (step 24). However, if the word registration table contains no morpheme that ends at the character position immediately before the start of the divided character string (step 25), the range is extended backward through the input character string until that condition is satisfied (step 26). The resulting range of the divided character string is taken as the unregistered word range (step 27).
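One possible reading of steps 22 to 27 is sketched below. Morphemes and character-type segments are represented as half-open `(start, end)` index pairs, and the break position is taken to be the furthest position that any chain of table morphemes reaches. That interpretation, the data layout, and the sample word table are all assumptions; the patent describes the procedure only at flowchart level.

```python
def find_unregistered_range(n, pos, spans, segments):
    """Return the unregistered word range as (start, end), or None.

    n        : length of the input string
    pos      : current processing position
    spans    : set of (start, end) spans of morphemes in the word table
    segments : list of (start, end) character-type segments covering 0..n
    """
    # Step 22: positions reachable from pos by chaining table morphemes.
    reachable = {pos}
    frontier = [pos]
    while frontier:
        p = frontier.pop()
        for s, e in spans:
            if s == p and e not in reachable:
                reachable.add(e)
                frontier.append(e)
    if n in reachable:
        return None  # steps 23 and 28: connected to the end

    # Step 24 (assumed): the break is the furthest reachable position.
    break_pos = max(reachable)
    hit = next(i for i, (s, e) in enumerate(segments) if s <= break_pos < e)

    # Steps 25 and 26: extend backward until a morpheme ends at the start.
    morpheme_ends = {e for _, e in spans}
    first = hit
    while (first > 0 and segments[first][0] > pos
           and segments[first][0] not in morpheme_ends):
        first -= 1
    return (max(segments[first][0], pos), segments[hit][1])


# Hypothetical word table for 御簾納候補の票が伸び悩み (comma omitted).
spans = {(3, 5), (5, 6), (6, 7), (7, 8), (8, 12), (8, 11), (11, 12)}
segments = [(0, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12)]
print(find_unregistered_range(12, 0, spans, segments))  # (0, 5)
print(find_unregistered_range(12, 5, spans, segments))  # None
```

With this table no morpheme starts at position 0, so the range is the first kanji segment, 御簾納候補; restarting after it finds no further break.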

【００１１】[0011]

FIG. 6 is a detailed operation flowchart of the analysis range assembly processing unit in FIG. 1. As shown in FIG. 1, the current processing position is compared with the start position of the unregistered word range, and when the start position of the unregistered word range lies to the right of the current processing position (step 16), the analysis range assembly processing unit 8 follows the procedure shown in FIG. 6 and recursively determines every combination of morphemes that connect within the range from the current processing position to just before the start of the unregistered word range. First, it determines whether the start position is greater than the end position of the input character string (step 32). If it is, that is, if the start position has moved past the end position, the assembly processing ends (step 37). Otherwise, a morpheme in the word registration table that begins at the start position is output (step 33). Next, the character position immediately after the end of the output morpheme becomes the new start position (step 34), and the analysis range assembly processing is called recursively with that start position as its parameter (step 35). It is then determined whether any other morpheme begins at the start position (step 36). If so, processing returns to step 33; if not, processing ends (step 37).
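The recursion of FIG. 6 amounts to enumerating every chain of table morphemes that covers a range. A minimal sketch follows, assuming morphemes are stored as `(start, end, surface)` triples with half-open indices; the table contents are hypothetical and chosen to match the remainder of the worked example in paragraph [0015].

```python
def enumerate_paths(start, end, table):
    """Return every sequence of table morphemes that exactly covers [start, end)."""
    if start >= end:
        return [[]]  # past the end: one empty continuation (steps 32 and 37)
    paths = []
    for s, e, surface in table:
        if s == start and e <= end:  # a morpheme beginning at start (step 33)
            # Continue from the position right after it (steps 34 and 35).
            for rest in enumerate_paths(e, end, table):
                paths.append([surface] + rest)
    return paths


# Hypothetical word table for the string の票が伸び悩み.
table = [
    (0, 1, "の"), (1, 2, "票"), (2, 3, "が"),
    (3, 7, "伸び悩み"), (3, 6, "伸び悩"), (6, 7, "み"),
]
for path in enumerate_paths(0, 7, table):
    print("/".join(path))
# の/票/が/伸び悩み
# の/票/が/伸び悩/み
```

A chain that cannot reach the end contributes nothing, because the loop finds no morpheme to extend it and returns an empty list of paths.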

【００１２】[0012]

FIG. 7 is a detailed operation flowchart of the unregistered word range assembly processing unit in FIG. 1. As shown in FIG. 7, the unregistered word range assembly processing unit 9 determines the morphemes to output by the longest-match method from the front, within the unregistered word range determined by the unregistered word range determination processing unit (see FIG. 5) (step 42). However, based on the empirical rule that most one-character independent words are errors, only particles, auxiliary verbs, inflectional endings, single alphabetic characters, single symbol characters, single digits, and independent words of two or more characters are registered in the output table as output morphemes (step 43). When the start position has passed the end position of the input character string (step 45), it is further determined whether part of a run of consecutive katakana characters is missing from the word registration table (step 46). If so, the entire katakana string is treated as an unregistered word (step 47). Any consecutive characters that have not been output are then grouped together as an unregistered word (step 48). The unregistered words are registered in the output table, and processing ends (step 49).
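Steps 42, 43, and 48 can be sketched as a longest-match pass that rejects one-character independent words and merges whatever is left over into unregistered words. The part-of-speech labels, the lexicon, and the way a rejected match advances by one character are assumptions, and the katakana check of steps 46 and 47 is omitted for brevity.

```python
FUNCTION_WORDS = {"particle", "auxiliary", "ending"}
SINGLE_CHAR_ONLY = {"alpha", "symbol", "digit"}


def acceptable(surface, pos):
    """Step 43 filter: which matched morphemes may be output."""
    if pos in FUNCTION_WORDS:
        return True
    if pos in SINGLE_CHAR_ONLY:
        return len(surface) == 1
    return len(surface) >= 2  # independent words need two or more characters


def assemble_unregistered_range(text, lexicon):
    """lexicon maps surface form -> part-of-speech label (hypothetical)."""
    out, pending, i = [], "", 0
    while i < len(text):
        # Step 42: longest match from the front.
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in lexicon),
            None,
        )
        if match is not None and acceptable(match, lexicon[match]):
            if pending:  # step 48: flush leftover characters as one word
                out.append((pending, "unregistered"))
                pending = ""
            out.append((match, lexicon[match]))
            i += len(match)
        else:
            pending += text[i]
            i += 1
    if pending:
        out.append((pending, "unregistered"))
    return out


lexicon = {"御": "noun", "簾": "noun", "納": "noun", "候補": "noun"}
print(assemble_unregistered_range("御簾納候補", lexicon))
# [('御簾納', 'unregistered'), ('候補', 'noun')]
```

The three one-character nouns are rejected by the filter and merged, so 御簾納 comes out as a single unregistered word followed by 候補.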

【００１３】[0013]

Returning to FIG. 1, when the current processing position equals the start position of the unregistered word range (step 16), there is no analyzable range to the left of the unregistered word range, so only the unregistered word range is output, by the unregistered word range assembly processing unit 9. If assembly has not yet reached the end of the input character string (step 17), the character position immediately after the unregistered word range becomes the current processing position and control returns to the unregistered word range determination processing unit 7. If no unregistered word exists further along the input character string (step 15), the analysis range assembly processing unit 8 determines the morphemes to output up to the end of the input. FIG. 8 is a detailed operation flowchart of the morpheme/part-of-speech reordering processing unit in FIG. 1. As shown in FIG. 8, the morpheme/part-of-speech reordering processing unit 10 first sets the processing position to the head of the input character string (step 52). When the output table contains multiple morpheme segmentation candidates (step 53), it sorts the segmentation candidates, for the range in which the same character string can be segmented in more than one way, into a likely order according to statistically derived morpheme reordering rules (step 54). It then advances the processing position (step 55). Once the processing position has reached the end of the input (step 56), it resets the processing position to the head of the input character string (step 57) and, when multiple part-of-speech candidates exist for the same morpheme (step 58), sorts them into a likely order according to statistically derived part-of-speech reordering rules (step 59). It then advances the processing position (step 60), and when the processing position reaches the end of the input (step 61), processing ends (step 62). Returning to FIG. 1, the output processing unit 11 finally outputs the morphemes and parts of speech, sorted into a likely order, to an external output device.

[0014] FIG. 3 and FIG. 4 show an example of each step in FIG. 1. Here, as shown in FIG. 3(a), the input character string is taken to be 『御簾納候補の票が伸び悩み、』. If the maximum length of a search character string is 8 characters, the dictionary search character string generation processing unit 2 generates 68 partial character strings. The comma (読点) is a symbol character, so it is not looked up in the dictionaries. Next, the dictionary search processing unit 3 searches the dictionaries for these partial character strings and, as shown in FIG. 3(b), finds that 16 morphemes exist in the independent word dictionary and the adjunct dictionary. Of these, the connections between 『御』 and 『簾』 and between 『簾』 and 『納』 are assumed to be disallowed by a definition in the morpheme connection dictionary that prohibits connecting a one-character independent word to another one-character independent word. Under this assumption, the inter-morpheme connection check processing unit 4 permits connections for 13 morphemes. Furthermore, the partial character string 『伸び悩み』 can be covered by a single bunsetsu (phrase) using either 『伸び悩』 and 『み』, or 『伸び悩み』, whereas 『伸』, 『び』, 『悩』, 『み』 or 『伸び』, 『悩み』 would require two bunsetsu. The minimum-bunsetsu-count method applied from the front therefore excludes the morphemes that yield two bunsetsu, and the morphemes finally registered in the word registration table are the 9 shown in FIG. 3(c).
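The count of 68 partial character strings can be checked with a short sketch. This is only an illustration, assuming that unit 2 enumerates every contiguous substring of up to 8 characters and drops any substring containing a symbol character; the function name `dictionary_search_strings` and the `skip` set are hypothetical and do not come from the patent.

```python
def dictionary_search_strings(text, max_len=8, skip=frozenset("、。")):
    """Enumerate contiguous substrings of at most max_len characters.

    Substrings containing a symbol character (here: Japanese comma or
    full stop, an assumed set) are not generated, because symbols are
    not looked up in the dictionaries.
    """
    out = []
    for start in range(len(text)):
        last_end = min(start + max_len, len(text))
        for end in range(start + 1, last_end + 1):
            piece = text[start:end]
            if any(ch in skip for ch in piece):
                # Every longer substring from this start also contains
                # the symbol, so stop extending.
                break
            out.append(piece)
    return out


candidates = dictionary_search_strings("御簾納候補の票が伸び悩み、")
print(len(candidates))  # 68
```

The 12 non-symbol characters give 12 + 11 + 10 + 9 + 8 + 7 + 6 + 5 = 68 substrings of length 1 to 8, which matches the figure stated in the text.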

[0015] Next, the character type change point determination processing unit 6 cuts the input character string at the points where the character type changes, producing nine divided character strings as shown in FIG. 4(d). Here, each string enclosed between '|' marks is a divided character string. The unregistered word range determination processing unit 7 then recognizes that 『御』 is the leading part that disconnects the morpheme connection, and sets the unregistered word range containing this part to 『御簾納候補』. Since an unregistered word exists and the current processing position equals the start position of the unregistered word range, the unregistered word range assembling processing unit 9 examines the word registration table from the beginning of the unregistered word range and, because 『候補』 is a two-character noun, makes it an output morpheme. As a result, as shown in FIG. 4(e), the output for the unregistered word range is 『御簾納／候補』 (where 『御簾納』 is the unregistered word). Because part of the input has not yet been output, the character position 『の』 immediately after the end of the unregistered word range becomes the current processing position, and processing returns to the unregistered word range determination processing unit 7 (step 17 in FIG. 1). No other unregistered word exists (step 15 in FIG. 1), so the analysis range assembling processing unit 8 processes the remaining character string and selects all combinations of connectable morphemes. As a result, as shown in FIG. 4(f), 『の／票／が／伸び悩み』 and 『の／票／が／伸び悩／み』 are obtained. Finally, the morpheme / part-of-speech rearrangement processing unit 10 focuses on the part 『伸び悩み』, for which two morpheme division candidates exist, and applies the rule that when a noun bunsetsu and a verb bunsetsu both exist immediately before a comma, the division containing the verb takes priority. This makes 『の／票／が／伸び悩／み』 the first candidate and 『の／票／が／伸び悩み』 the second candidate. In this example no morpheme has multiple part-of-speech candidates, so the part-of-speech candidate rearrangement rule does not fire, and the result is output as shown in FIG. 4(g).
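Two of the steps above can be sketched in a few lines: splitting at character type change points, and enumerating every way to cover the remaining range with registered morphemes. This is a minimal sketch under stated assumptions, not the patent's implementation. The Unicode ranges in `char_type`, the function names, and the contents of `word_table` are all hypothetical (the text says FIG. 3(c) holds 9 entries but does not list them), and the inter-morpheme connection check and the rearrangement by unit 10 are not modeled.

```python
from itertools import groupby


def char_type(ch):
    """Classify a character by Unicode block (assumed, simplified ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "symbol"


def split_at_type_changes(text):
    """Cut the string wherever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]


def segmentations(text, word_table):
    """Return every way to cover text with entries of word_table."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in word_table:
            for rest in segmentations(text[end:], word_table):
                results.append([head] + rest)
    return results


pieces = split_at_type_changes("御簾納候補の票が伸び悩み、")
print("|".join(pieces), len(pieces))

# Hypothetical stand-in for the word registration table of FIG. 3(c).
word_table = {"候補", "の", "票", "が", "伸び悩み", "伸び悩", "み"}
for seg in segmentations("の票が伸び悩み", word_table):
    print("／".join(seg))
```

With these assumptions the split yields nine pieces, and the enumeration yields the two candidates 『の／票／が／伸び悩／み』 and 『の／票／が／伸び悩み』 described for FIG. 4(f).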

【００１６】[0016]

[Effects of the Invention] As described above, according to the present invention, unregistered words can be identified with high accuracy even when the input contains words that are not registered in the dictionaries, and morpheme division and part-of-speech assignment can be performed with high accuracy even when the analysis result contains multiple morpheme division candidates or multiple part-of-speech candidates.

【００１７】[0017]

[Brief description of drawings]

[FIG. 1] A functional block diagram of a morphological analysis processing system showing an embodiment of the present invention.

[FIG. 2] A diagram showing an example of the morpheme / part-of-speech rearrangement rules in the present invention.

[FIG. 3] An explanatory diagram showing a worked example of each process in FIG. 1.

[FIG. 4] An explanatory diagram, continuing from FIG. 3, showing a worked example of each process.

[FIG. 5] A detailed operation flowchart of the unregistered word range determination processing unit in FIG. 1.

[FIG. 6] A detailed operation flowchart of the analysis range assembling processing unit in FIG. 1.

[FIG. 7] A detailed operation flowchart of the unregistered word range assembling processing unit in FIG. 1.

[FIG. 8] A detailed operation flowchart of the morpheme / part-of-speech rearrangement processing unit in FIG. 1.

[Explanation of symbols]

1 Input processing unit
2 Dictionary search character string generation processing unit
3 Dictionary search processing unit
4 Inter-morpheme connection check processing unit
5 Word table registration processing unit
6 Character type change point determination processing unit
7 Unregistered word range determination processing unit
8 Analysis range assembling processing unit
9 Unregistered word range assembling processing unit
10 Morpheme / part-of-speech rearrangement processing unit
11 Output processing unit

Claims

[Claims]

1. A morphological analysis processing method that uses an independent word dictionary in which independent words such as nouns, verbs, and adjectives are registered, an adjunct dictionary in which particles, auxiliary verbs, inflectional endings, and the like are registered, and a morpheme connection dictionary that determines whether adjacent morphemes may be connected based on part-of-speech category and the number of characters in each morpheme, and in which an input character string is searched for in the independent word dictionary and the adjunct dictionary, parts of speech are assigned, and those morphemes in the search results whose connection is permitted by the morpheme connection dictionary are registered in a word registration table, the method comprising: a first step of dividing the character string based on character type change information to determine divided character strings; a second step of, when a word not registered in the independent word dictionary or the adjunct dictionary disconnects the connection of morphemes in the word registration table, determining the divided character string that contains the disconnection position; and a third step of identifying the unregistered word existing within the range of that divided character string based on the part-of-speech category, the character type, and the number of characters of the morphemes.

2. The morphological analysis processing method according to claim 1, further comprising a fourth step of, when a plurality of morpheme division candidates are generated for the same character string by the first step, or when a plurality of part-of-speech candidates are generated for the same morpheme by searching the independent word dictionary and the adjunct dictionary, rearranging the morpheme division candidates and the part-of-speech candidates in order of likelihood based on connection priority rules for the part-of-speech categories of adjacent morphemes.