JPH02253370A - Morpheme analyzing device

Info

Publication number: JPH02253370A
Application number: JP1074423A
Authority: JP
Inventors: Toshihiko Yokogawa; 横川　壽彦
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-03-27
Filing date: 1989-03-27
Publication date: 1990-10-12

Abstract

PURPOSE: To prevent morphological analysis from failing because of mistaken word breaks, by registering positions where a word does not break as patterns and using these patterns to divide text into word units. CONSTITUTION: When a language analysis system is started, a morphological analysis unit 2 having a word division function accesses a file 3 of patterns at which words do not break (non-word-end patterns), registers these patterns in a non-word-end pattern buffer 4, and divides the input into word units on the basis of the registered patterns. Non-word-end patterns that differ from text to text can thus be handled dynamically. As a result, when a natural language is analyzed, it can be divided easily and correctly, according to the type of text, into the words that serve as the units of analysis, and the efficiency of text analysis is greatly improved.

Description

[Detailed description of the invention]

Technical field

The present invention relates to a morphological analysis device for English and similar languages.

Prior art

In Japanese Patent Application Laid-Open No. 63-89975, in the process of extracting dictionary lookup units, only what the system designer had preset as a word break was treated as a word break; for example, the dictionary lookup delimiter was defined as any character other than a letter, a digit, an apostrophe, or a period.

In the above example, a word always breaks when, for instance, a left parenthesis "(" appears. It is to be expected, however, that the units that must be recognized as words differ from text to text. For example, in text such as a UNIX manual, () may be treated as part of a word, as in read() (the name of a function, the whole forming a noun). Such cases cannot be handled with word-break patterns preset by the system designer. Moreover, such patterns can be expected to differ at least by subject field, so it is impossible for the system designer to foresee them all and build them into the system.

To overcome this drawback, the prior art offers no remedy other than, for example, using a pre-editing function to mark, case by case, that a given part should be treated as a single word. Supplying that information in every case was very laborious.

At present, most so-called natural language analysis systems, which analyze natural language text, perform their analysis in units of words. If word breaks are recognized incorrectly, the analysis therefore does not succeed. In the prior art, only the word breaks initially set by the system designer are allowed. In any other case the user must, for example, insert an agreed sequence of symbols to indicate that a position is not a word break, which requires a great deal of effort. The conventions for word breaks, however, can be said to be fixed for each text. Thus, even within English, what should count as a word break usually differs between the text of a letter and the text of a manual.

Object

The present invention was made in view of the circumstances described above. Its object is to provide a morphological analysis device in which the user can freely set, according to the type of text, a plurality of symbol sequences indicating positions that are not word breaks, so that analysis no longer fails in the cases above because a word break was mistaken.

Configuration

To achieve the above object, the present invention is a morphological analysis device having at least a function of dividing an input sentence of a language into word units, characterized in that positions that do not constitute word breaks can be registered as patterns and the division into word units is performed using those patterns.

The invention is described below on the basis of an embodiment. FIG. 1 is a configuration diagram for explaining an embodiment of an English-Japanese machine translation device that includes a morphological analysis device according to the present invention. In the figure, 1 is an input device, 2 is a morphological analysis unit having a word division function, 3 is a non-word-end pattern file, 4 is a non-word-end pattern buffer, 5 is a language analysis unit, 6 is a conversion/generation unit, 7 is a dictionary and grammar, 8 is an output unit, 9 is a morphological analysis unit, 10 is a language processing unit, and 11 is a display unit.

First, when the language analysis system is started, it accesses the file 3 of patterns that do not constitute word breaks (hereinafter, non-word-end patterns) and registers them in the non-word-end pattern buffer 4 within the language analysis system. The name of the non-word-end pattern file can be given as a parameter when the machine translation system is started, so that any desired non-word-end pattern file can be used. When something like a translation editor is used, for example, the name can likewise be given as a startup parameter. Specifying the non-word-end pattern file can also be offered as a function within the translation editor. The non-word-end pattern buffer is provided, and the patterns are registered in it, to improve speed. It is of course also possible to provide no buffer and to access the file every time a pattern is checked.
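As an illustration of passing the pattern file name as a startup parameter, a minimal sketch follows. The option name `--non-word-end-patterns`, the program name, and the function `build_parser` are hypothetical; the patent does not specify any command-line syntax.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser for a hypothetical translation front end."""
    parser = argparse.ArgumentParser(
        prog="translate",  # hypothetical program name
        description="English-Japanese translation with user-defined word breaks",
    )
    # The file name is optional: without it, only the standard
    # word-break rule of the system would apply.
    parser.add_argument(
        "--non-word-end-patterns",
        metavar="FILE",
        default=None,
        help="text file with one non-word-end pattern per line",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("pattern file:", args.non_word_end_patterns)
```

A different pattern file can then be chosen per run, for example one for manuals and another for letters.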

The non-word-end pattern file 3 is organized as a set of records, each record holding one non-word-end pattern. With one record per pattern, as many patterns can be registered as there are records.

In the case of English, a non-word-end pattern is expressed as a sequence of characters, or of character strings, capable of representing letters, punctuation marks, spaces, line breaks, uppercase letters, lowercase letters, and so on.

Symbols that are visible as characters are written as they are. For symbols that form a class, such as the period (.) and the question mark (?), and for invisible codes, ¥ is followed by one character indicating the kind of symbol.

¥S: whitespace class, ¥C: uppercase class, ¥S: lowercase class, ¥t: tab, ¥n: line break, etc. To represent ¥ itself, write ¥¥.
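The escape notation can be sketched as a small parser that turns a pattern string into a list of per-character tests. This is an illustration only: the names `parse_pattern` and `matches_at` are hypothetical, and because the source prints ¥S for both the whitespace class and the lowercase class, the sketch assumes ¥s for lowercase.

```python
from typing import Callable, List

Matcher = Callable[[str], bool]

ESCAPE = "¥"

# Class escapes. "s" for lowercase is an assumption (see the lead-in).
CLASS_TESTS = {"S": str.isspace, "C": str.isupper, "s": str.islower}
# Escapes that stand for a single invisible character.
CONTROL_CHARS = {"t": "\t", "n": "\n"}


def literal(expected: str) -> Matcher:
    """Return a test that accepts exactly one given character."""
    return lambda ch: ch == expected


def parse_pattern(text: str) -> List[Matcher]:
    """Convert a pattern string into one test per pattern position."""
    matchers: List[Matcher] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESCAPE:
            matchers.append(literal(ch))  # visible symbols are used as-is
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling escape at end of pattern")
        code = text[i + 1]
        if code == ESCAPE:
            matchers.append(literal(ESCAPE))  # doubled escape is a literal
        elif code in CLASS_TESTS:
            matchers.append(CLASS_TESTS[code])
        elif code in CONTROL_CHARS:
            matchers.append(literal(CONTROL_CHARS[code]))
        else:
            raise ValueError(f"unknown escape: {ESCAPE}{code}")
        i += 2
    return matchers


def matches_at(matchers: List[Matcher], text: str, pos: int) -> bool:
    """Check whether the parsed pattern matches text starting at pos."""
    if pos + len(matchers) > len(text):
        return False
    return all(test(text[pos + k]) for k, test in enumerate(matchers))
```

For instance, the pattern `(¥S)` would match an opening parenthesis, any one whitespace character, and a closing parenthesis.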

One record of the non-word-end pattern file has the following format.

non-word-end pattern (line break)

An example of non-word-end patterns is shown in FIG. 4.

FIG. 2 shows a flowchart of the word division process. Each step is described below.

Step 1: On startup, the language analysis system reads the non-word-end pattern file one line at a time and registers the part up to the line break as a non-word-end pattern. A null code is placed at the end of each non-word-end pattern. This is repeated until the end of the non-word-end pattern file.
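The registration step can be sketched as follows. The function names are hypothetical, and skipping empty lines is an assumption of this sketch rather than something the text states; the trailing null code follows the description above.

```python
from typing import Iterable, List

NULL = "\0"  # null code appended to mark the end of each pattern


def register_patterns(lines: Iterable[str]) -> List[str]:
    """Build the pattern buffer from the lines of a pattern file."""
    buffer: List[str] = []
    for line in lines:
        pattern = line.rstrip("\n")  # the part up to the line break
        if pattern:  # assumption: empty lines carry no pattern
            buffer.append(pattern + NULL)
    return buffer


def load_pattern_file(path: str) -> List[str]:
    """Read a pattern file line by line until its end."""
    with open(path, encoding="utf-8") as f:
        return register_patterns(f)
```

In Python the null terminator is not needed to find the end of a string; it is kept here only to mirror the buffer layout the text describes.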

The morphological analysis unit 2 holds standard patterns of word breaks for dividing text into word units. These can be regarded as the patterns of ordinary English. In the system of the present invention, the word-end pattern is that when a character that is neither a letter nor a digit appears, the position immediately before it is regarded as a word break.
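The standard rule can be written as a one-line predicate. The sketch below assumes ASCII letters and digits, and the names `is_word_end_char` and `break_positions` are hypothetical.

```python
from typing import List


def is_word_end_char(ch: str) -> bool:
    """True when ch is neither an ASCII letter nor an ASCII digit."""
    return not (ch.isascii() and ch.isalnum())


def break_positions(text: str) -> List[int]:
    """Indices of characters before which the standard rule breaks a word."""
    return [i for i, ch in enumerate(text) if is_word_end_char(ch)]
```

Applied to `read() uses`, the rule marks a break before `(`, before `)`, and before the space, which is exactly the behavior that splits a function name such as read() apart.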

Non-word-end patterns therefore act as suppressors of these word-end patterns.

Steps 2, 3: On receiving input from the input means 1 into the input buffer, the word division process stores the current character position in the input buffer, takes out characters one at a time, and checks whether each is equal to the word-end pattern (or belongs to the same class).

Step [number illegible]: A character that matches the word-end pattern is processed according to the non-word-end pattern check flow shown in FIG. 3.

Step [number illegible]: If none of the non-word-end patterns matches, the non-word-end pattern check has failed, so the code indicating a word end is sent to the language processing unit.

Steps 6, 7: Then, if the word-end pattern is not of the whitespace class, the character itself is sent to the language processing unit, followed once more by the code indicating a word end.

Step [number illegible]: If the character in step 2 does not match the word-end pattern, it is sent as is to the language analysis unit and processing advances to the next character.

Next, FIG. 3 shows a flowchart of the non-word-end pattern check.

Steps 1 to 7: The character is checked to see whether it is equal to the first character of a non-word-end pattern (or belongs to the same class).

Steps 8, 10: If they are equal, the next character of that non-word-end pattern and the next character in the input buffer are taken out and compared in the same way.

Step [number illegible]: The check continues until the null code that marks the end of one non-word-end pattern is reached. Step 11: If the check succeeds, the characters in the input buffer up to that point are sent to the language analysis unit.

Step 12: The character pointer is then advanced to the position up to which characters were sent to the analysis unit. If the check fails, the same check is carried out with the next non-word-end pattern, and so on for all patterns.

In this way the pointer to the processed characters is advanced by one, and the same processing is repeated until the input is exhausted.
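The two flows described above can be sketched together as follows. This is not the patented implementation, only a minimal reading of the steps under stated assumptions: patterns are plain literal strings (no class escapes), letters and digits are ASCII, the word-end code is a line break, and the patterns `()`, `_`, and `[]` are hypothetical stand-ins, since FIG. 4 is not reproduced in the text.

```python
from typing import List, Sequence

WORD_END = "\n"  # code sent downstream to mark the end of a word


def is_word_end_char(ch: str) -> bool:
    """Standard rule: anything that is not an ASCII letter or digit."""
    return not (ch.isascii() and ch.isalnum())


def match_non_word_end(text: str, pos: int, patterns: Sequence[str]) -> int:
    """Try each pattern at pos; return the matched length, or 0 if none match."""
    for pattern in patterns:
        if pattern and text.startswith(pattern, pos):
            return len(pattern)
    return 0


def segment(text: str, patterns: Sequence[str]) -> str:
    """Return the character stream with word-end codes inserted."""
    out: List[str] = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if not is_word_end_char(ch):
            out.append(ch)  # ordinary character: pass through
            pos += 1
            continue
        matched = match_non_word_end(text, pos, patterns)
        if matched:
            # The break is suppressed: pass the matched span through
            # and advance the pointer past it.
            out.append(text[pos:pos + matched])
            pos += matched
            continue
        out.append(WORD_END)  # a real word break
        if not ch.isspace():
            out.append(ch)  # a non-blank delimiter is a word of its own
            out.append(WORD_END)
        pos += 1
    return "".join(out)


def words(text: str, patterns: Sequence[str]) -> List[str]:
    """Split the stream on the word-end code, dropping empty items."""
    return [w for w in segment(text, patterns).split(WORD_END) if w]
```

With the three hypothetical patterns, `words("The function read() uses the read_buf[] buffer.", ["()", "_", "[]"])` keeps `read()` and `read_buf[]` whole, while an empty pattern list splits `read()` into `read`, `(`, and `)`.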

In the analysis system of the present invention, the code indicating the end of a word is set to the line break code. The language processing unit 10 is therefore configured to treat one line as one word when performing dictionary lookup, syntactic analysis, and so on. This is a convention to be agreed as appropriate between the device of the present invention and the subsequent language analysis unit 5, and is not limited to this choice.

For example, suppose that non-word-end patterns such as those in FIG. 4 are registered and that the sentence

The function read() uses the read_buf[] buffer.

is input.

The () immediately after read, the _ in read_buf, and the [] immediately after it all match the word-end pattern, but they also match non-word-end patterns, so they are recognized as not being word breaks. The language analysis unit 5 therefore receives

The / function / read() / uses / the / read_buf[] / buffer / .

with each item recognized as one word. Dictionary lookup, syntactic analysis, and so on are then performed, and the correct translation 「関数read()は、read_buf[]バッファを用いる。」 ("The function read() uses the read_buf[] buffer.") is obtained.

Conventionally, the word would be broken after read, so read would be looked up in the dictionary as a verb (in the past tense), resulting in an incorrect analysis and translation such as 「関数は読んだ() ...」 (in which read is taken as the verb "read" rather than as part of a function name).

When this is implemented as a function within a translation editor, it can be realized easily by clearing the existing contents of the non-word-end buffer each time a non-word-end pattern file is specified and then registering the new patterns in the buffer.
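A minimal sketch of such a reloadable buffer follows. The class name `NonWordEndBuffer` and its methods are hypothetical, and patterns are treated as plain strings, one per line.

```python
from typing import Iterable, List, Tuple


class NonWordEndBuffer:
    """In-memory store of non-word-end patterns that can be replaced at run time."""

    def __init__(self) -> None:
        self._patterns: List[str] = []

    def load(self, lines: Iterable[str]) -> None:
        """Clear the previous contents, then register the new patterns."""
        self._patterns.clear()
        for line in lines:
            pattern = line.rstrip("\n")
            if pattern:  # assumption: empty lines carry no pattern
                self._patterns.append(pattern)

    @property
    def patterns(self) -> Tuple[str, ...]:
        """Read-only view of the currently registered patterns."""
        return tuple(self._patterns)
```

Calling `load` with the lines of a newly chosen file leaves only that file's patterns in effect, so an editor session can switch between, say, a manual pattern set and a letter pattern set without restarting.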

In the embodiment of the present invention, the non-word-end pattern file is a text file, and it is assumed that changes to it are made with a general-purpose editor or the like. Because the file is a text file, however, it is also easy to arrange for non-word-end patterns to be registered and changed from a translation editor or the like.

It is also possible to organize the non-word-end pattern file in a form that is convenient for the language analysis system and to have it modified directly by a translation editor (or by a non-word-end pattern editor designed for the structure of the patterns).

Although the embodiment takes an English analysis device as its example, the invention can easily be applied, by suitably choosing the character classes and the like that may be registered in non-word-end patterns, to the analysis of French and other languages in which words are written separated by spaces and the like. In the analysis of Japanese as well, it is very easy to apply to the handling of parentheses and similar symbols.

Effect

As is clear from the above description, the present invention can deal dynamically with non-word-end patterns that differ from text to text. In natural language analysis, text can therefore easily be divided correctly, according to the type of text, into words, which are the units of language analysis.

The efficiency of text analysis is thus greatly improved.

Furthermore, it is no longer necessary to edit sentences, by pre-editing or the like, to make them conform to the word patterns the system requires, so the preprocessing for natural language analysis and machine translation is reduced. Since patterns created once remain valid for texts of the same kind, the benefit is even greater for machine translation of large volumes of similar text and for ongoing translation work.

[Brief explanation of drawings]

FIG. 1 is a configuration diagram for explaining an embodiment of an English-Japanese machine translation device that includes a morphological analysis device according to the present invention. FIG. 2 is a flowchart of the word division process. FIG. 3 is a flowchart of the non-word-end pattern check. FIG. 4 is a diagram showing an example of non-word-end patterns.

1: input device, 2: morphological analysis unit having a word division function, 3: non-word-end pattern file, 4: non-word-end pattern buffer, 5: language analysis unit, 6: conversion/generation unit, 7: dictionary and grammar, 8: output unit, 9: morphological analysis unit, 10: language processing unit, 11: display unit.

Claims

[Claims]

1. A morphological analysis device having at least a function of dividing an input sentence of a language into word units, characterized in that positions that do not constitute word breaks can be registered as patterns, and the division into word units is performed using those patterns.