JPS6366672A

JPS6366672A - Unknown word processing system for morpheme analysis of mixture of kanji and kana

Info

Publication number: JPS6366672A
Application number: JP61211047A
Authority: JP
Inventors: Katsuhiko Fujita; 克彦藤田; Satoshi Okugawa; 奥川　聡
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-09-08
Filing date: 1986-09-08
Publication date: 1988-03-25

Abstract

PURPOSE: To detect unknown words accurately by using a clause-end admissibility table that indicates whether a word can appear at the end of a clause (bunsetsu). CONSTITUTION: A word extraction processing part 2 uses a word dictionary 2a, a part-of-speech classification table 2b, a conjugation-ending table 2c, a connection weight matrix table 2d and a clause-end admissibility table 2e to produce a candidate word list. If the candidate word list is not empty, a word selection processing part 3 pushes the word immediately preceding that list onto a word list. If the candidate word list is empty, backtracking is performed, and if no solution is obtained even after backtracking, unknown word processing is carried out. When the first two or three characters are kanji and a particle is detected immediately after the kanji portion, the entire kanji portion is treated as an unknown word. If no particle is detected, a dictionary lookup is performed from the rear of the kanji portion, and the part preceding the position where the lookup succeeds is treated as the unknown word.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to an unknown word processing method for morphological analysis of mixed kanji-kana text. It is applicable to Japanese-English machine translation, OCR (optical character readers), voice-input word processors, and the like.

Prior Art

A Japanese morphological analysis system that performs morphological analysis of Japanese sentences, retrieves through a dictionary system the information needed for Japanese syntactic analysis and similar processing, and outputs the result to a syntactic analysis system has already been proposed by the Electrotechnical Laboratory of the Agency of Industrial Science and Technology ("Research on a Rapid Reporting System for Japanese-English Scientific and Technical Literature: Japanese Morphological Analysis System Manual (Symbolics 3600 version), Version 1", January 1984). That system is dictionary-driven: it checks the connection relations between character strings starting from the left end of the sentence and, taking the sentence as the unit, outputs all possible solutions (a multiple-path method). It treats everything from the first character of the unknown-word portion up to the character immediately before a change of character type as an unknown word.

In morphological analysis of mixed kanji-kana text, it is impossible to register every word in the morphological analysis dictionary (word dictionary), so unknown (unregistered) words inevitably occur, and it is important to identify them accurately when they do. The system described above, however, treats everything from the first character of the unknown-word portion up to the character immediately before a change of character type as an unknown word. Because it decides on character type alone, it cannot identify unknown words accurately. For example, suppose 「現在」 ("now") is not registered and the input sentence is 「現在話している。」 ("[I] am speaking now."). The above system takes everything up to 「現在話」 as the unknown word and fails to identify the actual unknown word 「現在」.
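The character-type rule of the earlier system, and its failure on the 「現在話している。」 example, can be sketched as follows. This is only an illustration: the function names `char_type` and `baseline_unknown` and the Unicode ranges used to classify characters are assumptions, not taken from the cited system.

```python
def char_type(ch):
    """Classify one character by script (assumed Unicode ranges)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isdigit():
        return "digit"
    return "other"


def baseline_unknown(text):
    """Return the prefix that ends just before the first change of character type."""
    if not text:
        return ""
    first = char_type(text[0])
    end = 1
    while end < len(text) and char_type(text[end]) == first:
        end += 1
    return text[:end]


# The kanji run 現在話 is taken as a whole, although only 現在 is unregistered.
print(baseline_unknown("現在話している。"))  # 現在話
```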

Purpose

The present invention was made in view of the circumstances described above. Its purpose is to identify unknown words accurately; in the example above, to find only 「現在」 as the unknown word.

Constitution

To achieve the above purpose, the present invention is characterized by using a clause-end admissibility table, which satisfies the condition that unknown words are limited to independent words, and by varying the unknown word processing according to conditions.

The invention is described below on the basis of an embodiment.

Fig. 1 is a block diagram of the morphological analysis processing to which the unknown-word determination method of the present invention is applied. In the figure, 1 is the analysis-target string creation unit, 2 is the word extraction unit, 3 is the word selection unit, and 4 is the confirmation unit.

The analysis-target string creation unit 1 removes from the input sentence everything up to the boundary of the word selected by the word selection unit 3, and takes the remaining string as the "analysis-target string". The word extraction unit 2 creates the "candidate word list" using the morphological analysis dictionary (word dictionary) 2a, the part-of-speech classification table 2b, the conjugation-ending table 2c, the connection weight matrix table 2d, and the clause-end admissibility table 2e.

If the "candidate word list" is not empty, the word selection unit 3 pushes the word immediately preceding that list onto the "word list". If it is empty, the unit backtracks, and if no solution is found even then, unknown word processing is performed. Next, an evaluation value is computed for each word in the "candidate word list" using an evaluation formula, and the word with the highest evaluation value becomes the first candidate. The confirmation unit 4 creates the "confirmed word list" from the "word list".
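The select-and-backtrack behaviour of the word selection unit can be sketched as a small depth-first segmenter. This is a minimal sketch under stated assumptions: the function name `segment` and the sample dictionaries are hypothetical, and because the source does not give the evaluation formula, word length is used here purely as a stand-in score. The tables 2b to 2e are not modelled.

```python
def segment(text, dictionary):
    """Depth-first segmentation with backtracking.

    Returns a list of words covering `text`, or None when no full
    segmentation exists (the point at which unknown word processing
    would take over).
    """
    if not text:
        return []
    # Candidate word list: dictionary words that start the remaining string.
    candidates = [w for w in dictionary if text.startswith(w)]
    # Stand-in evaluation (assumption): prefer longer words, ties by spelling.
    candidates.sort(key=lambda w: (len(w), w), reverse=True)
    for word in candidates:
        rest = segment(text[len(word):], dictionary)
        if rest is not None:
            return [word] + rest
        # Otherwise backtrack and try the next candidate.
    return None  # empty candidate list, or every candidate failed


print(segment("東京は大きい", {"東京", "は", "大きい"}))  # ['東京', 'は', '大きい']
print(segment("昨日食べた", {"食べた"}))                  # None
```

When `segment` returns None for the string starting at some position, that position plays the role of the pointer P from which unknown word processing begins.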

Fig. 2 is a flowchart illustrating an example of the unknown word processing according to the present invention. The specific method is described below.

i) Word extraction processing 3 yields a head pointer P to the portion containing the unknown word. The word immediately before pointer P is always a word that can end a clause (this is where the clause-end admissibility table is used).

ii) The following processing is applied to the part of the input string from P onward.

(a) When the first two to three characters are kanji: if a particle is detected immediately after the kanji portion, the entire kanji portion is treated as the unknown word. If what follows is not a particle, a dictionary lookup is performed from the rear of the kanji portion, and everything before the position where the lookup succeeds is treated as the unknown word.

(Example 1) Suppose the string from P onward is 「東京は大きい」 ("Tokyo is big") and the dictionary does not contain 「東京」. Then 「は」 is detected as a particle and the unknown word is 「東京」. The unknown word is thus detected accurately.

(Example 2) Suppose the string from P onward is 「昨日食べた」 ("[I] ate yesterday") and the dictionary does not contain 「昨日」. No particle is found immediately after the kanji portion, so a dictionary lookup is performed from the rear of the kanji portion. As a result 「食べた」 is found, the unknown word is 「昨日」, and the unknown word is detected accurately.
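Rule (a) and Examples 1 and 2 can be sketched as below. This is an illustration under assumptions, not the patent's implementation: the particle set `PARTICLES`, the helper names, the restriction to single-character particles, and the choice to return None when neither branch succeeds are all hypothetical. "From the rear of the kanji portion" is read here as trying start positions from the last kanji back toward the head.

```python
# Hypothetical, deliberately small set of single-character particles.
PARTICLES = {"は", "が", "を", "に", "で", "と", "も", "の", "へ"}


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def kanji_run_length(text):
    """Number of consecutive kanji at the head of `text`."""
    n = 0
    while n < len(text) and is_kanji(text[n]):
        n += 1
    return n


def rule_a(text, dictionary, particles=PARTICLES):
    """Unknown word for a leading run of two or three kanji, else None."""
    n = kanji_run_length(text)
    if n not in (2, 3):
        return None
    # A particle right after the kanji portion: the whole portion is unknown.
    if text[n:n + 1] in particles:
        return text[:n]
    # Otherwise look up from the rear of the kanji portion toward the head.
    for start in range(n - 1, 0, -1):
        if any(text.startswith(word, start) for word in dictionary):
            return text[:start]
    return None


print(rule_a("東京は大きい", {"大きい"}))  # 東京 (Example 1)
print(rule_a("昨日食べた", {"食べた"}))    # 昨日 (Example 2)
```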

(b) When the string begins with four or more consecutive kanji: first skip two characters from the head and perform a dictionary lookup; if it succeeds, the skipped characters are treated as the unknown word. If it fails, skip one more character at a time and repeat until the dictionary lookup succeeds.

(Example 3) Suppose the string from P onward is 「人工言語の研究」 ("research on artificial languages") and the dictionary does not contain 「人工」. The first two characters are skipped and the dictionary lookup starts from 「言」. As a result 「言語」 is found, the unknown word is 「人工」, and the unknown word is detected accurately.
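Rule (b) and Example 3 can be sketched in the same way. The function names and the sample dictionary are hypothetical, and returning None when no lookup ever succeeds is an assumption, since the source does not say what happens in that case.

```python
def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def kanji_run_length(text):
    """Number of consecutive kanji at the head of `text`."""
    n = 0
    while n < len(text) and is_kanji(text[n]):
        n += 1
    return n


def rule_b(text, dictionary):
    """Unknown word for a leading run of four or more kanji, else None."""
    if kanji_run_length(text) < 4:
        return None
    # Skip two characters first, then one more each time the lookup fails.
    for skip in range(2, len(text)):
        if any(text.startswith(word, skip) for word in dictionary):
            return text[:skip]  # the skipped characters are the unknown word
    return None


print(rule_b("人工言語の研究", {"言語", "の", "研究"}))  # 人工 (Example 3)
```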

(c) When the first character is katakana or a digit, everything up to the point where a different character type appears is treated as the unknown word.

(Example 4) Suppose the string from P onward is 「プログラム書法」 ("program writing style") and 「プログラム」 is not registered in the dictionary. The katakana run 「プログラム」 becomes the unknown word and is detected accurately.
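Rule (c) and Example 4 need only a character-type test, as the following sketch shows. The names `char_type` and `rule_c` and the Unicode ranges are assumptions made for illustration.

```python
def char_type(ch):
    """Classify one character by script (assumed Unicode ranges)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isdigit():
        return "digit"
    return "other"


def rule_c(text):
    """Unknown word when the string starts with katakana or a digit, else None."""
    if not text:
        return None
    first = char_type(text[0])
    if first not in ("katakana", "digit"):
        return None
    end = 1
    while end < len(text) and char_type(text[end]) == first:
        end += 1
    return text[:end]  # up to where a different character type appears


print(rule_c("プログラム書法"))  # プログラム (Example 4)
```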

Effect

As is clear from the above description, the present invention makes it possible to accurately identify words that are not registered in the dictionary, as shown in (Example 1) to (Example 4) above.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of the morphological analysis processing to which the present invention is applied, and Fig. 2 is a flowchart illustrating an example of the unknown-word determination method according to the present invention.

1: analysis-target string creation unit, 2: word extraction unit, 3: word selection unit, 4: confirmation unit

(Figures: Fig. 1, Fig. 2)

Procedural Amendment (voluntary)

December 17, 1986

To: Akio Kuroda, Commissioner of the Patent Office

3. Person making the amendment
Relationship to the case: Patent applicant
Address: 1-3-6 Nakamagome, Ota-ku, Tokyo
Representative: Hiroshi Hamada

5. Date of amendment order: Voluntary

7. Contents of the amendment

(1) The claims of the specification are amended as shown in the attached sheet.

(2) On page 2, lines 10 to 11 of the specification, 「辞書システムにより詮索し、」 is corrected to 「辞書システムにより検索し、」 ("retrieves through a dictionary system,").

(3) On page 4, line 4, 「文節未可否表」 is corrected to 「文節末可否表」 ("clause-end admissibility table").

(4) On page 5, line 1, 「バックトラックを作成し、」 ("creates a backtrack,") is corrected to 「バックトラックを行ない、」 ("backtracks,").

(5) On page 5, line 2, 「次に、」 ("Next,") is corrected to 「この未知語処理については後で説明する。"候補単語リスト"が空でなかった場合には、次に、」 ("This unknown word processing is described later. If the "candidate word list" was not empty, then next,").

(6) On page 5, line 13, 「（文節未可否表の利用）」 is corrected to 「（文節末可否表の利用）」 ("(use of the clause-end admissibility table)").

Claims (as amended)

(1) An unknown word processing method in morphological analysis of mixed kanji-kana text, characterized by using a clause-end admissibility table that indicates whether a word can appear at the end of a clause, and by varying the unknown word processing according to conditions.

(2) The unknown word processing method in morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first two to three characters of the unknown-word portion are kanji.

(3) The unknown word processing method in morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first four or more characters of the unknown-word portion are kanji.

(4) The unknown word processing method in morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first character of the unknown-word portion is katakana or a digit.

Claims

[Claims]

(1) An unknown word processing method in morphological analysis of mixed kanji-kana text, characterized by using a clause-end admissibility table, which satisfies the condition that unknown words are limited to independent words, and by varying the unknown word processing according to conditions.

(2) The unknown word processing method in morphological analysis containing kanji and kana as set forth in claim (1), wherein the condition is that the first two to three characters of the unknown word part are kanji.

(3) The unknown word processing method in morphological analysis containing kanji and kana as set forth in claim (1), wherein the condition is that the first four or more characters of the unknown word part are kanji.

(4) The unknown word processing method in morphological analysis containing kanji and kana as set forth in claim (1), wherein the condition is that the first character type of the unknown word portion is katakana or a number.