JPS6366672A - Unknown word processing system for morpheme analysis of mixture of kanji and kana - Google Patents

Unknown word processing system for morpheme analysis of mixture of kanji and kana

Info

Publication number
JPS6366672A
JPS6366672A JP61211047A JP21104786A JPS6366672A JP S6366672 A JPS6366672 A JP S6366672A JP 61211047 A JP61211047 A JP 61211047A JP 21104786 A JP21104786 A JP 21104786A JP S6366672 A JPS6366672 A JP S6366672A
Authority
JP
Japan
Prior art keywords
word
kanji
unknown
unknown word
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP61211047A
Other languages
Japanese (ja)
Inventor
Katsuhiko Fujita
克彦 藤田
Satoshi Okugawa
奥川 聡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to JP61211047A
Publication of JPS6366672A

Landscapes

  • Machine Translation (AREA)

Abstract

PURPOSE: To detect unknown words accurately by using a clause-end admissibility table that indicates whether a word can appear at the end of a clause (bunsetsu). CONSTITUTION: A word extraction processing part 2 uses a word dictionary 2a, a part-of-speech classification table 2b, an inflectional ending table 2c, a connection weight matrix table 2d and a clause-end admissibility table 2e to produce a candidate word list. If the candidate word list is not empty, a word selection processing part 3 pushes the word immediately preceding that list onto the word list. If the candidate word list is empty, backtracking is performed, and if no solution is found even after backtracking, unknown word processing is carried out. When the first two or three characters are kanji and a particle is detected right after the kanji part, the entire kanji part is treated as the unknown word. If no particle is detected, dictionary lookup is performed from the rear of the kanji part, and the portion preceding the point where the lookup succeeds is treated as the unknown word.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to an unknown word processing method for morphological analysis of mixed kanji-kana text, and is applicable to Japanese-English machine translation, OCR (optical character readers), voice word processors, and the like.

Prior Art

A Japanese morphological analysis system that performs morphological analysis of Japanese sentences, retrieves through a dictionary system the information needed for Japanese syntactic analysis and the like, and outputs the result to a syntactic analysis system has already been proposed by the Electrotechnical Laboratory of the Agency of Industrial Science and Technology ("Research on a Rapid-Reporting System for Japanese-English Scientific and Technical Literature: Japanese Morphological Analysis System Manual (Symbolics 3600 edition), Version 1", January 1984). That system is dictionary-driven: it checks the connection relations between character strings from the left end of the sentence and, taking the sentence as the unit, outputs all possible solutions (the Multiple-Path method). It treats everything from the first character of the unknown word portion to the character immediately before a change in character type as the unknown word.

In morphological analysis of mixed kanji-kana text, it is impossible to register every word in the morphological analysis dictionary (word dictionary), so unknown (unregistered) words inevitably occur, and what matters is how accurately they can be identified. Because the above system treats everything from the first character of the unknown word portion up to the character immediately before a change in character type as the unknown word, it judges unknown words by character type alone and cannot identify them accurately. For example, if the word "現在" ("now") is unregistered and the input sentence is "現在話している。" ("[I] am speaking now."), that system takes everything up to "現在話" as the unknown word and fails to isolate the unknown word "現在".
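The limitation of the character-type-only approach can be reproduced with a small sketch. The function names and the Unicode ranges used to classify characters are illustrative assumptions; they are not taken from the cited system.

```python
def char_type(ch: str) -> str:
    """Coarse character class. The Unicode ranges are an illustrative assumption."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isdigit():
        return "digit"
    return "other"


def unknown_by_char_type(text: str) -> str:
    """Return the leading run of characters that share the first character's type."""
    if not text:
        return ""
    first = char_type(text[0])
    end = 1
    while end < len(text) and char_type(text[end]) == first:
        end += 1
    return text[:end]


# The kanji run does not stop after "現在", so "話" is swallowed as well.
print(unknown_by_char_type("現在話している。"))  # 現在話
```

The split point is decided before any dictionary lookup takes place, which is why the registered verb stem "話" ends up inside the unknown word.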

Purpose

The present invention was made in view of the above circumstances. Its purpose is to identify unknown words accurately; in the above example, to pick out only "現在" as the unknown word.

Constitution

To achieve the above purpose, the present invention is characterized in that it uses a clause-end admissibility table (文節末可否表) that satisfies the condition that unknown words are limited to independent words, and varies the unknown word processing according to conditions.

The invention is described below on the basis of an embodiment.

Fig. 1 is a block diagram of the morphological analysis processing to which the unknown-word-part determination method of the present invention is applied. In the figure, 1 is an analysis-target string creation unit, 2 is a word extraction unit, 3 is a word selection unit, and 4 is a confirmation processing unit. The analysis-target string creation unit 1 removes from the input sentence everything up to the boundary of the word selected by the word selection unit 3 and takes the remaining string as the "analysis-target string". The word extraction unit 2 creates a "candidate word list" using a morphological analysis dictionary (word dictionary) 2a, a part-of-speech classification table 2b, an inflectional ending table 2c, a connection weight matrix table 2d, and a clause-end admissibility table 2e.

If the "candidate word list" is not empty, the word selection unit 3 pushes the word immediately preceding that list onto the "word list". If it is empty, backtracking is performed, and if no solution is found even then, unknown word processing is carried out. Next (when the candidate word list is not empty), an evaluation value is computed for each word in the "candidate word list" using an evaluation formula, and the word with the highest evaluation value is taken as the first candidate. The confirmation processing unit 4 creates a "confirmed word list" from the "word list".

Fig. 2 is a flowchart illustrating an example of unknown word processing according to the present invention. The specific method is described below.

i) Word extraction processing 3 yields a pointer P to the head of the portion containing the unknown word. At this point, the word before pointer P is always a word that can end a clause (use of the clause-end admissibility table).

ii) The following processing is applied to the part of the input string from P onward.

(a) When the first two or three characters are kanji and a particle is detected immediately after the kanji part, the entire kanji part is treated as the unknown word.

If no particle follows, dictionary lookup is performed from the rear of the kanji part, and the portion preceding the point where the lookup succeeds is treated as the unknown word.

(Example 1) Let the string from P onward be "東京は大きい" ("Tokyo is big"), and assume "東京" is not in the dictionary. "は" is detected as a particle, so the unknown word is "東京". In this way the unknown word is detected accurately.

(Example 2) Let the string from P onward be "昨日食べた" ("ate yesterday"), and assume "昨日" is not in the dictionary. No particle is found immediately after the kanji part, so dictionary lookup is performed from the rear of the kanji part. As a result "食べた" is found, the unknown word is "昨日", and the unknown word is detected accurately.

(b) When four or more consecutive kanji appear at the head, first skip two characters from the head and perform a dictionary lookup; if it succeeds, the skipped characters are treated as the unknown word. Otherwise, continue skipping one more character at a time until a dictionary lookup succeeds.

(Example 3) Let the string from P onward be "人工言語の研究" ("research on artificial languages"), and assume "人工" is not in the dictionary. The first two characters are skipped and dictionary lookup starts from "言". As a result "言語" is found, the unknown word is "人工", and the unknown word is detected accurately.

(c) When the first character is katakana or a digit, everything up to the point where a different character type appears is treated as the unknown word.

(Example 4) Let the string from P onward be "プログラム書法" ("programming style"), and assume "プログラム" is not registered in the dictionary. The katakana run "プログラム" becomes the unknown word and is detected accurately.

Effects

As is clear from the above description, the present invention makes it possible to accurately identify words that are not registered in the dictionary, as shown in (Example 1) to (Example 4) above.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of the morphological analysis processing to which the present invention is applied, and Fig. 2 is a flowchart illustrating an example of the unknown-word-part determination method according to the present invention.

1: analysis-target string creation unit; 2: word extraction unit; 3: word selection unit; 4: confirmation processing unit.

[Fig. 1] [Fig. 2]

Written Amendment (voluntary)

December 17, 1986 (Showa 61)
To: Akio Kuroda (黒田明雄), Commissioner of the Japan Patent Office

3. Person making the amendment
Relationship to the case: patent applicant
Address: 1-3-6 Nakamagome, Ota-ku, Tokyo
Representative: Hiroshi Hamada (浜田広)

5. Date of amendment order: voluntary

7. Contents of the amendment
(1) The claims of the specification are amended as shown in the attached sheet.
(2) "辞書システムにより詮索し、" on page 2, lines 10 to 11 of the specification is corrected to "辞書システムにより検索し、" ("retrieves through the dictionary system").
(3) "文節未可否表" on page 4, line 4 is corrected to "文節末可否表" ("clause-end admissibility table").
(4) "バックトラックを作成し、" on page 5, line 1 is corrected to "バックトラックを行ない、" ("performs backtracking").
(5) "次に、" ("Next,") on page 5, line 2 is corrected to "この未知語処理については後で説明する。候補単語リストが空でなかった場合には、次に、" ("This unknown word processing is described later. When the candidate word list is not empty, next,").
(6) "(文節未可否表の利用)" on page 5, line 13 is corrected to "(文節末可否表の利用)" ("(use of the clause-end admissibility table)").

Claims (as amended)
(1) An unknown word processing method for morphological analysis of mixed kanji-kana text, characterized by using a clause-end admissibility table that indicates whether a word can appear at the end of a clause, and by varying the unknown word processing according to conditions.
(2) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first two to three characters of the unknown word portion are kanji.
(3) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first four or more characters of the unknown word portion are kanji.
(4) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first character of the unknown word portion is katakana or a digit.

Claims (4)

[Claims]
(1) An unknown word processing method for morphological analysis of mixed kanji-kana text, characterized by using a clause-end admissibility table that satisfies the condition that unknown words are limited to independent words, and by varying the unknown word processing according to conditions.
(2) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first two to three characters of the unknown word portion are kanji.
(3) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first four or more characters of the unknown word portion are kanji.
(4) The unknown word processing method for morphological analysis of mixed kanji-kana text according to claim (1), wherein the condition is that the first character of the unknown word portion is katakana or a digit.
JP61211047A 1986-09-08 1986-09-08 Unknown word processing system for morpheme analysis of mixture of kanji and kana Pending JPS6366672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP61211047A JPS6366672A (en) 1986-09-08 1986-09-08 Unknown word processing system for morpheme analysis of mixture of kanji and kana

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP61211047A JPS6366672A (en) 1986-09-08 1986-09-08 Unknown word processing system for morpheme analysis of mixture of kanji and kana

Publications (1)

Publication Number Publication Date
JPS6366672A true JPS6366672A (en) 1988-03-25

Family

ID=16599501

Family Applications (1)

Application Number Title Priority Date Filing Date
JP61211047A Pending JPS6366672A (en) 1986-09-08 1986-09-08 Unknown word processing system for morpheme analysis of mixture of kanji and kana

Country Status (1)

Country Link
JP (1) JPS6366672A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02155073A (en) * 1988-12-07 1990-06-14 Matsushita Electric Ind Co Ltd Unknown word qualifying device

Similar Documents

Publication Publication Date Title
EP0971294A2 (en) Method and apparatus for automated search and retrieval processing
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JPH0724056B2 (en) Computer-based morphological text analysis method
Avner et al. Identifying translationese at the word and sub-word level
Van Halteren et al. Linguistic Exploitation of Syntactic Databases: The Use of the Nijmegen LDB Program
von Glasersfeld et al. The multistore parser for hierarchical syntactic structures
JPS6366672A (en) Unknown word processing system for morpheme analysis of mixture of kanji and kana
JP2828692B2 (en) Information retrieval device
JPS6118072A (en) Automatic register system of dictionary data
Jaruskulchai An automatic indexing for Thai text retrieval
JPH0578058B2 (en)
Petrovčič et al. The New Chinese Corpus of Literary Texts Litchi
JPH06259423A (en) Summary automatically generating system
JP2958044B2 (en) Kana-Kanji conversion method and device
JPS60193074A (en) Analyzer of japanese language
JPH0561902A (en) Mechanical translation system
Craven Automatic Recognition of Sentence Dependency Structures.
JPH0785040A (en) Inscription nonuniformity detecting method and kana/ kanji converting method
JPH06149790A (en) Document processor
JPS59103136A (en) Kana (japanese syllabary)/kanji (chinese character) processor
JPS6368972A (en) Unregistered word processing system
JPS6395573A (en) Method for processing unknown word in analysis of japanese sentence morpheme
JPH1115846A (en) Information retrieval device and recording medium
Lee et al. A Greek-Chinese Interlinear of the New Testament Gospels and its Applications.
Segert et al. A Computer Program for Analysis of Words According to Their Meaning (Conceptual analysis of Latin equivalents for the comparative dictionary of Semitic languages)