JPH01156866A - Processing system for Japanese sentence - Google Patents
Processing system for Japanese sentence
Info
- Publication number
- JPH01156866A
- Authority
- JP
- Japan
- Prior art keywords
- analysis
- character string
- japanese
- unable
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
[Detailed Description of the Invention]
[Field of Industrial Application]
The present invention relates to a method for analyzing Japanese sentences by computer, and more specifically to a Japanese sentence processing method that improves analysis accuracy even for Japanese sentences containing words not in the dictionary, typographical errors, and the like.
FIG. 5 shows a schematic block diagram of a conventional method for analyzing Japanese sentences by computer. In FIG. 5, 10 is a data input unit, 20 is a morphological analysis unit, and 50 is an output editing unit.
The data input unit 10 consists of a kanji OCR, a pen-touch device, a tablet, a keyboard, or the like, and reads the Japanese character string to be analyzed. The morphological analysis unit 20 performs morphological analysis on the read character string using a grammar dictionary and a word dictionary prepared in advance, carrying out word recognition, phrase (bunsetsu) segmentation, and so on. The output editing unit 50 edits and outputs the Japanese character string according to the processing result of the morphological analysis unit 20. Here, for a Japanese sentence containing a word not in the dictionary or a typographical error, the morphological analysis unit 20 detects the entire phrase, which is the unit of analysis, as an unanalyzable character string, and the output editing unit 50 outputs that unanalyzable character string as it is, phrase by phrase.
In the prior art, when a Japanese sentence analyzed by computer contains an error such as an input mistake, or a word not in the dictionary, the effect cannot be contained, and the phrase or even the whole sentence containing that part becomes an unanalyzable character string as it is. Consequently, in a speech output system that analyzes a Japanese sentence, recognizes it word by word, assigns readings, accents, and pauses, and reads it aloud with synthesized speech, even when only a single spot is actually unanalyzable, that spot cannot be pinpointed, so the reading of the phrase or the whole sentence goes wrong and the correct-reading rate falls.
Likewise, an automatic Japanese error detection system that exploits the fact that errors are likely to exist where analysis fails can only point out long character strings. When such a system is used, for example, to proofread newspaper articles, large amounts of data must be processed in a short time, and the extra work of finding the actual error within a long character string reduces the labor-saving effect.
An object of the present invention is to provide a Japanese sentence processing method that, even for a sentence containing words not in the dictionary or errors, narrows down and pinpoints the affected character string as far as possible, thereby minimizing its effect on the analysis of the sentence as a whole.
In a method of processing Japanese sentences by computer, the present invention provides, in addition to a morphological analysis unit that recognizes and divides the Japanese sentence to be processed into words, phrases, and so on, an unanalyzable character string processing unit that analyzes more deeply those character strings that cannot be analyzed because the sentence contains words not in the dictionary or errors.
The unanalyzable character string processing unit takes an unanalyzable character string detected by the morphological analysis unit and analyzes it from the beginning as far as it can be analyzed.
Then, if the first character of the part that still cannot be analyzed is katakana or alphabetic, the entire run of consecutive characters of that same character type is detected as the unanalyzable character string. For other character types (kanji, hiragana), analysis is retried starting from the character after that first character; if the rest can then be analyzed successfully, the single skipped character that was excluded from analysis is taken to be the unanalyzable part. If the following string still cannot be analyzed, the next leading character is skipped as well and analysis starts from the character after it. This is repeated until the rest can be analyzed or the end of the phrase or sentence is reached, and the skipped character string is detected as unanalyzable.
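The narrowing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the lexicon `LEXICON`, the connection rule `FORBIDDEN`, and the function names are all hypothetical, greedy longest-match stands in for real morphological analysis, and "the rest can be analyzed" is simplified to "some word matches at the next position". The two inputs are the example strings used in the embodiments.

```python
# Hypothetical toy lexicon; a real system would use full grammar and word dictionaries.
LEXICON = {"すべて", "分から", "ない", "こ", "た", "ばかり", "だ", "の", "ように", "軽い"}
# Hypothetical connection rule: "た" may not directly follow the verb form "こ".
FORBIDDEN = {("こ", "た")}


def char_type(ch):
    """Classify a character as katakana, ASCII alphabet, or other (kanji, hiragana, ...)."""
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def match_word(text, pos, prev):
    """Return the longest lexicon word at pos that may follow prev, or None."""
    for word in sorted(LEXICON, key=len, reverse=True):
        if text.startswith(word, pos) and (prev, word) not in FORBIDDEN:
            return word
    return None


def narrow(text):
    """Split text into (surface, analyzable) segments, narrowing the unanalyzable parts."""
    segments = []
    pos, prev = 0, None
    while pos < len(text):
        word = match_word(text, pos, prev)
        if word is not None:
            segments.append((word, True))
            pos, prev = pos + len(word), word
            continue
        # Analysis failed at pos: decide how much to mark as unanalyzable.
        kind = char_type(text[pos])
        end = pos + 1
        if kind in ("katakana", "alphabet"):
            # Take the whole run of the same character type.
            while end < len(text) and char_type(text[end]) == kind:
                end += 1
        else:
            # Skip one character at a time until analysis can resume.
            while (end < len(text) and char_type(text[end]) == "other"
                   and match_word(text, end, None) is None):
                end += 1
        segments.append((text[pos:end], False))
        pos, prev = end, None
    return segments


print(narrow("キャルビックのように"))
print(narrow("分からないこたばかりだ"))
```

In this sketch the katakana run is isolated as one unanalyzable segment, while for the hiragana case only the single character at which analysis fails is marked.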
In this way, further analysis is applied to the long unanalyzable character string detected by the morphological analysis unit, so the location of the error can be pinpointed and the unanalyzable portion can be narrowed down to a short character string.
An embodiment of the present invention will now be described with reference to the drawings.
FIG. 1 is a block diagram of one embodiment of the present invention, applied to a speech output system, and FIG. 2 shows a specific processing example keyed to the parts of FIG. 1.
In FIG. 1, 10 is a Japanese sentence data input unit, 20 is a morphological analysis unit, 30 is a reading assignment unit, 40 is an unanalyzable character string processing unit, 50 is an output editing unit, and 60 is a synthesized speech generation unit. The unanalyzable character string processing unit 40 is further divided into an unanalyzable character limiting unit 41, an unanalyzable character part-of-speech assignment unit 42, a guessed-reading (ate-yomi) assignment unit 43, and an accent/pause assignment unit 44.
Suppose now that the Japanese sentence (Japanese character string) 「キャルビックのように軽い」 ("light like Kyarubikku") is read from the data input unit 10 (■), and that 「キャルビック」 is a word not in the dictionary.
The morphological analysis unit 20 uses the grammar dictionary and word dictionary prepared in advance to divide 「キャルビックのように軽い」 into words and phrases. Because 「キャルビック」 is not in the dictionary, the phrase 「キャルビックのように」 is treated as an unanalyzable character string, while 「軽い」 is recognized as an adjective (■). For the word 「軽い」, which the morphological analysis unit 20 was able to recognize, the reading assignment unit 30 assigns a reading together with appropriate accent and pause information (■).
Meanwhile, the unanalyzable character string 「キャルビックのように」 is sent to the unanalyzable character string processing unit 40. First, the unanalyzable character limiting unit 41 narrows down the unanalyzable characters. In this case, when 「キャルビック」, which begins with a katakana character, is taken as a single unanalyzable unit, 「のように」 can be analyzed into the case particle 「の」 and the auxiliary verb 「ように」 (■). Next, so that accents and pauses can be assigned to the unanalyzable string as naturally as possible, the unanalyzable character part-of-speech assignment unit 42 gives the unanalyzable string 「キャルビック」 a suitable part of speech that can connect to the case particle 「の」 (here, common noun) (■), treating it as if it were a common noun found in the dictionary. The guessed-reading assignment unit 43 and the accent/pause assignment unit 44 then assign a suitable guessed reading (here the katakana reading, since the string is katakana) and accents and pauses (■, ■).
The output editing unit 50 combines and edits the outputs of the reading assignment unit 30 and the accent/pause assignment unit 44 (■). The edited result is sent to the synthesized speech generation unit 60 and becomes speech.
By narrowing down the unanalyzable characters in this way, even when a sentence containing an unanalyzable character string is input, the part of speech of that portion can be estimated appropriately, so a reading can be output in a form as close to natural as possible.
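The part-of-speech and guessed-reading steps performed by units 42 and 43 might look something like the sketch below. The table `POS_BEFORE_PARTICLE` and the function names are assumptions made for illustration; the patent only states that a part of speech connectable to the following particle is chosen and that a katakana string is read as written. Accent and pause assignment is not modeled.

```python
# Hypothetical connection table: part of speech assumed for an unknown string
# when it is immediately followed by the given particle.
POS_BEFORE_PARTICLE = {"の": "common noun"}


def is_katakana(s):
    """True if every character of s lies in the katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in s)


def annotate_unknown(unknown, next_word):
    """Guess a part of speech and a reading for an unanalyzable string."""
    pos = POS_BEFORE_PARTICLE.get(next_word)  # None when the table has no rule
    # A katakana string is read as written; other scripts get no guess here.
    reading = unknown if is_katakana(unknown) else None
    return {"surface": unknown, "pos": pos, "reading": reading}


print(annotate_unknown("キャルビック", "の"))
```

Returning `None` when no rule applies keeps the guess explicit, so a later stage can fall back to some default rather than silently trusting an arbitrary part of speech.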
FIG. 3 is a block diagram of another embodiment of the present invention, in which the invention is applied to an error detection system, and FIG. 4 shows a specific processing example keyed to the parts of FIG. 3.
In FIG. 3, 10 is a data input unit, 20 is a morphological analysis unit, 40 is an unanalyzable character string processing unit, and 70 is an error detection unit. The unanalyzable character string processing unit 40 contains the unanalyzable character limiting unit 41.
Suppose now that the Japanese sentence 「すべて分からないこたばかりだ。」 is read from the data input unit 10, and that the 「た」 in 「こたばかりだ」 is a typographical error (■).
The morphological analysis unit 20 analyzes this sentence and recognizes the adverb 「すべて」 and the phrase 「分からないこたばかりだ」, the latter as an unanalyzable character string (■). The unanalyzable character string 「分からないこたばかりだ」 is then sent to the unanalyzable character string processing unit 40.
In the unanalyzable character limiting unit 41 of the processing unit 40, analysis proceeds as far as the character at which it fails. The first failing character 「た」 is skipped (no 「た」 can connect to 「こ」, a form of the irregular ka-row verb), analysis is retried, and 「ばかりだ」 is analyzed successfully (■). By sending the result from the limiting unit 41 to the error detection unit 70, the single character 「た」 can be detected as an error (■). In this way, narrowing down the unanalyzable characters makes it possible to pinpoint the erroneous character.
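As a rough illustration of what the error detection unit 70 could do with the narrowed result, the sketch below turns a list of segments into character offsets within the original sentence. The segment format `(surface, analyzable)` and the function name `report_errors` are assumptions for this example; the segmentation itself is written out by hand to match the walkthrough above.

```python
def report_errors(segments):
    """Return (offset, surface) for every segment flagged as unanalyzable."""
    errors = []
    offset = 0
    for surface, analyzable in segments:
        if not analyzable:
            errors.append((offset, surface))
        offset += len(surface)  # advance by the segment's length in characters
    return errors


# Hand-written narrowing result for 「すべて分からないこたばかりだ」.
segments = [
    ("すべて", True), ("分から", True), ("ない", True), ("こ", True),
    ("た", False), ("ばかり", True), ("だ", True),
]
print(report_errors(segments))
```

Reporting an offset together with the flagged string lets a proofreading tool highlight a single character instead of the whole phrase.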
As explained above, in conventional analysis of Japanese sentences, when a sentence contained a word not in the dictionary, a typographical error, or the like, the entire minimum processing unit, such as the phrase, had to be treated as an unanalyzable character string. With the present invention, the unanalyzable characters within that unit can be further narrowed down and pinpointed, which improves the efficiency of Japanese sentence analysis. Moreover, the extracted unanalyzable string can be narrowed to the point where estimating its correct part of speech as a single part of speech causes no problem, and the string can then be processed under that estimated part of speech. This has the advantage that Japanese sentences can be analyzed while tolerating words not in the dictionary, typographical errors, and the like.
Ordinary text used in real work often contains words not in the dictionary, typographical errors, and the like, so the effect of the present invention is significant. For example, when applied to a Japanese speech output system, it allows a guessed reading close to the correct one even when an unanalyzable character string is present. In a Japanese error detection system, it allows errors to be detected with the range containing them narrowed down.
FIG. 1 is a block diagram of one embodiment of the present invention, FIG. 2 is a diagram showing a specific processing example for FIG. 1, FIG. 3 is a block diagram of another embodiment of the present invention, FIG. 4 is a diagram showing a specific processing example for FIG. 3, and FIG. 5 is a block diagram of the conventional method.
10: data input unit. 20: morphological analysis unit. 40: unanalyzable character string processing unit.
Claims (1)
(1) A Japanese sentence processing method for processing Japanese sentences by computer, characterized by comprising: a morphological analysis unit that recognizes and divides the Japanese sentence to be processed into words, phrases, and so on using a grammar dictionary, a word dictionary, and the like; and an unanalyzable character string processing unit that analyzes more deeply a character string that the morphological analysis unit cannot analyze because the Japanese sentence contains a word not in the dictionary or an error, thereby narrowing down and pinpointing the unanalyzable character string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP62315698A JPH0682364B2 (en) | 1987-12-14 | 1987-12-14 | Japanese sentence processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
JPH01156866A (en) | 1989-06-20 |
JPH0682364B2 (en) | 1994-10-19 |
Family
ID=18068475
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP62315698A Expired - Lifetime JPH0682364B2 (en) | 1987-12-14 | 1987-12-14 | Japanese sentence processing method |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPH0682364B2 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS61134877A (en) * | 1984-12-05 | 1986-06-21 | Ricoh Co Ltd | Unregistered word attribute estimating device |
JPS6290760A (en) * | 1985-10-16 | 1987-04-25 | Fujitsu Ltd | Sentence analysis system |
JPS62209659A (en) * | 1986-03-10 | 1987-09-14 | Sharp Corp | Correcting device for japanese sentence |
-
1987
- 1987-12-14 JP JP62315698A patent/JPH0682364B2/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
JPH0682364B2 (en) | 1994-10-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| FPAY | Renewal fee payment (event date is renewal date of database) | Free format text: PAYMENT UNTIL: 20071019; Year of fee payment: 13 |
| FPAY | Renewal fee payment (event date is renewal date of database) | Free format text: PAYMENT UNTIL: 20081019; Year of fee payment: 14 |
| EXPY | Cancellation because of completion of term | |