JPH01156866A

JPH01156866A - Processing system for japanese sentence

Info

Publication number: JPH01156866A
Application number: JP62315698A
Authority: JP
Inventors: Tsuneo Yasuda; 安田　恒雄; Katsumi Shimazaki; 島崎　勝美; Shinichiro Takagi; 伸一郎高木; Satoru Ikehara; 池原　悟
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1987-12-14
Filing date: 1987-12-14
Publication date: 1989-06-20
Anticipated expiration: 2009-10-19
Also published as: JPH0682364B2

Abstract

PURPOSE: To minimize the influence on the analysis of a Japanese sentence as a whole by providing an unanalyzable character string processing part that further analyzes character strings that cannot be analyzed because the sentence contains words not included in the dictionary or contains errors. CONSTITUTION: An unanalyzable character string processing part 40 analyzes, as far as possible, an unanalyzable character string detected by a morpheme analyzing part 20. The unanalyzable characters are thereby limited and specified, which improves the efficiency of analyzing Japanese sentences. Furthermore, the extracted unanalyzable character string can be narrowed to the point where estimating its part of speech as a single part of speech causes no problem, so the string can be assigned a suitable part of speech and processed. As a result, Japanese sentences can be analyzed even when they include words not contained in the dictionary, errors, and the like.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a method for analyzing Japanese sentences by computer, and more specifically to a Japanese sentence processing method that improves analysis accuracy even for Japanese sentences containing words not found in the dictionary, typographical errors, and the like.

[Conventional technology]

FIG. 5 shows a schematic block diagram of a conventional method for analyzing Japanese sentences by computer. In FIG. 5, 10 is a data input unit, 20 is a morphological analysis unit, and 50 is an output editing unit.

The data input unit 10 consists of a kanji OCR, pen-touch input, tablet, keyboard, or the like, and reads the Japanese character string to be analyzed. The morphological analysis unit 20 performs morphological analysis on this character string using a grammar dictionary and a word dictionary prepared in advance, identifying words, splitting the text into clauses, and so on. The output editing unit 50 edits and outputs the Japanese character string according to the processing result of the morphological analysis unit 20. Here, when a Japanese sentence contains a word not in the dictionary or a typographical error, the morphological analysis unit 20 detects the entire clause, which is its unit of analysis, as an unanalyzable character string, and the output editing unit 50 outputs that unanalyzable character string as is, clause by clause.

[Problem that the invention seeks to solve]

In the conventional technology, when a computer analyzes a Japanese sentence that contains errors such as input mistakes, or words not in the dictionary, the effect cannot be contained, and the clause or the entire sentence containing that part becomes an unanalyzable character string as is. For example, consider a speech output system that analyzes a Japanese sentence, identifies it word by word, assigns readings, accents, and pauses, and reads it aloud with synthesized speech. Even when only a single location is actually unanalyzable, that location cannot be pinpointed, so the reading of the whole clause or sentence goes wrong, lowering the correct reading rate.

Likewise, an automatic Japanese error detection system, which exploits the fact that an error may exist wherever analysis fails, can point out only long character strings. When such a system is used to proofread newspaper articles, for example, a large volume of data must be processed in a short time, so the extra work of locating the error within a long character string reduces the labor-saving effect.

An object of the present invention is to provide a Japanese sentence processing method that, even for sentences containing words not in the dictionary or errors, limits and specifies the affected character string as far as possible, thereby minimizing the impact on the analysis of the sentence as a whole.

[Means for solving problems]

In a method of processing Japanese sentences by computer, the present invention provides, in addition to a morphological analysis unit that identifies and divides the Japanese sentence to be processed into words, clauses, and the like, an unanalyzable character string processing unit that analyzes more deeply any character string that cannot be analyzed because the sentence contains a word not in the dictionary or an error.

[Operation]

The unanalyzable character string processing unit takes an unanalyzable character string detected by the morphological analysis unit and analyzes it from the beginning, going as far as it can be analyzed.

If the character string that still cannot be analyzed begins with a katakana or alphabetic character, the entire run of consecutive characters of the same character type is detected as the unanalyzable character string. For other character types (kanji, hiragana), analysis is retried starting from the character after the first one; if the rest can then be analyzed successfully, the single character that was skipped and excluded from analysis is taken to be the unanalyzable part. If the following character string still cannot be analyzed, the next leading character is also skipped and analysis starts from the character after it. This is repeated until the rest can be analyzed or the end of the clause or sentence is reached, and the skipped character string is detected as unanalyzable.
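The limiting procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `char_class`, `longest_prefix_parse`, and `localize_unparsable` are hypothetical, a greedy longest-match lookup in a toy word set stands in for the grammar-and-dictionary analysis of the morphological analysis unit, and "the rest can be analyzed" is simplified to "at least one dictionary word matches at the next position".

```python
def char_class(ch):
    """Classify a character as 'katakana', 'alphabet', or 'other' (kanji, hiragana, etc.)."""
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if ("a" <= ch.lower() <= "z") or ("\uff21" <= ch <= "\uff3a") or ("\uff41" <= ch <= "\uff5a"):
        return "alphabet"
    return "other"


def longest_prefix_parse(text, start, lexicon):
    """Greedy longest-match segmentation from `start`.

    Returns (words, stop), where `stop` is the index at which matching failed
    (or len(text) if everything was matched).
    """
    words, pos = [], start
    while pos < len(text):
        match = None
        for end in range(len(text), pos, -1):
            if text[pos:end] in lexicon:
                match = text[pos:end]
                break
        if match is None:
            break
        words.append(match)
        pos += len(match)
    return words, pos


def localize_unparsable(text, lexicon):
    """Split `text` into (substring, parsed_ok) pairs, keeping unparsable spans short."""
    result, pos = [], 0
    while pos < len(text):
        # Analyze from the current position as far as possible.
        words, pos = longest_prefix_parse(text, pos, lexicon)
        result.extend((w, True) for w in words)
        if pos >= len(text):
            break
        kind = char_class(text[pos])
        end = pos + 1
        if kind in ("katakana", "alphabet"):
            # Take the whole run of the same character type as one unparsable unit.
            while end < len(text) and char_class(text[end]) == kind:
                end += 1
        else:
            # Skip one character at a time until something parses again or the text ends.
            while end < len(text) and not longest_prefix_parse(text, end, lexicon)[0]:
                end += 1
        result.append((text[pos:end], False))
        pos = end
    return result
```

With the toy word set `{"の", "ように", "軽い"}`, calling `localize_unparsable("キャルビックのように軽い", ...)` marks only the katakana run 「キャルビック」 as unparsable and segments the remainder. A real analyzer would also check grammatical connectivity between adjacent words, which this sketch does not model.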

Because the long unanalyzable character string detected by the morphological analysis unit is analyzed further in this way, the location of the error can be pinpointed and the unanalyzable portion can be narrowed down to a short character string.

[Embodiments]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of one embodiment of the present invention, applied to a speech output system, and FIG. 2 shows a concrete processing example keyed to the parts of FIG. 1.

In FIG. 1, 10 is a Japanese sentence data input unit, 20 is a morphological analysis unit, 30 is a reading assignment unit, 40 is an unanalyzable character string processing unit, 50 is an output editing unit, and 60 is a synthesized speech generation unit. The unanalyzable character string processing unit 40 is further divided into an unanalyzable character limiting unit 41, an unanalyzable character part-of-speech assignment unit 42, an approximate reading (ate-yomi) assignment unit 43, and an accent/pause assignment unit 44.

Suppose the data input unit 10 reads the Japanese sentence (Japanese character string) 「キャルビックのように軽い」 ("light like kyarubikku") (■). Assume here that 「キャルビック」 is a word not in the dictionary.

The morphological analysis unit 20 uses the grammar dictionary and word dictionary prepared in advance to divide 「キャルビックのように軽い」 into words and clauses. Because 「キャルビック」 is not in the dictionary, the clause 「キャルビックのように」 is treated as an unanalyzable character string, while 「軽い」 is identified as an adjective (■). For the word 「軽い」, which the morphological analysis unit 20 was able to identify, the reading assignment unit 30 assigns a reading and appropriate accent and pause information (■).

Meanwhile, the unanalyzable character string 「キャルビックのように」 is sent to the unanalyzable character string processing unit 40. First, the unanalyzable character limiting unit 41 narrows down the unanalyzable characters. In this case, when 「キャルビック」, which begins with a katakana character, is taken as a single unanalyzable unit, 「のように」 can be analyzed into the case particle 「の」 and the auxiliary verb 「ように」 (■). Next, so that accents and pauses can be assigned to the unanalyzable character string as naturally as possible, the unanalyzable character part-of-speech assignment unit 42 gives 「キャルビック」 a suitable part of speech that can connect to the case particle 「の」 (here, common noun) (■), so that it is treated as if it were a common noun found in the dictionary. The approximate reading assignment unit 43 and the accent/pause assignment unit 44 then assign a suitable approximate reading (here a katakana reading, since the characters are katakana) and an accent and pause (■, ■).
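The part-of-speech and approximate-reading step for the katakana example above might look like the following Python sketch. Everything here is an assumption made for illustration: the table `ASSUMED_POS_BEFORE`, the tag names, and the function `annotate_unknown` do not come from the patent, which states only that a part of speech connectable to the following case particle (a common noun in the example) is chosen and that a katakana string is read as written. Accent and pause assignment is left out because the text gives no rule for it.

```python
# Hypothetical connection table: part of speech of the FOLLOWING word
# -> part of speech to assume for the unknown string that precedes it.
ASSUMED_POS_BEFORE = {
    "case_particle": "common_noun",  # the only pairing given in the example
}


def is_katakana(text):
    """True if every character lies in the Unicode katakana block."""
    return bool(text) and all("\u30a0" <= ch <= "\u30ff" for ch in text)


def annotate_unknown(surface, next_pos):
    """Attach an assumed part of speech and an approximate reading to an unknown string.

    The part of speech is None when the table has no entry for `next_pos`;
    the reading is None for non-katakana strings, since no rule is given for them.
    """
    pos = ASSUMED_POS_BEFORE.get(next_pos)
    reading = surface if is_katakana(surface) else None
    return {"surface": surface, "pos": pos, "reading": reading}
```

For example, `annotate_unknown("キャルビック", "case_particle")` yields a common noun whose reading is the katakana string itself.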

The output editing unit 50 merges and edits the outputs of the reading assignment unit 30 and the accent/pause assignment unit 44 (■). The edited result is sent to the synthesized speech generation unit 60 and rendered as speech.

By limiting the unanalyzable characters in this way, even when a sentence containing an unanalyzable character string is input, the part of speech of that portion can be estimated reasonably, so a reading as close to natural as possible can be output.

FIG. 3 is a block diagram of another embodiment of the present invention, in which the invention is applied to an error detection system, and FIG. 4 shows a concrete processing example keyed to the parts of FIG. 3.

In FIG. 3, 10 is a data input unit, 20 is a morphological analysis unit, 40 is an unanalyzable character string processing unit, and 70 is an error detection unit. The unanalyzable character string processing unit 40 contains an unanalyzable character limiting unit 41.

Suppose the data input unit 10 reads the Japanese sentence 「すべて分からないこたばかりだ。」, in which the 「た」 of 「こたばかりだ」 is a typographical error (■).

The morphological analysis unit 20 analyzes this sentence and identifies the adverb 「すべて」 and the clause 「分からないこたばかりだ」, which is an unanalyzable character string (■). The unanalyzable character string 「分からないこたばかりだ」 is then sent to the unanalyzable character string processing unit 40.

In the unanalyzable character limiting unit 41 of the processing unit 40, analysis proceeds as far as the character at which it fails. The first failing character 「た」 is skipped (no 「た」 can connect to 「こ」, a form of the ka-row irregular verb), analysis is retried, and 「ばかりだ」 is analyzed successfully (■). By sending the analysis result of the unanalyzable character limiting unit 41 to the error detection unit 70, the single character 「た」 can be detected as an error (■). By limiting the unanalyzable characters in this way, the erroneous character can be pinpointed.
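How an error detection unit might turn the limited result into a report can be sketched in Python as below. This is a hypothetical illustration: the function name `report_errors` and the `(substring, parsed_ok)` input format are assumptions, not something the patent specifies. The sketch simply converts the segments produced by the limiting step into character offsets of the suspected errors.

```python
def report_errors(segments):
    """Return (start, end, text) for every segment flagged as unparsable.

    `segments` is a list of (substring, parsed_ok) pairs covering the sentence
    in order; offsets are character positions within the original sentence.
    """
    reports = []
    offset = 0
    for text, parsed_ok in segments:
        if not parsed_ok:
            reports.append((offset, offset + len(text), text))
        offset += len(text)
    return reports


# Segments for the example sentence, with only the character 「た」 flagged.
example = [
    ("すべて", True),
    ("分からない", True),
    ("こ", True),
    ("た", False),
    ("ばかり", True),
    ("だ", True),
    ("。", True),
]
suspects = report_errors(example)
```

Reporting offsets rather than whole clauses is what lets a proofreader jump straight to the suspect character instead of rereading the entire clause.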

[Effect of the Invention]

As explained above, in the conventional analysis of Japanese sentences, when a sentence contained a word not in the dictionary, a typographical error, or the like, the entire minimum unit of processing, such as a clause, had to be treated as an unanalyzable character string. In the present invention, the unanalyzable characters within that unit can be further limited and specified, which improves the efficiency of analyzing Japanese sentences. Furthermore, the extracted unanalyzable character string can be narrowed to the point where estimating its correct part of speech as a single part of speech causes no problem, so the string can be assigned a suitable part of speech and processed. This has the advantage that Japanese sentences can be analyzed while tolerating words not in the dictionary, typographical errors, and the like.

Ordinary text used in real-world work often contains sentences with words not in the dictionary, typographical errors, and the like, so the effect of the present invention is highly significant. For example, when applied to a Japanese speech output system, it allows an approximate reading close to the correct one even when an unanalyzable character string is present. In a Japanese error detection system, it allows errors to be detected with the range containing them narrowed down.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of one embodiment of the present invention, FIG. 2 shows a concrete processing example for FIG. 1, FIG. 3 is a block diagram of another embodiment of the present invention, FIG. 4 shows a concrete processing example for FIG. 3, and FIG. 5 is a block diagram of a conventional method. 10: data input unit; 20: morphological analysis unit; 40: unanalyzable character string processing unit.

Claims

[Claims]

(1) A Japanese sentence processing method for processing Japanese sentences by computer, characterized by comprising: a morphological analysis unit that uses a grammar dictionary, a word dictionary, and the like to identify and divide the Japanese sentence to be processed into words, clauses, and the like; and an unanalyzable character string processing unit that analyzes more deeply any character string that the morphological analysis unit cannot analyze because the Japanese sentence contains a word not in the dictionary or an error, thereby limiting and specifying the unanalyzable character string.