JPH0724054B2 - Data processing device

Info

Publication number: JPH0724054B2
Application number: JP59019969A
Authority: JP
Inventors: 良成平岡; 恭紀片山; 裕吉浦; 邦夫中西
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1984-02-08
Filing date: 1984-02-08
Publication date: 1995-03-15
Anticipated expiration: 2010-03-15
Also published as: JPS60164864A

Description

[Field of Application of the Invention]

The present invention relates to a text processing system that processes Japanese text written in a mixture of kana and kanji, and in particular to a device that detects or corrects spelling errors in such a text processing system.

[Background of the Invention]

With the progress of office automation (OA), text processing by computer has been attracting attention. In Europe and the United States in particular, partly because the alphabet has few letters, computer processing of text has become quite widespread, and systems that not only input, store, and output text but also detect and correct spelling errors in the input text are being commercialized. For example, TYPO, a UNIX utility program, detects spelling errors in English words, and the SPELL program on the DEC10 corrects some spelling errors in English words. Such English spelling verification systems are described in detail in, for example, J. L. Peterson, "Computer Programs for Detecting and Correcting Spelling Errors", CACM, Vol. 23, No. 12.

However, Japanese has the following characteristics:

(1) Three character types (kanji, hiragana, and katakana) are mixed.

(2) The number of kanji is extremely large.

(3) No space is inserted at word boundaries, so it is difficult to segment the text into words.

Because of these characteristics, text input methods and systems for detecting and correcting spelling errors are difficult to develop in Japan.

Furthermore, because the patterns of spelling errors in Japanese differ from those in Western languages, the processing methods developed in Europe and the United States cannot be applied, and a method specific to Japanese is required.

The patterns of spelling errors in Japanese also depend heavily on the input method. At present, the following Japanese input methods are known: (1) the touch-typing method, (2) the pen-touch method, (3) the kana-kanji conversion method, and (4) the handwritten character recognition method (JP-A-58-106663 and JP-A-55-143681).

The possible patterns of spelling errors, on the other hand, are (a) the user's memory errors, (b) typing mistakes, and (c) recognition errors by the device. In Japanese, errors of type (a) are mostly misspellings in which a homophone is used as a phonetic substitute (ateji), errors in okurigana, and the like. They are common with input methods (1), (2), and (4). With those variants of method (3) that accept phrase-by-phrase input, the user enters a phrase in kana, which are phonetic characters, and the system converts it into a character string mixed with kana and kanji, so this kind of error is less frequent than with the other methods. Errors of type (b) are peculiar mainly to methods (1) and (2). Errors of type (c) are a problem only with method (4).

As the above discussion makes clear, the kana-kanji conversion method (3), which is now widely used in Japanese word processors, is considered an input method in which errors occur relatively rarely. The handwritten character recognition method (4), however, is also applied to Japanese input because it is easy for beginners to use and because it is an application of character recognition technology. The present invention provides a device well suited to detecting and correcting errors of types (a) and (c) in Japanese input by the handwritten character recognition method (4).

[Object of the Invention]

The present invention is based on the above considerations. Its first object is to provide, in a text processing system that processes Japanese text mixed with kanji and kana, a means for detecting spelling errors contained in the Japanese text phrase by phrase (in units of bunsetsu). Its second object is to provide a means for correcting, among the spelling errors so detected, those in which a homophonous character has been used as a phonetic substitute.

[Outline of Invention]

The present invention is characterized by comprising: a means for dividing Japanese text mixed with kanji and kana, which may contain spelling errors, into words by referring to a word dictionary in kanji-kana notation; a means for presenting to the user a phrase for which the word division has failed; a conversion table for converting a phrase for which the word division has failed from kanji-kana notation into phonetic-character notation (for example, romaji); a means for dividing the resulting phrase in phonetic-character notation into words by referring to a word dictionary in phonetic-character notation; and a means for converting, when that division succeeds, the phrase in phonetic-character notation into kanji-kana notation and presenting it to the user as a corrected phrase.

[Example of Invention]

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the external appearance of the document processing workstation used as this embodiment. The workstation 1 has a display 2, a keyboard 3, a tablet 4 for handwritten character input, and a floppy storage device 6. The user enters Japanese text either by typing on the keyboard 3 or by writing by hand on the tablet 4 with a stylus pen. The entered Japanese text is displayed on the display 2 or stored in the floppy storage device 6. Japanese text that has been stored in the storage device 6 can also be displayed on the display 2.

FIG. 2 is a block diagram showing the workstation of FIG. 1 in detail. The workstation includes the display 2, a display buffer memory 7, the keyboard 3, the tablet 4, the floppy storage device 6, a processor 8, a handwritten character recognition device 9, and a printer 10.

The processor 8 is a device that performs various kinds of data processing, such as processing of the input text. It may be implemented as a microprocessor or a dedicated logic circuit, or as a host computer and a terminal device. The spelling error correction process of the present invention is executed by this processor.

The handwritten character recognition device 9 recognizes kanji and kana characters input from the tablet, encodes them, and sends them to the processor 8. A recognition rate of about 95 to 99% has been obtained.

The printer 10 is a device that prints the Japanese text processed by the processor 8.

The floppy storage device 6 stores the input Japanese text as well as the dictionaries and tables used for analyzing Japanese. Another external storage device, such as a disk storage device, may be used in place of the floppy storage device.

FIG. 3 shows an example of the display screen when spelling errors are detected and corrected according to the present invention. Japanese text entered from the keyboard 3, the tablet 4, or the floppy storage device 6 is displayed in the upper half of the screen. When a spelling check command is entered from the keyboard 3, the system analyzes the Japanese text one phrase at a time from the beginning, and when it finds an erroneous phrase, it underlines that phrase. In this example the erroneous phrase "専問家の" (a misspelling of "専門家の", "of an expert") has been detected. In response, the system displays a menu of three options in the lower half of the screen. The user selects 1 if what the system flagged as an error is a special term that is not in the system's dictionary, 2 if the spelling corrected by the system is right, and 3 to enter a new correct spelling by hand. When the user has made a selection, the system resumes the analysis from the next phrase.

FIG. 4 shows the procedure by which the processor 8 and the floppy storage device 6 carry out the spelling error detection and correction process described above. The character string to be checked is first stored in the character string buffer 20 and is then divided into phrases (bunsetsu) by the phrase segmentation means 21. A phrase consists of an independent word (which generally begins with kanji or katakana) followed by any number of attached words (in hiragana), so the text can easily be divided into phrases by detecting the points where the character type changes from hiragana to kanji or katakana.

Next, intra-phrase morphological analysis 23 is performed on each extracted phrase. Morphological analysis here means dividing a phrase into words, a concept used in the kana-kanji conversion processing of Japanese word processors. Because the text is written in a mixture of kanji and kana, the analysis uses the word dictionary 30 in kanji-kana notation and the inter-word connection matrix 34. If the morphological analysis succeeds, the phrase is judged to be correct and control passes to the phrase segmentation means 21 to extract the next phrase. If it fails, the phrase is judged to contain a phonetic-substitute misspelling, that is, a wrong character with the same sound. The phrase is then converted to romaji notation using the kanji-kana to romaji conversion means 24 and the kanji-kana to romaji correspondence table 33, and stored in the romaji phrase buffer 25.

The romaji phrase is then analyzed by the romaji morphological analysis means 26. Because this analysis is based on phonetic characters, a phrase whose sound is correct but whose written form is wrong is analyzed correctly. This analysis uses the word dictionary 32 in romaji notation and the inter-word connection matrix 34. If it fails, the original character string contained an error other than a phonetic-substitute misspelling, and the system only notifies the user that the phrase contains an error. If it succeeds, the correctly analyzed phrase is converted into kanji-kana mixed text by the romaji to kanji-kana conversion means 27, stored in the corrected character string buffer 28, and presented to the user. The conversion is performed by following the pointers 31 from words in the word dictionary 32 in romaji notation to words in the word dictionary 30 in kanji-kana notation. When several kanji-kana phrases correspond to the same romaji phrase, all of the candidates are presented to the user.
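The phrase segmentation rule described for means 21 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `char_type` and `split_bunsetsu` and the Unicode ranges used to classify characters are assumptions introduced here, and the rule is only the heuristic stated above (cut wherever hiragana is followed by kanji or katakana).

```python
def char_type(ch):
    """Classify one character by Unicode block (an assumed, simplified mapping)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_bunsetsu(text):
    """Cut the text at every point where hiragana is followed by kanji or katakana."""
    phrases = []
    start = 0
    for i in range(1, len(text)):
        prev_type = char_type(text[i - 1])
        next_type = char_type(text[i])
        if prev_type == "hiragana" and next_type in ("kanji", "katakana"):
            phrases.append(text[start:i])
            start = i
    if text:
        phrases.append(text[start:])
    return phrases


print(split_bunsetsu("専問家の意見を聞く"))
```

The heuristic deliberately ignores cases such as independent words written entirely in hiragana; a real system would need additional rules for those.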

FIG. 5 shows the flow of processing when the erroneous phrase "専問家の" is extracted. When the phrase "専問家の" stored in the phrase buffer 20 is morphologically analyzed, the analysis fails because "専問家" is not in the dictionary. The phrase is therefore converted into the romaji notation "SENMONKANO" using the kanji-kana to romaji conversion table 33. This time the morphological analysis succeeds because the noun "SENMONKA" is in the dictionary, and the phrase is divided into the noun "SENMONKA" and the case particle "NO" that follows the noun. Finally, these words are converted into the corrected character string "専門家の" by following the pointers 31 between the two dictionaries, and the result is presented to the user.
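The two-pass check walked through for FIG. 5 can be sketched as follows. This is an illustrative toy under stated assumptions, not the patent's implementation: all table contents, the one-reading-per-character table `READINGS`, the part-of-speech labels, and the function names `segment` and `check_phrase` are hypothetical, and the longest-match search with backtracking is a stand-in for whatever analysis method the device actually uses.

```python
# Hypothetical toy tables standing in for the numbered elements of FIG. 4.
KANJI_KANA_DICT = {"専門家": "noun", "の": "particle"}        # word dictionary 30
ROMAJI_DICT = {                                               # word dictionary 32
    "SENMONKA": ("noun", ["専門家"]),                         # list = pointers 31
    "NO": ("particle", ["の"]),
}
READINGS = {"専": "SEN", "問": "MON", "門": "MON", "家": "KA", "の": "NO"}  # table 33
CONNECTABLE = {("noun", "particle")}                          # connection matrix 34


def segment(text, lexicon, pos_of):
    """Split text into dictionary words whose adjacent parts of speech may connect."""
    def search(pos, prev_tag):
        if pos == len(text):
            return []
        for end in range(len(text), pos, -1):  # try the longest word first
            word = text[pos:end]
            if word not in lexicon:
                continue
            tag = pos_of(word)
            if prev_tag is not None and (prev_tag, tag) not in CONNECTABLE:
                continue
            rest = search(end, tag)
            if rest is not None:
                return [word] + rest
        return None  # no segmentation found from this position
    return search(0, None)


def check_phrase(phrase):
    """Return ('ok', phrase), ('corrected', candidates) or ('error', phrase)."""
    if segment(phrase, KANJI_KANA_DICT, lambda w: KANJI_KANA_DICT[w]) is not None:
        return ("ok", phrase)
    if any(ch not in READINGS for ch in phrase):
        return ("error", phrase)
    romaji = "".join(READINGS[ch] for ch in phrase)
    words = segment(romaji, ROMAJI_DICT, lambda w: ROMAJI_DICT[w][0])
    if words is None:
        return ("error", phrase)
    candidates = [""]
    for word in words:  # follow the pointers back to kanji-kana spellings
        candidates = [c + s for c in candidates for s in ROMAJI_DICT[word][1]]
    return ("corrected", candidates)


print(check_phrase("専問家の"))
```

Returning a list of candidates rather than a single string mirrors the statement that several kanji-kana phrases may share one romaji form, in which case all of them are offered to the user.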

FIG. 6 shows the flow of processing when the erroneous phrase "恥ぢない" is extracted. In this case too, the morphological analysis in kanji-kana notation fails because the word "恥ぢる" is not in the dictionary. When the phrase is converted to romaji notation, however, the morphological analysis succeeds, and the phrase is divided into the irrealis form of the verb "HAJIRU" and the auxiliary verb "NAI". Finally, this is converted into the phrase "恥じる" in kanji-kana notation and presented to the user as the corrected phrase.
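The reason the phonetic pass recovers this kind of error is that the conversion table maps different characters with the same sound to the same romaji. The sketch below illustrates that point only. The table entries, the list `KNOWN_PHRASES`, and the functions `to_romaji`, `build_index`, and `suggest` are assumptions made for illustration, and indexing whole phrases is a simplification of the word-level dictionaries described above.

```python
from collections import defaultdict

# Hypothetical excerpt of a conversion table: "ぢ" and "じ" share the reading JI.
ROMAJI_TABLE = {"恥": "HA", "ぢ": "JI", "じ": "JI", "な": "NA", "い": "I"}

# Hypothetical list of correctly spelled phrases known to the system.
KNOWN_PHRASES = ["恥じない"]


def to_romaji(phrase):
    """Convert a phrase character by character using the table."""
    return "".join(ROMAJI_TABLE[ch] for ch in phrase)


def build_index(phrases):
    """Group correct spellings by their romaji form."""
    index = defaultdict(list)
    for phrase in phrases:
        index[to_romaji(phrase)].append(phrase)
    return index


def suggest(phrase, index):
    """Return the correct spellings that sound the same as the given phrase."""
    return index.get(to_romaji(phrase), [])


index = build_index(KNOWN_PHRASES)
print(to_romaji("恥ぢない"), suggest("恥ぢない", index))
```

Because the misspelled and the correct form collapse to one romaji string, a lookup keyed on that string finds the correct spelling even though the kanji-kana lookup fails.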

Romaji has been used here as the example of phonetic characters, but kana characters may be used instead.

[Effect of the Invention]

As described above, the present invention first has the effect that spelling errors in Japanese text mixed with kana and kanji can be detected. Such errors include typing mistakes, recognition errors in handwritten character recognition, and the user's memory errors. Second, among these errors, it has the effect of correcting spelling errors in which a homophonous character has mistakenly been used as a phonetic substitute.

[Brief description of drawings]

The drawings show an embodiment of the present invention. FIG. 1 is an external view of the system, FIG. 2 is a block diagram, FIG. 3 shows an example of the display, FIG. 4 is a functional diagram showing the data flow within the processor and the floppy storage device of FIG. 2, FIG. 5 illustrates the processing when the erroneous character string "専問家の" is input, and FIG. 6 illustrates the processing when the erroneous character string "恥ぢる" is input.

1 ... workstation, 2 ... display screen, 3 ... keyboard, 4 ... tablet, 5 ... stylus pen, 6 ... floppy storage device, 7 ... display buffer memory, 8 ... processor, 9 ... handwritten character recognition device, 10 ... printer, 20 ... character string buffer, 21 ... phrase segmentation means, 22 ... kanji-kana mixed phrase buffer, 23 ... morphological analysis means for kanji-kana notation, 24 ... kanji-kana to romaji conversion means, 25 ... romaji buffer, 26 ... morphological analysis means for romaji notation, 27 ... romaji to kanji-kana conversion means, 28 ... corrected character string buffer, 29 ... document memory, 30 ... word dictionary in kanji-kana notation, 31 ... pointer, 32 ... word dictionary in romaji notation, 33 ... kanji-kana to romaji correspondence table, 34 ... inter-word connection matrix.

Claims

1. A data processing device comprising: handwriting input means for inputting at least kanji-kana mixed text using handwritten characters; storage means for storing a word dictionary in kanji-kana notation and a word dictionary in phonetic-character notation; and a processor that executes a first conversion process for converting part or all of the input kanji-kana mixed text into phonetic-character notation, a second conversion process for converting phonetic characters into kanji-kana mixed text, a first division process for dividing the handwritten-input kanji-kana mixed text into words while referring to the word dictionary in kanji-kana notation, and a second division process for dividing text into words while referring to the word dictionary in phonetic-character notation, wherein the first division process is executed and, when the first division process fails, the first conversion means converts part or all of the kanji-kana mixed text involved in the failure into phonetic-character notation, the second division process is applied to the converted phonetic-character notation, and the second conversion means converts the result of the second division process into kanji-kana mixed text.

2. The data processing device according to claim 1, further comprising a printer or a display for outputting the processing results of the first division process and the second division process, wherein, when the processing result of the second division process is output, content prompting the selection of a homophone is output to the printer or the display.