JPH0432958A

JPH0432958A - Japanese sentence error word detecting device

Info

Publication number: JPH0432958A
Application number: JP2133319A
Authority: JP
Inventors: Shiyou Imagou; 詔今郷
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-05-22
Filing date: 1990-05-22
Publication date: 1992-02-04

Abstract

PURPOSE: To reduce the load on the user by pointing out only positions that may be wrong. This is done by detecting, as wrong words, the partial character strings that are used only once among those counted by a partial character string counting means. CONSTITUTION: A document dividing means 1 divides a document at the change points between two kinds of characters, i.e. from katakana (the angular Japanese syllabary) to other characters and from other characters to katakana. A partial character string counting means 2 then counts, only for the partial character strings consisting solely of katakana, how many times each string is used in the document. Katakana partial character strings used only once in the document are recognized as wrong words, and an error detecting means 3 displays the detected wrong words to the user. Consequently, only positions that may be wrong are pointed out and the load on the user is reduced.

Description

[Detailed Description of the Invention]

[Technical Field]

The present invention relates to a Japanese sentence error word detection device that detects incorrectly written words in Japanese text represented electronically as character codes. In particular, it targets text entered from a keyboard, for example on a Japanese word processor, and detects words that may be miswritten because of typing errors or kana-kanji conversion errors. It can be applied, for example, as a proofreading support function in Japanese word processors.

[Prior Art]

A known document describing an example of the prior art related to the present invention, that is, an error detection method for Japanese text, is "Japanese Text Creation Support System COMET" (Fukushima and three others, IEICE Technical Report 0586-21, 1986).

The error detection methods described in that document are explained below. The first method treats any portion where morphological analysis fails as an error. Morphological analysis is a widely known technique that divides text into words and identifies the part of speech of each word. It uses a word dictionary, which records the written form and part of speech of tens of thousands to hundreds of thousands of words, and a connection table, which records whether each pair of parts of speech can be grammatically connected.

Using the connection table, a given sentence is decomposed into a sequence of words that can be grammatically connected to one another, and the part of speech of each word is identified at the same time. A portion of the text where morphological analysis fails either uses a word that is not registered in the word dictionary or contains some kind of error. If a sufficient number of words are registered in the word dictionary, a portion where morphological analysis fails may therefore be treated as an error.
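As an illustration of this prior-art idea, the following is a minimal sketch of dictionary-plus-connection-table analysis. The function name `analyze`, the tiny dictionary, the part-of-speech labels, and the `BOS`/`EOS` markers are all assumptions made for this example; they are not taken from the cited document. Analysis "fails" when no sequence of dictionary words with permitted connections covers the input.

```python
from functools import lru_cache


def analyze(text, dictionary, connections):
    """Return a list of (word, pos) covering `text`, or None if analysis fails.

    dictionary:  maps a written form to a tuple of possible parts of speech.
    connections: set of (left_pos, right_pos) pairs that may be adjacent.
    """

    @lru_cache(maxsize=None)
    def search(index, prev_pos):
        if index == len(text):
            # The last word must be allowed to end the sentence.
            return () if (prev_pos, "EOS") in connections else None
        # Try longer candidate words first.
        for end in range(len(text), index, -1):
            word = text[index:end]
            for pos in dictionary.get(word, ()):
                if (prev_pos, pos) in connections:
                    rest = search(end, pos)
                    if rest is not None:
                        return ((word, pos),) + rest
        return None

    result = search(0, "BOS")
    return None if result is None else list(result)


# Hypothetical toy data, for illustration only.
DICTIONARY = {
    "公園": ("noun",),
    "前": ("suffix",),
    "に": ("particle",),
    "集合": ("noun",),
}
CONNECTIONS = {
    ("BOS", "noun"),
    ("noun", "suffix"),
    ("suffix", "particle"),
    ("particle", "noun"),
    ("noun", "EOS"),
}

print(analyze("公園前に集合", DICTIONARY, CONNECTIONS))
print(analyze("コンプータ", DICTIONARY, CONNECTIONS))  # unregistered word: analysis fails
```

With this toy data the first call finds a segmentation and the second returns `None`, which is the condition the prior-art method reports as an error.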

The second method lets a human find errors by means of a KWIC display. KWIC stands for KeyWord In Context: each word (character string) used in the text is displayed together with the character strings before and after it. The entries are displayed in an order such as character code order, which makes it easier to spot unfamiliar character strings that are likely to contain errors.
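A KWIC listing of the kind described can be sketched as follows. This is only an illustration: the function name `kwic`, the context width, and the bracket layout are assumptions, and the keywords are supplied by hand rather than obtained from a real word segmenter.

```python
def kwic(text, keywords, width=5):
    """Return one line per keyword occurrence with `width` characters of context.

    Lines are ordered by keyword in character code order, then by position.
    """
    rows = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            left = text[max(0, start - width):start]
            right = text[start + len(kw):start + len(kw) + width]
            rows.append((kw, start, f"{left:>{width}} [{kw}] {right}"))
            start = text.find(kw, start + 1)
    rows.sort(key=lambda row: (row[0], row[1]))
    return [row[2] for row in rows]


sample = "コンピュータは速い。コンプータは遅い。"
for line in kwic(sample, ["コンピュータ", "コンプータ"]):
    print(line)
```

Sorting by keyword places similar spellings next to each other, which is what lets a reader notice an unfamiliar variant; the reader still has to scan every line.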

However, the method that treats portions where morphological analysis fails as errors has the following drawbacks. The first is insufficient error detection ability: even when the text contains an error, the error often goes undetected. In particular, a kanji compound in which one part has been replaced by a different, incorrect kanji often cannot be detected. For example, if "講演前に" ("before the lecture") is mistakenly entered as "公園前" ("in front of the park"), morphological analysis can still divide it into "公園" (noun) + "前" (suffix), so it cannot be detected as an error.

The second drawback is that errors are detected excessively. Word dictionaries generally contain tens of thousands of words or more, yet many words in actual use are not registered. Words written in katakana in particular are often used for technical terms and new concepts, so many of them are missing from word dictionaries. This method therefore wrongly reports many katakana words as errors.

With the method in which a human finds errors from a KWIC display, the human must examine the entire, enormous display, which places an excessive burden on the user. Such a function is therefore expected to go almost unused even if it is provided. In principle every error could be found this way, but in practice oversights always occur when a human does the checking, and the more there is to check, the more oversights there are.

[Object]

The present invention was made in view of the circumstances described above. Its object is to provide a Japanese sentence error word detection device that reduces the burden on the human by detecting only the portions that may contain errors, that produces fewer false detections, and that can also detect portions that are syntactically correct but actually wrong.

[Constitution]

To achieve the above object, the present invention is characterized as follows. (1) A Japanese sentence error word detection device that detects error words in Japanese text represented electronically as character codes comprises: a sentence dividing means that divides the text into partial character strings at character type change points; a partial character string counting means that counts the number of occurrences of each katakana partial character string; and an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one. Alternatively, (2) such a device comprises: a sentence dividing means that performs morphological analysis on the text and divides it into words; a partial character string counting means that counts the number of occurrences of each word string corresponding to a noun; and an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one. Furthermore, (3) the device has an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one and which is not registered in the word dictionary. The invention is described below on the basis of embodiments.

FIG. 1 is a block diagram for explaining an embodiment of the Japanese sentence error word detection device according to the present invention. In the figure, 1 is a sentence dividing means, 2 is a partial character string counting means, and 3 is an error detection means.

The sentence dividing means 1 divides a given Japanese text into partial character strings that roughly correspond to words. The text is assumed to be represented as character codes, in a form that can be processed electronically. Each character string produced by the sentence dividing means is called a partial character string. The partial character string counting means 2 considers only the partial character strings that satisfy a certain condition and counts how many times each one appears in the text. The error detection means 3 basically detects, as an error word, any counted partial character string that is used only once in the text. This rests on the expectation that ordinary words are often used several times in a text, whereas a word containing a particular error appears only once.

FIG. 2 is a flowchart for explaining the operation of the Japanese sentence error word detection device according to the present invention. The steps are explained in order below.

First step: The given Japanese text is divided at character type change points. That is, the text is divided at two kinds of change points: from katakana to other characters, and from other characters to katakana. The resulting partial character strings are of two kinds: strings consisting only of katakana, and strings consisting only of characters other than katakana.
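The division at katakana boundaries can be sketched as below. The helper names are hypothetical, and the definition of "katakana" used here (full-width code points U+30A1 to U+30FA plus the long-vowel mark U+30FC) is an assumption for the example; the source does not specify which character codes count as katakana.

```python
from itertools import groupby


def is_katakana(ch):
    # Assumption: full-width katakana letters and the prolonged sound mark.
    return "\u30a1" <= ch <= "\u30fa" or ch == "\u30fc"


def split_at_katakana_boundaries(text):
    """Split `text` wherever the character type changes between katakana and non-katakana.

    Returns a list of (is_katakana_run, substring) pairs in document order.
    """
    return [(flag, "".join(chars)) for flag, chars in groupby(text, key=is_katakana)]


print(split_at_katakana_boundaries("新しいコンピュータを買う"))
```

Every piece produced this way consists either only of katakana or only of other characters, matching the two kinds of partial character strings described above.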

Second step: The katakana partial character strings are counted. That is, considering only the partial character strings that consist solely of katakana, the device counts how many times each string is used in the text.

Third step: The device judges whether each counted katakana partial character string is used only once in the text. A katakana partial character string used only once is recognized as an error word.

Fourth step: When a counted katakana partial character string is used only once in the text, the detected error word is presented to the user, who corrects it if necessary.

Words written in katakana often represent technical terms or new concepts and are often important words in the text. It is rare for a katakana word to appear only once in a text; normally it is used many times.

In contrast, an error word caused by a typing mistake or the like, such as "コンプータ" (a mistyping of "コンピュータ", "computer"), almost never appears more than once in a single text. The method above can therefore detect most katakana error words. This concludes the description corresponding to claim 1.
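The whole procedure for claim 1 (divide, count katakana strings, flag those used once) can be sketched in a few lines. This is an illustrative sketch, not the patented implementation: the function name is hypothetical, the katakana range in the regular expression is an assumption, and extracting maximal katakana runs with a regular expression is used here as an equivalent of dividing at character type change points and keeping only the katakana pieces.

```python
import re
from collections import Counter

# Assumption: katakana means U+30A1..U+30FA plus the long-vowel mark U+30FC.
KATAKANA_RUN = re.compile(r"[\u30a1-\u30fa\u30fc]+")


def detect_singleton_katakana(text):
    """Return katakana strings that occur exactly once in `text`, in order of first appearance."""
    counts = Counter(KATAKANA_RUN.findall(text))
    return [word for word, n in counts.items() if n == 1]


document = "コンピュータは便利だ。コンピュータを使う。コンプータは速い。"
print(detect_singleton_katakana(document))
```

Note that no dictionary is consulted: a correctly spelled katakana word that happens to be used only once would also be reported, which is the trade-off the description accepts.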

FIG. 3 is another flowchart for explaining the operation of the Japanese sentence error word detection device according to the present invention. The steps are explained in order below.

First step: Morphological analysis divides the text into words and identifies their parts of speech. The morphological analysis used here first identifies phrase (bunsetsu) boundaries using only punctuation and character type information, and then determines the part of speech and conjugated form of each word within a phrase. This kind of morphological analysis is known, for example, from "Storage of a Japanese Dictionary and Automatic Segmentation of Japanese Sentences" (Nagao and three others, Joho Shori (Information Processing), Vol. 19, No. 6, 1978).

Second step: Partial character strings are created. Consecutive prefixes, nouns, and suffixes are combined into one partial character string. More precisely, one partial character string is the longest continuous word sequence that begins with a prefix or a noun, continues with any of prefix, noun, or suffix, and ends with a noun or a suffix. A suffix is assumed never to follow a prefix directly. Words whose part of speech is other than prefix, noun, or suffix do not form partial character strings.

For example, the sentence "ついに世界新記録で、" ("At last, with a new world record,") is morphologically analyzed as "ついに／世界／新／記録／で／、", and the partial character string "世界新記録" ("new world record") is extracted.
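The grouping rule can be sketched as follows, taking already-analyzed words as input. The function name, the part-of-speech labels, and the tagging of "新" as a prefix are assumptions for this example; a real system would obtain the tags from its morphological analyzer. Following the description, the input is assumed never to contain a suffix directly after a prefix.

```python
NOMINAL = {"prefix", "noun", "suffix"}


def noun_substrings(tokens):
    """Merge consecutive prefix/noun/suffix words into partial character strings.

    tokens: list of (surface, pos) pairs in document order.
    Each result starts with a prefix or noun and ends with a noun or suffix.
    """
    results, run = [], []
    # A sentinel with a non-nominal tag flushes the final run.
    for surface, pos in list(tokens) + [("", "end")]:
        if pos in NOMINAL:
            run.append((surface, pos))
            continue
        # Trim the run so that it starts and ends with a permitted part of speech.
        while run and run[0][1] == "suffix":
            run.pop(0)
        while run and run[-1][1] == "prefix":
            run.pop()
        if run:
            results.append("".join(s for s, _ in run))
        run = []
    return results


tagged = [
    ("ついに", "adverb"),
    ("世界", "noun"),
    ("新", "prefix"),
    ("記録", "noun"),
    ("で", "particle"),
    ("、", "symbol"),
]
print(noun_substrings(tagged))
```

Merging the run into a single string is what lets a compound be counted as one unit, rather than as the separate common words it is built from.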

Third step: The partial character strings are counted.

Fourth step: The device judges whether each counted partial character string is used only once in the text. Partial character strings used only once are extracted as error word candidates.

Fifth step: When a counted partial character string is used only once in the text, the device then judges whether it is absent from the word dictionary.

Sixth step: If it is not registered in the word dictionary, it is presented to the user as an error. That is, an error word candidate is recognized as an error only when it is not registered in the word dictionary used for morphological analysis.

The word dictionary contains most nouns that are not compounds. A word registered there is therefore unlikely to be an error even if it appears only once in the text. An unregistered word is either a compound word or an error word.

Like katakana words, compound words often represent technical terms or new concepts and are often important words in the text. It is rare for a compound word to appear only once in a text; normally it is used many times.

In contrast, an error word caused by a typing mistake or a kana-kanji conversion error, as when "世界新記録で" is mistakenly entered as "世界新記録手", almost never appears more than once in a single text. The method above can therefore detect most errors in compound words. This concludes the description corresponding to claims 2 and 3.
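The counting and dictionary check for claims 2 and 3 can be sketched as below. The function name and the sample data are hypothetical; the input is assumed to be the list of noun-like partial character strings already extracted from the text, and the dictionary is modeled as a plain set of registered words.

```python
from collections import Counter


def detect_errors(substrings, dictionary=None):
    """Return partial character strings that occur exactly once.

    With dictionary=None this follows claim 2 (every singleton is reported).
    With a dictionary it follows claim 3 (registered singletons are excluded).
    """
    counts = Counter(substrings)
    candidates = [s for s, n in counts.items() if n == 1]
    if dictionary is None:
        return candidates
    return [s for s in candidates if s not in dictionary]


# Hypothetical extracted substrings and dictionary, for illustration only.
extracted = ["世界新記録", "世界新記録", "世界新記録手", "大会"]
registered = {"大会", "世界", "記録"}

print(detect_errors(extracted))              # claim 2: all singletons
print(detect_errors(extracted, registered))  # claim 3: unregistered singletons only
```

In this toy data the ordinary noun "大会" is a singleton but is registered, so only the mis-converted compound remains once the dictionary check is applied.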

[Effects]

As is clear from the above description, the present invention has the following effects.

(1) Effects corresponding to claim 1: Only portions that may be wrong are pointed out, so the burden on the user is small. Correct spellings need not be registered in advance, so there are few spurious (false) error detections for katakana words. In addition, because no grammatical information is used, portions that are syntactically correct but actually wrong can also be detected.

(2) Effects corresponding to claim 2: Only portions that may be wrong are pointed out, so the burden on the user is small. Because grammatical information is not used directly for error detection, portions that are syntactically correct but actually wrong can also be detected. In addition, because character type information is not used, errors can be detected regardless of character type.

(3) Effects corresponding to claim 3: Because the device checks whether the spelling is a correct (registered) one, there are few spurious (false) error detections.

[Brief Description of the Drawings]

FIG. 1 is a block diagram for explaining an embodiment of the Japanese sentence error word detection device according to the present invention; FIG. 2 is a flowchart for explaining the operation of the device according to the present invention; and FIG. 3 is another flowchart for explaining the operation of the device.

1: sentence dividing means; 2: partial character string counting means; 3: error detection means.

Claims

[Scope of Claims]

1. A Japanese sentence error word detection device that detects error words in Japanese text represented electronically as character codes, comprising: a sentence dividing means that divides the text into partial character strings at character type change points; a partial character string counting means that counts the number of occurrences of each katakana partial character string; and an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one.

2. A Japanese sentence error word detection device that detects error words in Japanese text represented electronically as character codes, comprising: a sentence dividing means that performs morphological analysis on the text and divides it into words; a partial character string counting means that counts the number of occurrences of each word string corresponding to a noun; and an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one.

3. The Japanese sentence error word detection device according to claim 2, having an error detection means that detects, as an error word, any partial character string whose number of occurrences counted by the partial character string counting means is one and which is not registered in the word dictionary.