JP2000259625A

JP2000259625A - Document calibration device

Info

Publication number: JP2000259625A
Application number: JP11063657A
Authority: JP
Inventors: Jun Ibuki; 潤伊吹; Fumito Nishino; 文人西野
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-03-10
Filing date: 1999-03-10
Publication date: 2000-09-22
Anticipated expiration: 2019-03-10
Also published as: JP3919968B2

Abstract

PROBLEM TO BE SOLVED: To automatically estimate the erroneous portions of a text by exploiting a text whose error tendency is consistent. SOLUTION: An error candidate detection part 1 holds error candidate word group knowledge 1a and detects error candidate words in an input text. A co-occurrence information extraction part 2 extracts various kinds of co-occurrence information for the error candidate words from the input text. A control part 3 controls the detection part 1 and the extraction part 2 so that co-occurrence information is extracted over the entire text. A statistical information totalization part 4 accumulates the extracted co-occurrence data and totals the appearance patterns of each target word across the whole document. An error judgment part 5 judges, on the basis of the co-occurrence information, whether each extracted error candidate is correct or erroneous. The knowledge of the error candidate word groups can also be corrected by providing an error statistics accumulation part 6 that accumulates statistical information about errors, or a similarity evaluation part 8 that evaluates the similarity of two words in an error word group.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document proofreading technique for a document processing apparatus. The technique automatically points out erroneous portions in document data that a user has entered or obtained in electronic form, thereby reducing the user's proofreading work and greatly improving the efficiency of document proofreading.

[0002]

[Prior Art] In general, for errors such as homophone errors, where correctness cannot be judged from the word itself (context-dependent errors), correctness must be judged from the context in which the word appears. Errors of this kind are commonly detected using knowledge about co-occurrence between words. The most basic known system reports an error when the text matches a pre-registered error pattern of co-occurrence data.

[0003] For example, consider the homophones "保障" and "保証" (both read "hoshō"; roughly "security" and "guarantee"). The term "安全保障" ("security", as in national security) is a correct and commonly used expression, whereas "安全保証" is rarely used. One method therefore registers the co-occurrence error pattern "保証 ←→ 安全" in advance and detects errors in a document by finding this pattern. Similarly, there are systems that store correct co-occurrence data for homophones in advance and judge whether an error exists by comparing the co-occurrence data extracted from the text with the co-occurrence data in the database.
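The pattern-matching approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is assumed to be already split into word tokens, co-occurrence is reduced to "adjacent token", and the names `ERROR_PATTERNS` and `find_registered_errors` are hypothetical, not taken from the patent.

```python
# Hypothetical registry of co-occurrence error patterns:
# (error candidate word, co-occurring word). The single entry mirrors the
# "保証 ←→ 安全" pattern mentioned in the text.
ERROR_PATTERNS = {("保証", "安全")}


def find_registered_errors(tokens, patterns=ERROR_PATTERNS):
    """Return (index, word, neighbour) for every token whose adjacent
    token forms a registered error pattern with it.

    Assumption: "co-occurrence" is simplified to direct adjacency.
    """
    hits = []
    for i, word in enumerate(tokens):
        neighbours = []
        if i > 0:
            neighbours.append(tokens[i - 1])
        if i + 1 < len(tokens):
            neighbours.append(tokens[i + 1])
        for neighbour in neighbours:
            if (word, neighbour) in patterns:
                hits.append((i, word, neighbour))
    return hits


flagged = find_registered_errors(["安全", "保証", "条約"])
clean = find_registered_errors(["安全", "保障", "条約"])
```

The weakness the patent goes on to discuss is visible here: every pattern has to be registered by hand before anything can be detected.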

[0004]

[Problems to Be Solved by the Invention] As described above, checking for context-dependent errors requires either paired data of errors and their correct forms or an accumulation of correct data, and covering the enormous variety of errors that can actually occur requires a large amount of co-occurrence data. Preparing such data by hand is difficult. There have been attempts to extract it automatically from a corpus (a database collecting texts of the same kind), but ordinary corpora contain a certain number of errors, so the extracted data cannot be used as is. Avoiding this problem has required laborious manual checking.

[0005] The present invention has been made in view of the above circumstances. Its object is to make it possible, without accumulating co-occurrence data in advance, to estimate erroneous portions automatically by using a text whose error tendency is consistent, and to prepare automatically the co-occurrence data used by conventional error detection systems.

[0006]

[Means for Solving the Problems] FIG. 1 shows the principle configuration of the present invention. In the figure, reference numeral 1 denotes an error candidate detection unit that holds knowledge 1a of error candidate word groups, that is, groups of words that are easily confused with one another, and detects those words in the input text. Reference numeral 2 denotes a co-occurrence information extraction unit that extracts various kinds of co-occurrence information from the text for each error candidate word. Reference numeral 3 denotes a control unit, which controls which part of the text the error candidate detection unit 1 and the co-occurrence information extraction unit 2 are applied to, so that co-occurrence information for the target error candidates is extracted over the entire text. Reference numeral 4 denotes a statistical information totaling unit that accumulates and totals the co-occurrence information extracted by the co-occurrence information extraction unit 2. It accumulates the extracted co-occurrence data, computes statistical data, and totals the appearance patterns of each target word across the whole document.

[0007] Reference numeral 5 denotes an error determination unit that judges whether each error candidate extracted on the basis of the co-occurrence information is correct or erroneous. It holds expected values for statistical information such as the probability that a particular pair of words in a word group is confused. It compares these expected values with the same kind of statistical information computed from the actually observed co-occurrence data accumulated and totaled by the statistical information totaling unit, and judges the co-occurrence data to be erroneous when the similarity is high. Reference numeral 6 denotes an error statistics accumulation unit that accumulates statistical information, such as frequencies, about the errors that were output, and 7 denotes an error candidate correction unit that corrects the knowledge of the error candidate word groups by referring to this information. Reference numeral 8 denotes a similarity evaluation unit that uses the extracted co-occurrence statistics to evaluate, for every pair in an error word group, the similarity of the two words in the pair. Reference numeral 9 denotes an error candidate correction unit that corrects the knowledge of the error candidate word groups on the basis of the semantic similarity evaluated, using the co-occurrence statistics, for all combinations in the error word group.

[0008] In the present invention, as described above, co-occurrence data is extracted and accumulated, the appearance patterns of each target word across the whole document are totaled, and the error determination unit judges errors on that basis. Erroneous portions of a text can therefore be estimated automatically without accumulating co-occurrence data in advance, and co-occurrence data can be prepared automatically. Providing the error statistics accumulation unit 6 and the error candidate correction unit 7 allows the knowledge 1a of the error candidate word groups to be corrected so that errors which actually occurred are rated more highly, which improves the accuracy of error candidate extraction. Providing the similarity evaluation unit 8 and the error candidate correction unit 9 allows the error-likelihood information in the knowledge 1a to be corrected, which likewise improves the accuracy of error candidate extraction.

[0009]

[Embodiments of the Invention] FIG. 2 shows a configuration example of a system for performing the document proofreading process of the present invention. In the figure, 101 is an input/output device consisting of a display device such as a CRT or liquid crystal display and input devices, such as a keyboard and mouse, for entering characters, symbols, commands, and the like; 102 is a CPU; 103 is a memory consisting of ROM, RAM, and the like; 104 is an external storage device that stores programs, data, and the like; 105 is a medium reader that accesses portable storage media such as floppy disks and CD-ROMs to read and write data; and 106 is a communication interface including a modem for data communication over a telephone line, a network card for data communication over a network such as a LAN, and the like. The external storage device 104 stores the program that performs the document proofreading process of the present invention, the error candidate word group knowledge, and the like, and also stores the co-occurrence data, statistical information, and the like obtained during the proofreading process. The various texts targeted by the present invention are read from a CD-ROM or the like via the medium reader 105, or collected from a network via the communication interface 106.

[0010] The first to third embodiments of the present invention are described below with reference to FIGS. 3 to 12. FIG. 3 shows the functional configuration of the system of the first embodiment. In the figure, 11 is an error candidate detection unit, and 12 is error candidate word group knowledge that stores error candidate words consisting of homophones such as the "保障" and "保証" mentioned above. The error candidate detection unit 11 detects error candidates in an input partial text on the basis of the error candidate words stored in advance in the error candidate word group knowledge 12. Reference numeral 13 denotes a co-occurrence data extraction unit, which extracts co-occurrence data by detecting, in the partial text, words that co-occur with the error candidate (for example, "安全" for the "保証" mentioned above). The co-occurrence data extraction unit 13 is the same as those commonly used in existing frameworks.

[0011] Reference numeral 14 denotes a control unit, which performs control so that the error candidate detection unit 11 and the co-occurrence data extraction unit 13 are applied to the entire text. That is, it cuts a partial text (for example, a sentence or a paragraph) out of the whole text and applies the error candidate detection unit 11 and the co-occurrence data extraction unit 13 to it to detect error candidates and extract co-occurrence data. It then does the same for the next partial text, and repeats this over the entire text. Co-occurrence data for the error candidates is thereby extracted from the whole text. Reference numeral 15 denotes a statistical information totaling unit, which accumulates the co-occurrence data extracted by the co-occurrence data extraction unit 13 and performs statistical processing for each type of co-occurrence data. Reference numeral 16 denotes an error determination unit, which receives the statistical information on the co-occurrence data as input, judges each item of co-occurrence data to be correct or erroneous, and finally outputs the result as the correctness of the word.

[0012] FIG. 4 is a flowchart showing an example of the control performed by the control unit 14. In this example, sentences are cut out of the target text in order, and co-occurrence data is extracted sentence by sentence until the text is exhausted. In step S1, a sentence is cut out of the text, and in step S2 the error candidate detection unit 11 detects error candidates. When an error candidate is detected, the sentence is input to the co-occurrence data extraction unit 13 in step S3, and co-occurrence data for the error candidate is extracted. In step S4, the extracted co-occurrence data is input to the statistical information totaling unit 15, which counts the occurrences of the co-occurrence data. In step S5, it is determined whether the entire text has been processed. If so, the process ends; if not, the process returns to step S1 and repeats.
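The loop of steps S1 to S5 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: sentences are assumed to arrive already tokenized, co-occurrence extraction is simplified to "the token immediately after the candidate", and all names (`CANDIDATE_GROUPS`, `detect_candidates`, `extract_cooccurrences`, `tally_text`) are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the error candidate word group knowledge:
# each set is one group of easily confused words.
CANDIDATE_GROUPS = [{"運航", "運行"}, {"指示", "支持"}]


def detect_candidates(sentence, groups=CANDIDATE_GROUPS):
    """Step S2: return the positions of tokens found in any candidate group."""
    known = set().union(*groups)
    return [i for i, token in enumerate(sentence) if token in known]


def extract_cooccurrences(sentence, positions):
    """Step S3 (simplified assumption): pair each candidate with the next token."""
    return [(sentence[i], sentence[i + 1])
            for i in positions if i + 1 < len(sentence)]


def tally_text(sentences):
    """Steps S1 to S5: walk the text sentence by sentence and count
    how often each (candidate, co-occurring word) pair appears."""
    counts = Counter()
    for sentence in sentences:                     # S1, looped via S5
        positions = detect_candidates(sentence)    # S2
        if positions:
            pairs = extract_cooccurrences(sentence, positions)  # S3
            counts.update(pairs)                   # S4
    return counts


text = [
    ["運航", "を再開"],
    ["運行", "を再開"],
    ["運航", "を再開"],
    ["天気", "が", "良い"],   # no candidate word, so nothing is extracted
]
counts = tally_text(text)
```

A real extractor would rely on morphological analysis and richer co-occurrence relations; adjacency is used here only to keep the control flow visible.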

[0013] FIG. 5 shows an example of co-occurrence statistics extracted and totaled as described above. This example assumes two error candidate groups, "運航" and "運行" (both read "unkō"; operation of ships or aircraft, and of trains or buses, respectively) and "指示" and "支持" (both read "shiji"; "instruction" and "support"), and totals the occurrences of co-occurrence data for these word groups by type of co-occurrence data. As shown in the figure, "を再開" ("resume") is extracted as a co-occurring word for the error candidates "運行" and "運航", and "を表明" ("express") is extracted as a co-occurring word for the error candidates "指示" and "支持". The totaled frequencies of the respective co-occurrence data were "60", "17", "519", and "1".

[0014] FIG. 6 is a flowchart showing the processing in the error determination unit 16 shown in FIG. 3. The error determination unit 16 judges errors as follows. In step S1, the totals produced by the statistical information totaling unit 15 are read. In step S2, the error candidate word group knowledge 12 is used to gather easily confused words into groups. In the example of FIG. 5, "運行" and "運航" form one group and "指示" and "支持" form another. In step S3 one of the groups is selected, and in step S4 one word is selected from the group. For example, the group "運行"/"運航" is selected, and from it the word "運行". In step S5, the occurrence frequency of the selected word and its ratio to the whole error word group (relative ratio) are calculated. Other statistical processing is also conceivable, such as computing a t-test value within the error word group.

[0015] In step S6, the frequency and relative ratio are compared with predetermined expected values to judge errors. For example, in the example of FIG. 3, "指示" ←→ "表明" has a low frequency and a small relative ratio, so it is judged to be an error. In step S7, it is checked whether all the words in the group have been selected; if not, the process returns to step S4 and repeats. If all the words in the group have been selected, the process proceeds to step S8, where it is checked whether all the groups have been selected. If so, the process ends; if not, the process returns to step S3, selects the next group, and repeats. In the first embodiment of the present invention, co-occurrence data is extracted and errors are judged as described above, so erroneous portions of a text can be estimated automatically without accumulating co-occurrence data in advance. Co-occurrence data can also be prepared automatically.
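The frequency and relative-ratio test of steps S5 and S6 might look like the sketch below. The threshold values, the function name `judge_group`, and the exact decision rule (both frequency and ratio must fall below their limits) are assumptions made for illustration; the patent only says that the values are compared with predetermined expected values. The counts 519 and 1 come from the FIG. 5 example, and assigning 1 to "指示" follows the statement that "指示" ←→ "表明" has a low frequency.

```python
def judge_group(counts, min_frequency=5, min_ratio=0.05):
    """Judge each word in one error candidate group for one co-occurrence type.

    counts: {word: frequency with which the word appeared with the
             co-occurring expression}.
    min_frequency, min_ratio: illustrative thresholds (assumptions).
    Returns {word: "error" or "ok"}.
    """
    total = sum(counts.values())
    verdicts = {}
    for word, freq in counts.items():
        # Relative ratio: share of the whole group's occurrences (step S5).
        ratio = freq / total if total else 0.0
        # Step S6: flag words that are both rare and a small share of the group.
        is_error = freq < min_frequency and ratio < min_ratio
        verdicts[word] = "error" if is_error else "ok"
    return verdicts


# Co-occurrence with "を表明" in the FIG. 5 example.
verdicts = judge_group({"支持": 519, "指示": 1})
```

With these numbers the relative ratio of "指示" is 1/520, so it is flagged, while "支持" is accepted.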

[0016] FIG. 7 shows the functional configuration of the system of the second embodiment of the present invention. Components 11 to 16 are the same as those shown in FIG. 3; in this embodiment, an error statistics totaling unit 21 and an error candidate correction unit 22 are added. Each time proofreading of a text finishes, the error statistics totaling unit 21 collects the errors detected by the error determination unit 16 and totals the error frequency for each target word. This information is sent to the error candidate correction unit 22, which corrects the knowledge so that errors which actually occurred are rated more highly, improving the accuracy of error candidate extraction. For example, words in an error word group that were never actually mistaken for another word are deleted from the error candidate word knowledge 12, which improves the accuracy of error candidate extraction.

[0017] An example of the processing algorithm in the error candidate correction unit 22 is shown below. Here, in error word groups for which a certain amount of data has been collected, words that were never actually mistaken for another word are deleted.

<Example of processing algorithm in the error candidate correction unit>
Select a particular error group and repeat the following.
1. If the number of errors in the whole error group is less than the threshold, stop.
2. Select a particular word in the error group and repeat the following.
   i. If the word has no confusion errors at all with the other words in the same group, delete it from the error word group.

[0018] FIG. 8 shows the above algorithm as a flowchart. In step S1, a particular error group is selected from the error statistics data of the error statistics totaling unit 21. In step S2, it is checked whether the number of co-occurrence data items in the whole error group is at least the threshold. If it is below the threshold, the process returns to step S1. If it is at or above the threshold, the process proceeds to step S3, where a particular word in the error group is selected, and in step S4 it is checked whether the word has no confusion errors at all with the other words in the same group. If there are confusion errors, the process returns to step S3 and repeats. If there are none, the selected word is deleted from the error candidates of the error candidate word knowledge 12 in step S5. In step S6, it is checked whether all the words in the error group have been selected; if not, the process returns to step S3 and repeats. If all have been selected, the process proceeds to step S7, where it is checked whether all the error candidate groups have been selected. If not, the process returns to step S1 and repeats; if so, the process ends.

[0019] FIG. 9(a) shows an example of the error candidate information totaled by the error statistics totaling unit 21. This example assumes "話す", "放す", and "離す" (all read "hanasu"; "speak", "release", and "separate") as the error candidate information. After error detection in the text was performed for these words, the totals in the error statistics totaling unit were "5", "0", and "0", as shown in the figure. The error statistics totaling unit 21 simply counts the number of cases. FIG. 9(b) shows an example of the processing in the error candidate correction unit 22 for the error statistics data example of FIG. 9(a). In this example, there were no confusion errors at all for the error candidate word "話す", so "話す" is deleted from the error word group as shown in FIG. 9(b). As a result, the error word group information in the error candidate word knowledge 12 is corrected from FIG. 9(c) to FIG. 9(d).
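The pruning rule of the second embodiment can be sketched as follows. Several points are assumptions: the text does not spell out how the three totals "5", "0", and "0" map onto words or word pairs, so the sketch assumes, purely for illustration, that 5 is the number of confusions observed between "放す" and "離す" (which is consistent with "話す" being the word that is removed). The function name `prune_unconfused_words`, the pair-keyed data layout, and the threshold value are likewise hypothetical.

```python
def prune_unconfused_words(group, confusion_counts, min_errors=3):
    """Drop words that were never confused with another word in the group.

    group: list of words in one error candidate group.
    confusion_counts: {frozenset({word_a, word_b}): observed confusions}
                      (assumed layout; not specified in the source).
    min_errors: illustrative threshold on the group's total error count.
    """
    total = sum(confusion_counts.values())
    if total < min_errors:
        # Too little data collected for this group: leave it unchanged.
        return list(group)
    kept = []
    for word in group:
        confused = any(count > 0 and word in pair
                       for pair, count in confusion_counts.items())
        if confused:
            kept.append(word)
    return kept


group = ["話す", "放す", "離す"]
confusions = {
    frozenset({"放す", "離す"}): 5,   # assumed reading of the "5" in FIG. 9(a)
    frozenset({"話す", "放す"}): 0,
    frozenset({"話す", "離す"}): 0,
}
pruned = prune_unconfused_words(group, confusions)
```

Under this reading the group shrinks from three words to "放す" and "離す", matching the change from FIG. 9(c) to FIG. 9(d) described in the text.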

[0020] In the second embodiment of the present invention, as in the first embodiment, erroneous portions of a text can be estimated without accumulating co-occurrence data in advance. In addition, because words with no confusion errors at all are automatically deleted from the error candidate word knowledge 12 as described above, the accuracy of error candidate extraction can be improved. It also becomes possible to prepare co-occurrence data for words that are easily mistaken.

[0021] FIG. 10 shows the functional configuration of the system of the third embodiment of the present invention. Components 11 to 16 are the same as those shown in FIG. 3; in this embodiment, a similarity evaluation unit 31 and an error candidate correction unit 32 are added. The similarity evaluation unit 31 receives from the statistical information totaling unit 15 the co-occurrence statistics together with the total number of occurrences of each word. The error candidate correction unit 32 evaluates the degree of overlap of the co-occurrence statistics for any two words in any group of easily confused words, and corrects the error-likelihood information in the error candidate word knowledge 12 accordingly.

[0022] An example of the processing algorithm in the similarity evaluation unit 31 and the error candidate correction unit 32 is shown below. Here, in error word groups for which a certain amount of data has been collected, words that were never actually mistaken for another word are deleted.

<Example of processing algorithm in the similarity evaluation unit and the error candidate correction unit>
Select a particular error group and repeat the following.
1. If the number of co-occurrence data items in the whole error group is less than the threshold, stop.
2. Select a particular word in the error group and repeat the following.
   i) Evaluate its similarity to every other word in the same group as follows:
      - Compare the co-occurrence data of the two words and compute the proportion of the word's own co-occurrence data types that are shared by both words.
   ii) If the similarity to every other word is below a fixed threshold, delete the target word from the error word group.

[0023] FIG. 11 shows the above algorithm as a flowchart. In step S1, a particular error group is selected from the statistical information totaling unit 15. In step S2, it is checked whether the number of co-occurrence data items in the whole error group is at least the threshold; if it is below the threshold, the process returns to step S1. If it is at or above the threshold, the process proceeds to step S3, where a particular word in the error group is selected. In step S4, the word's co-occurrence data is compared with that of every other word in the same group, and the similarity is obtained by computing the proportion of the word's own co-occurrence data types that are shared by the two words.

[0024] In step S5, it is checked whether the similarity to every other word is below the threshold. If so, the process proceeds to step S6, where the target word is deleted from the error candidates of the error candidate word knowledge 12. If the similarity is at or above the threshold, the process returns to step S3 and repeats. Then, in step S7, it is checked whether all the words in the error group have been selected; if not, the process returns to step S3 and repeats. If all the words in the error group have been selected, the process proceeds to step S8, where it is checked whether all the error candidate groups have been selected. If not, the process returns to step S1 and repeats; if so, the process ends.

[0025] FIG. 12 shows an example of correcting the error candidate information in this embodiment. In this example, "映す", "写す", and "移す" (all read "utsusu"; roughly "project", "copy or photograph", and "move") are defined as the error candidate information, as shown in FIG. 12(a). Suppose that proofreading is performed and the statistical information totaling unit 15 extracts the co-occurrence statistics shown in FIG. 12(b). Following the algorithm described above, the similarity evaluation unit 31 first compares the co-occurrence data and obtains the number of co-occurrence data types shared by each pair of words. In this example, the words that co-occur with "写す" are "光" (light), "姿" (figure), and "写真" (photograph), and the words that co-occur with "映す" are "光", "姿", and "映画" (movie). "写す" and "映す" therefore share "光" and "姿", so their number of shared co-occurrence data types is "2", as shown in FIG. 12(c). By contrast, "写す" and "移す", and "映す" and "移す", have "0" shared co-occurrence data types.

[0026] Next, as described above, the similarity is obtained by calculating the ratio of the number of common co-occurrence data types to the total number of co-occurrence data types that the word itself has. In this example, “写す” has three types of co-occurrence data, namely “光”, “姿”, and “写真”, as noted above. Therefore, as shown in FIG. 12(d), the similarity of “写す” to “映す” is 2/3. Likewise, the similarity of “映す” to “写す” is 2/3. The similarity of “写す” to “移す” and the similarity of “映す” to “移す” are both calculated as 0/3, and the similarity of “移す” to “写す” and to “映す” is calculated as 0/2.
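The ratio described in this paragraph can be written compactly. The notation is an assumption introduced here for illustration only: $C(w)$ denotes the set of co-occurrence data types observed for word $w$, and $\mathrm{sim}(w_i \to w_j)$ denotes the similarity of $w_i$ to $w_j$.

```latex
\[
\mathrm{sim}(w_i \to w_j) = \frac{\lvert C(w_i) \cap C(w_j) \rvert}{\lvert C(w_i) \rvert}
\]
\[
\mathrm{sim}(\text{写す} \to \text{映す}) = \frac{2}{3}, \qquad
\mathrm{sim}(\text{写す} \to \text{移す}) = \frac{0}{3}, \qquad
\mathrm{sim}(\text{移す} \to \text{写す}) = \frac{0}{2}
\]
```

Because the denominator is the number of types held by the first word, the measure is not symmetric in general: the values 0/3 and 0/2 in the example have different denominators even though both equal zero.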

[0027] As is clear from FIG. 12(d), the similarity of “移す” to both “写す” and “映す” is 0. Since it was never actually confused with the other words, it becomes a deletion target. In this example, therefore, the error candidate correction unit 32 deletes “移す” from the error candidate word knowledge 12. In this embodiment, as in the second embodiment, erroneous portions in a text can be estimated without accumulating co-occurrence data in advance, and the error candidate word knowledge 12 can be corrected automatically, which improves the accuracy of the error candidate extraction process. It also becomes possible to build up co-occurrence data for words that are easily mistaken.
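The worked example can be reproduced end to end in a few lines. This is a sketch under stated assumptions rather than the method as implemented in the device: the names `similarity`, `deletion_targets`, and `THRESHOLD` are hypothetical, the threshold value of 1/3 is assumed because the text does not state one, and the two co-occurrence entries for “移す” are placeholders that only reflect the 0/2 figure given above.

```python
from fractions import Fraction

# Co-occurrence types per candidate word, following the worked example.
# The 移す entries are hypothetical placeholders, not data from the text.
cooccurrence = {
    "写す": {"光", "姿", "写真"},
    "映す": {"光", "姿", "映画"},
    "移す": {"placeholder_1", "placeholder_2"},
}


def similarity(word: str, other: str) -> Fraction:
    """Share of `word`'s own co-occurrence types that `other` also has."""
    own = cooccurrence[word]
    if not own:
        return Fraction(0)  # no data for the word: treat similarity as zero
    return Fraction(len(own & cooccurrence[other]), len(own))


def deletion_targets(threshold: Fraction) -> list:
    """Words whose similarity to every other candidate is below the threshold."""
    words = list(cooccurrence)
    return [
        w for w in words
        if all(similarity(w, o) < threshold for o in words if o != w)
    ]


THRESHOLD = Fraction(1, 3)  # assumed value; the text does not give one
targets = deletion_targets(THRESHOLD)
remaining = [w for w in cooccurrence if w not in targets]
```

Using `Fraction` keeps the values in the same form as FIG. 12(d), for example 2/3 for “写す” against “映す”, and avoids floating-point comparisons against the threshold.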

[0028]

[Effects of the Invention] As described above, according to the present invention, erroneous portions can be estimated automatically, without accumulating co-occurrence data in advance, by preparing a large amount of text in which the tendency of errors to occur is consistent. In addition, since co-occurrence data can be extracted from the text and error-prone co-occurrence data can be estimated and accumulated, this system can also be used to build up the co-occurrence data used by conventional error detection systems.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a diagram showing a configuration example of a system for performing the document proofreading process of the present invention.

FIG. 3 is a block diagram showing the system configuration of a first embodiment of the present invention.

FIG. 4 is a flowchart showing an example of control processing by the control unit.

FIG. 5 is a diagram showing an example of the co-occurrence statistics extracted and totaled in the first embodiment.

FIG. 6 is a flowchart showing processing in the error determination unit.

FIG. 7 is a block diagram showing the system configuration of a second embodiment of the present invention.

FIG. 8 is a flowchart showing processing in the error candidate correction unit.

FIG. 9 is a diagram showing an example of processing results in the second embodiment of the present invention.

FIG. 10 is a block diagram showing the system configuration of a third embodiment of the present invention.

FIG. 11 is a flowchart showing processing in the similarity evaluation unit and the error candidate correction unit.

FIG. 12 is a diagram showing an example of correcting the error candidate information in the third embodiment.

[Explanation of symbols]

1 Error candidate detection unit
1a Knowledge of error candidate word group
2 Co-occurrence information extraction unit
3 Control unit
4 Statistical information totaling unit
5 Error determination unit
6 Error statistics accumulation unit
7 Error candidate correction unit
8 Similarity evaluation unit
9 Error candidate correction unit
11 Error candidate detection unit
12 Error word group knowledge
13 Co-occurrence data extraction unit
14 Control unit
15 Statistical information totaling unit
16 Error determination unit
21 Error statistics totaling unit
22 Error candidate correction unit
31 Similarity evaluation unit
32 Error candidate correction unit

Claims

[Claims]

1. A document proofreading device comprising: an error candidate detection unit that detects error candidate words in a text based on knowledge of an error candidate word group, which is a group of words that are easily confused with each other; a co-occurrence information extraction unit that extracts co-occurrence information in the text for the error candidate words; a statistical information totaling unit that accumulates and totals the co-occurrence information; and an error determination unit that determines whether the co-occurrence information is correct based on the statistical information of the co-occurrence information.

2. A document proofreading device comprising: an error candidate detection unit that has knowledge of an error candidate word group, which is a group of words that are easily confused with each other, and detects those words in an input text; a co-occurrence information extraction unit that extracts the co-occurrence information; a control unit that controls which part of the text the error candidate detection unit and the co-occurrence information extraction unit are applied to; a statistical information totaling unit that totals the extracted co-occurrence information; and an error determination unit that judges whether the error candidates extracted on the basis of the co-occurrence information are correct, wherein the control unit controls the error candidate detection unit and the co-occurrence information extraction unit so that co-occurrence information over the entire text is extracted for the target error candidates, the statistical information totaling unit accumulates the extracted co-occurrence information and totals the appearance patterns as statistical data, and the error determination unit determines correctness in units of co-occurrence information based on the statistical information.

3. The document proofreading device according to claim 2, wherein the error determination unit holds expected values for statistical information, such as the probability that a confusion error occurs for a specific word pair in the word group, compares statistical information of the same type, calculated from the co-occurrence data actually observed and totaled by the statistical information totaling unit, with the expected values to determine their similarity, and judges the co-occurrence data to be an error when the similarity is high.

4. The document proofreading device according to claim 2, further comprising: an error statistics accumulation unit that accumulates statistical information such as the frequency of output errors; and an error candidate correction unit that, after the proofreading process has been performed on a text, corrects the knowledge of the error candidate word group by referring to that information.

5. The document proofreading device according to claim 2 or 3, further comprising: a similarity evaluation unit that uses the extracted co-occurrence statistical information to evaluate, for all pairs in the error word group, the similarity of two words in the error word group; and an error candidate correction unit that, after the proofreading process has been performed on a text, evaluates the semantic similarity of all combinations in the error word group using the co-occurrence statistical information and corrects the knowledge of the error candidate word group based on that evaluation.

6. A recording medium storing a document proofreading program for causing a computer to execute a document proofreading process, wherein the document proofreading program detects error candidate words in a text based on knowledge of an error candidate word group, which is a group of words that are easily confused with each other, extracts co-occurrence information in the text for the error candidate words, and determines correctness in units of co-occurrence information based on the statistical information of the co-occurrence information.