JP2003196636A

JP2003196636A - Notation error detection processing method using machine learning method having teacher, its processing device and its processing program

Info

Publication number: JP2003196636A
Application number: JP2001393734A
Authority: JP
Inventors: Masaki Murata (村田真樹); Hitoshi Isahara (井佐原均)
Original assignee: Communications Research Laboratory
Current assignee: Communications Research Laboratory
Priority date: 2001-12-26
Filing date: 2001-12-26
Publication date: 2003-07-11
Anticipated expiration: 2021-12-26
Also published as: JP3692399B2

Abstract

PROBLEM TO BE SOLVED: To detect notation errors with high accuracy by using a supervised machine learning method.

SOLUTION: Positive example data D (correct notation) and negative example data E (incorrect notation) are stored in a teacher data storage unit 11 as teacher data. A solution-feature pair extraction unit 12 extracts pairs of a solution and a feature set from the positive example data D and the negative example data E in the teacher data storage unit 11. A machine learning unit 13 learns, by a machine learning method, which solution is likely for a given feature set and stores the result in a learning result data storage unit 14. A feature extraction unit 15 extracts a feature set from input data 2, and an error detection unit 16, referring to the learning result data storage unit 14, estimates from the feature set whether the notation is erroneous and outputs an estimation result 3.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to notation error detection processing, and more particularly to a notation error detection processing method using a supervised machine learning method, a processing device that implements the processing, and a program that causes a computer to execute the processing.

[0002]

[Prior Art] Detecting word notation errors in Japanese is far harder than in English. Because English is written with spaces between words, word spelling can be checked with fairly high accuracy basically by preparing a word dictionary and rules for word-final inflection. Japanese, by contrast, is not written with spaces between words, so high accuracy is difficult to achieve even when processing is limited to word notation errors.

[0003] Besides errors in word notation, notation errors also include grammatical errors such as misuse of the particles て (te), に (ni), を (wo), and は (wa).

[0004] The main conventional techniques for detecting Japanese notation errors are as follows.

[0005] Conventional methods that detect notation errors based on a word dictionary, a dictionary of registered hiragana sequences, or a dictionary describing connection conditions are described in References 1 to 3 below. These methods judge that a notation error exists when something appears that is not in the word dictionary or the hiragana-sequence dictionary, or when a connection appears that does not satisfy the dictionary of connection conditions.

[Reference 1: Kazuhiro Notomi, Development of hsp, a Japanese document proofreading support tool, IPSJ Research Presentation Meeting (Digital Document), (1997), pp.9-16]
[Reference 2: Kazuma Kawahara et al., A notation error detection method using a dictionary extracted from a corpus, IPSJ 54th National Convention, (1997), pp.2-21-2-22]
[Reference 3: Nobuyuki Shiraki et al., Building a Japanese spell checker by registering a large number of hiragana sequences, Annual Meeting of the Association for Natural Language Processing, (1997), pp.445-448]

Conventional methods that compute the occurrence probability of each character string from a probabilistic model using character-level n-grams, and judge a location where a low-probability character string appears to be a notation error, are described in References 4 to 6 below.

[Reference 4: Tetsuro Araki et al., Error detection and correction method for Japanese sentences using a second-order Markov model, IPSJ SIG on Natural Language Processing NL97-5, (1997), pp.29-35]
[Reference 5: Takaaki Matsuyama et al., Experiments and discussion on estimating precision and recall for examining the capability of n-gram-based OCR error detection, Annual Meeting of the Association for Natural Language Processing, (1996), pp.129-132]
[Reference 6: Koichi Takeuchi et al., Construction of an OCR error correction system using a statistical language model, IPSJ Journal, Vol.40, No.6, (1999)]

Among the above conventional methods, the method of Reference 5 that uses n-gram probabilities is used mainly for notation error detection in error correction systems for optical character readers (OCR). An OCR error correction system presupposes a notation error rate as high as 5 to 10%, which is higher than the rate at which people ordinarily make errors when writing. Recall and precision of notation error detection therefore tend to be high, which makes this a comparatively easy problem setting.

[0006] Of the above conventional methods, the method of Takeuchi et al., which appears to be the best, that is, the conventional method described in Reference 6 (hereinafter, conventional method A), is briefly described below.

[0007] In conventional method A, three-character sequences are first extracted from the text to be checked by sliding one character at a time from the beginning. When the occurrence probability of an extracted sequence in a corpus (a set of correct Japanese sentences) is Tp or less, -1 is added to each character of that three-character sequence, and a character whose accumulated score reaches Ts is judged to be an error (with the negative scores used here, a score of Ts or lower). For example, let Tp = 0 and Ts = -2. Because Tp = 0, there is no need to actually compute the occurrence probability; it is enough to check whether the three-character sequence appears in the corpus. With Tp > 0, an extracted sequence could be judged an error even though it appears in the corpus. A sequence that appears in the corpus, however low its probability, probably need not be treated as an error, so Tp > 0 is not appropriate and the setting Tp = 0 is considered good.

[0008] As a supplementary explanation of conventional method A, consider error detection on the Japanese expression 「負の事零の検出」. Three-character sequences such as 「負の事」 and 「の事零」 are cut out from the beginning of the expression, each is checked against the corpus, and if a sequence is not in the corpus, -1 is given to each of its three characters. In this case 「の事零」 and 「事零の」 were not in the corpus, so the trigram scores shown in FIG. 7 are assigned, and as a result the characters 「事」 and 「零」, which each score -2, are judged to be errors. Conventional method A thus detects errors by skilfully combining character 3-grams that appear frequently in the corpus.
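The scoring of conventional method A with Tp = 0 and Ts = -2 can be sketched roughly as follows. This is only an illustration of the description above, not the implementation of Reference 6; the function names and the two-sentence corpus are hypothetical stand-ins.

```python
def trigrams(text):
    """All three-character sequences, sliding one character at a time."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def score_characters(text, corpus, ts=-2):
    """Score each character as in conventional method A with Tp = 0.

    corpus: list of sentences assumed to be correct (hypothetical data).
    Returns (scores, flagged), where flagged holds the indices of
    characters whose accumulated score is ts or lower.
    """
    known = {g for sentence in corpus for g in trigrams(sentence)}
    scores = [0] * len(text)
    for i, gram in enumerate(trigrams(text)):
        if gram not in known:            # Tp = 0: a membership test suffices
            for j in range(i, i + 3):    # -1 to each character of the trigram
                scores[j] -= 1
    flagged = [i for i, s in enumerate(scores) if s <= ts]
    return scores, flagged


if __name__ == "__main__":
    corpus = ["負の事例の検出", "零の検出"]   # toy corpus, an assumption
    text = "負の事零の検出"
    scores, flagged = score_characters(text, corpus)
    print(scores)
    print([text[i] for i in flagged])
```

With this toy corpus the trigrams 「の事零」 and 「事零の」 are unseen, so only 「事」 and 「零」 accumulate a score of -2.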

[0009] In the end, however, conventional method A merely determines whether an expression exists in the corpus. That is, it closely resembles the other conventional methods above, which treat anything not found in the dictionary as an error.

[0010] As for machine learning, learning from positive examples alone is known to be difficult in general, as described in Reference 7 below.

[Reference 7: Takashi Yokomori et al., Learning of formal languages: focusing on learning from positive examples, IPSJ Magazine, Vol.32, No.3, (1991), pp.226-235]

Furthermore, incorrect notation data (negative examples) for use as a teacher signal are generally considered harder to obtain than correct notation data (positive examples).

[0011]

[Problems to Be Solved by the Invention] Conventionally, a processing method using machine learning with only positive examples as the teacher signal could not be expected to be highly accurate, and negative examples for use as the teacher signal were difficult to obtain. For these reasons, no processing method using machine learning with both positive and negative examples as teacher signals had been realized for notation error detection in text.

[0012] An object of the present invention is to realize highly accurate notation error detection by using a machine learning method that takes positive examples and negative examples as teacher signals.

[0013] Another object of the present invention is to realize highly accurate notation error detection by efficiently and automatically generating negative examples and using them as teacher signals for the machine learning method.

[0014]

[Means for Solving the Problems] To solve the above problems, the present invention is a processing method for detecting notation errors, comprising: a process of extracting feature-solution pairs from teacher data that include positive example data, which are correct notations, and negative example data, which are incorrect notations, performing machine learning with the feature-solution pairs as the teacher signal, and storing the learning result in a learning result data storage unit; and a process of extracting features from input data and detecting notation errors in the data based on the learning result.

[0015] The present invention is also a processing method for detecting notation errors, comprising: a process of calculating a general appearance probability of an input case when the case does not exist in positive example data prepared in advance as correct notations; a process of calculating, based on the general appearance probability, the probability that the case appears in the positive example data, and taking a case for which this probability exceeds a predetermined threshold as negative example data; a process of extracting feature-solution pairs from teacher data including the positive example data and the negative example data, performing machine learning with the feature-solution pairs as the teacher signal, and storing the learning result in a learning result data storage unit; and a process of extracting features from input data and detecting notation errors in the data based on the learning result.

[0016] Further, the present invention is a processing device for detecting notation errors, comprising: machine learning processing means that extracts feature-solution pairs from teacher data including positive example data, which are correct notations, and negative example data, which are incorrect notations, performs machine learning with the feature-solution pairs as the teacher signal, and stores the learning result in a learning result data storage unit; and error detection processing means that extracts features from input data and detects notation errors in the data based on the learning result.

[0017] The present invention is also a processing device for detecting notation errors, comprising: appearance probability calculation processing means that calculates a general appearance probability of an input case when the case does not exist in positive example data prepared in advance as correct notations; negative example acquisition processing means that calculates, based on the general appearance probability, the probability that the case appears in the positive example data, and takes a case for which this probability exceeds a predetermined threshold as negative example data; machine learning processing means that extracts feature-solution pairs from teacher data including the positive example data and the negative example data, performs machine learning with the feature-solution pairs as the teacher signal, and stores the learning result in a learning result data storage unit; and error detection processing means that extracts features from input data and detects notation errors in the data based on the learning result.

[0018] Further, the present invention is a program for causing a computer to execute processing that detects notation errors, the program causing the computer to execute: processing that extracts feature-solution pairs from teacher data including positive example data, which are correct notations, and negative example data, which are incorrect notations, performs machine learning with the feature-solution pairs as the teacher signal, and stores the learning result in a learning result data storage unit; and processing that extracts features from input data and detects notation errors in the data based on the learning result.

[0019] The present invention is also a program for causing a computer to execute processing that detects notation errors, the program causing the computer to execute: processing that calculates a general appearance probability of an input case when the case does not exist in positive example data prepared in advance as correct notations; processing that calculates, based on the general appearance probability, the probability that the case appears in the positive example data, and takes a case for which this probability exceeds a predetermined threshold as negative example data; processing that extracts feature-solution pairs from teacher data including the positive example data and the negative example data, performs machine learning with the feature-solution pairs as the teacher signal, and stores the learning result in a learning result data storage unit; and processing that extracts features from input data and detects notation errors in the data based on the learning result stored in the learning result data storage unit.

[0020] A program that implements each means, function, or element of the present invention on a computer can be stored on a suitable computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk. It is provided either recorded on such a medium or by transmission and reception over various communication networks via a communication interface.

[0021]

[Embodiments of the Invention] As a first embodiment of the present invention, processing that detects Japanese notation errors from the connection between adjacent characters is described below.

[0022] FIG. 1 shows a configuration example of the notation error detection device 1 in the first embodiment of the present invention.

[0023] The notation error detection device 1 has a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result data storage unit 14, a feature extraction unit 15, and an error detection unit 16.

[0024] The teacher data storage unit 11 is means for storing the data (teacher data) that serve as the teacher signal when the machine learning method is carried out. As teacher data, it stores cases with correct notation (positive examples) and cases with incorrect notation (negative examples). For the positive examples, a corpus, that is, a set of correct sentences, may be used, for example. Because negative examples are incorrect notations for which no general data exist, ones created manually in advance are used. Alternatively, they may be generated from the positive examples by the negative example prediction method described later.

[0025] The solution-feature pair extraction unit 12 is means for extracting, for each case of teacher data stored in the teacher data storage unit 11, a pair consisting of the solution of the case and its feature set.

[0026] The machine learning unit 13 is means for learning, by a machine learning method, which solution is likely for which features, from the pairs of solution and feature set extracted by the solution-feature pair extraction unit 12. The learning result is stored in the learning result data storage unit 14.

[0027] The feature extraction unit 15 is means for extracting a feature set from data 2, the target of notation error detection, and passing the extracted feature set to the error detection unit 16.

[0028] The error detection unit 16 is means for referring to the learning result data in the learning result data storage unit 14, estimating which solution is likely for the feature set passed from the feature extraction unit 15, that is, whether there is a notation error, and outputting the estimation result 3.

[0029] FIG. 2 shows an example of the data structure of the teacher data storage unit 11. The teacher data storage unit 11 stores teacher data consisting of pairs of a problem and a solution. For example, each gap between characters in a sentence (denoted <|>) is treated as a problem, and teacher data are stored in which the solution for the connection at that gap (correct or error) is associated with it. Among the teacher data in FIG. 2, "problem-solution: 説明した方法で<|>を用いることができる - error" is an example of negative example data E, and "problem-solution: 説明した方法<|>でを用いることができる - correct" is an example of positive example data D.

[0030] FIG. 3 shows a flowchart of the notation error detection processing. It is assumed that the positive example data D and the negative example data E are stored in the teacher data storage unit 11 before the processing starts.

[0031] First, the solution-feature pair extraction unit 12 extracts, for each case, a pair of a solution and a feature set from the teacher data storage unit 11 (step S1). A feature is one fine-grained unit of information used in the analysis. The following are extracted as features for each character gap subject to the connection judgment.

[0032]
・the 1- to 5-gram character strings immediately preceding and immediately following the gap;
・the 1- to 5-gram character strings that include the target (gap), where the target gap (<|>) is treated as one character;
・the preceding and following words (words are extracted using, for example, processing means, not shown in FIG. 1, that performs existing morphological analysis);
・the parts of speech of the preceding and following words.

For example, when the "problem-solution" is "説明した方法で<|>を用いることができる - error", features such as those shown in FIG. 4 are extracted. That is, the following features are extracted.

[0033] Features: preceding 「した方法で」, preceding 「た方法で」, preceding 「方法で」, preceding 「法で」, preceding 「で」, following 「を用いるこ」, following 「を用いる」, following 「を用い」, following 「を用」, following 「を」, 「た方法で<|>」, 「方法で<|>を」, 「法で<|>を用」, 「で<|>を用い」, 「<|>を用いる」, preceding 「で」, following 「を」, preceding "particle", following "particle" (the last four being the preceding and following words and their parts of speech).

Next, the machine learning unit 13 learns, from the extracted pairs of solution and feature set, which solution is likely for which features, and stores the learning result in the learning result data storage unit 14 (step S2).
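The character n-gram features listed in paragraphs [0032] and [0033] could be generated along the following lines. This is a minimal sketch: the function name and the (kind, string) tuple encoding are assumptions, and the word and part-of-speech features are omitted because they require a morphological analyzer.

```python
GAP = "<|>"  # gap symbol, treated as a single character


def gap_features(text, gap, max_n=5):
    """Character n-gram features for the gap just before text[gap].

    Returns a list of (kind, string) tuples, where kind is "prev",
    "next", or "span" (an n-gram that contains the gap symbol).
    """
    left, right = text[:gap], text[gap:]
    feats = []
    for n in range(1, max_n + 1):
        if len(left) >= n:
            feats.append(("prev", left[-n:]))    # preceding n-gram
        if len(right) >= n:
            feats.append(("next", right[:n]))    # following n-gram
    # n-grams that include the gap; a 1-gram would be the bare gap
    # symbol and carry no information, so this sketch starts at n = 2.
    seq = list(left) + [GAP] + list(right)
    pos = len(left)
    for n in range(2, max_n + 1):
        for start in range(pos - n + 1, pos + 1):
            if start >= 0 and start + n <= len(seq):
                feats.append(("span", "".join(seq[start:start + n])))
    return feats


if __name__ == "__main__":
    # Gap between 「で」 and 「を」 in the example sentence of FIG. 2.
    for feature in gap_features("説明した方法でを用いることができる", 7):
        print(feature)
```

For the example sentence this produces, among others, preceding 「した方法で」, following 「を用いるこ」, and the gap-spanning 5-grams from 「た方法で<|>」 to 「<|>を用いる」.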

[0034] As the machine learning method, for example, the decision list method, the maximum entropy method, or the support vector machine method is used.

[0035] In the decision list method, pairs of a feature and a classification are used as rules and stored in a list in a predetermined priority order. When an input to be checked is given, the input data are compared with the features of the rules starting from the highest-priority rule, and the classification of the first rule whose feature matches is taken as the classification of the input.
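A decision list of this kind might be sketched as below. The text only says that the priority order is predetermined; ordering rules by an add-one smoothed precision, and falling back to the majority classification when no rule matches, are assumptions made for illustration, as are all names.

```python
from collections import Counter, defaultdict


def train_decision_list(examples):
    """examples: iterable of (features, label) pairs.

    Returns (rules, default): rules is a list of (score, feature, label)
    sorted from highest to lowest priority; default is the majority label.
    """
    counts = defaultdict(Counter)
    labels = Counter()
    for feats, label in examples:
        labels[label] += 1
        for f in set(feats):
            counts[f][label] += 1
    rules = []
    for f, c in counts.items():
        label, hits = c.most_common(1)[0]
        total = sum(c.values())
        score = (hits + 1) / (total + 2)   # assumed priority measure
        rules.append((score, f, label))
    rules.sort(key=lambda rule: -rule[0])
    return rules, labels.most_common(1)[0][0]


def classify(rules, default, feats):
    """Return the label of the first (highest-priority) matching rule."""
    feats = set(feats)
    for _, f, label in rules:
        if f in feats:
            return label
    return default


if __name__ == "__main__":
    data = [
        ({"prev:で", "next:を"}, "error"),
        ({"prev:で", "next:を"}, "error"),
        ({"prev:法", "next:で"}, "correct"),
        ({"prev:で", "next:き"}, "correct"),
        ({"prev:方", "next:法"}, "correct"),
    ]
    rules, default = train_decision_list(data)
    print(classify(rules, default, {"prev:で", "next:を"}))
```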

[0036] In the maximum entropy method, letting F be the set of predefined features f_j (1 ≤ j ≤ k), the probability distribution p(a, b) that maximizes an expression representing entropy while satisfying prescribed constraint equations is obtained. Among the probabilities of the classifications obtained from that distribution, the classification with the highest probability is taken as the result.
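This section does not spell out the constraint equations or the entropy expression. In the usual textbook formulation of the maximum entropy method, which is assumed here and may differ in detail from the one intended, they read as follows, where a ranges over the classifications A, b over the contexts B, g_j(a, b) is 1 when feature f_j holds for the pair (a, b) and 0 otherwise, and \tilde{p} is the empirical distribution observed in the teacher data:

```latex
\begin{align}
  \text{maximize}\quad
    & H(p) = -\sum_{a \in A,\, b \in B} p(a,b)\,\log p(a,b) \\
  \text{subject to}\quad
    & \sum_{a \in A,\, b \in B} p(a,b)\, g_j(a,b)
      = \sum_{a \in A,\, b \in B} \tilde{p}(a,b)\, g_j(a,b),
      \qquad 1 \le j \le k
\end{align}
```

Under this reading, the classification chosen for a context b is the a that maximizes p(a | b).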

[0037] The support vector machine method classifies data consisting of two classes by dividing the space with a hyperplane.
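The classification step of a linear support vector machine can be illustrated as below. The sketch shows only which side of a hyperplane w·x + b = 0 an input falls on; training (finding the maximum-margin hyperplane) is omitted, and the weights, bias, and feature names are hypothetical values rather than learned ones.

```python
def hyperplane_side(weights, bias, features):
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0.

    weights: dict mapping feature name to weight (sparse vector w).
    features: set of active binary features (the input vector x).
    """
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical weights: +1 stands for "correct", -1 for "error".
    w = {"prev:で": -0.4, "next:を": -0.8, "prev:法": 0.9}
    print(hyperplane_side(w, 0.1, {"prev:で", "next:を"}))
    print(hyperplane_side(w, 0.1, {"prev:法", "next:で"}))
```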

[0038] The decision list method and the maximum entropy method are described in Reference 8 below, and the support vector machine method is described in References 9 and 10 below.

[Reference 8: Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, Word sense disambiguation experiments using various machine learning methods, IEICE Technical Committee on Natural Language Understanding and Communication, NCL2001-2, (2001)]
[Reference 9: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000)]
[Reference 10: Taku Kudoh, Tinysvm: Support Vector machines, (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000)]

The machine learning unit 13 is not limited to the above methods; any supervised machine learning method can be used.

[0039] Thereafter, the feature extraction unit 15 receives data 2 for which a solution is sought (step S3), extracts a feature set from data 2 in much the same way as the solution-feature pair extraction unit 12, and passes it to the error detection unit 16 (step S4).

[0040] The error detection unit 16 identifies, based on the learning result data in the learning result data storage unit 14, which solution is likely for the feature set passed to it, and outputs the identified solution, that is, the estimation result 3 indicating whether there is a notation error (step S5).

[0041] For example, when the problem to be analyzed is the connection at the gap <|> and data 2 is "説明した方法で<|>を用いることができる", the estimation result 3 "error" is output.
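Steps S1 to S5 can be wired together as in the following sketch. To stay short it uses only one character on each side of the gap as features and a simple vote count as a placeholder learner; neither is one of the methods named above, and all function names and the toy teacher data are assumptions.

```python
from collections import Counter, defaultdict


def extract_features(text, gap):
    """S1 / S4 (simplified): one character on each side of the gap."""
    return {"prev:" + text[gap - 1:gap], "next:" + text[gap:gap + 1]}


def learn(teacher_data):
    """S2: count how often each feature co-occurs with each solution."""
    table = defaultdict(Counter)
    for text, gap, solution in teacher_data:
        for f in extract_features(text, gap):
            table[f][solution] += 1
    return table                      # plays the role of storage unit 14


def detect(table, text, gap):
    """S3 to S5: estimate the solution for one gap of the input data."""
    votes = Counter()
    for f in extract_features(text, gap):
        votes.update(table.get(f, {}))
    return votes.most_common(1)[0][0] if votes else "correct"


if __name__ == "__main__":
    sentence = "説明した方法でを用いることができる"
    teacher = [
        (sentence, 7, "error"),       # 説明した方法で<|>を用いる...
        (sentence, 6, "correct"),     # 説明した方法<|>でを用いる...
        ("この方法を用いる", 4, "correct"),
    ]
    table = learn(teacher)
    print(detect(table, sentence, 7))
```

Any supervised learner with the same train and predict roles could replace the vote table.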

[0042] Next, a second embodiment of the present invention is described.

[0043] The positive example data D in the teacher data storage unit 11 can be obtained relatively easily because a corpus or the like can be used. The negative example data E, however, cannot be obtained easily and are therefore created manually, and the burden of this work is heavy.

[0044] Moreover, since processing accuracy improves as the amount of teacher data grows, it is desirable to prepare as much teacher data as possible.

[0045] A method of predicting negative example data from a large amount of positive example data is therefore considered.

[0046] A simple way to predict negative examples from positive examples would be to treat everything that does not appear in the known positive example data as a negative example. In practice, however, positive examples that have simply not yet appeared are likely to exist, so this simple method would judge many unseen positive examples to be negative examples. Negative examples generated in this way cannot be used for high-accuracy processing.

[0047] For example, if a large existing corpus (a set of Japanese sentences) is assumed to be entirely correct, the corpus can be regarded as a set of correct sentences (positive examples), and negative examples can be generated automatically by a method that uses these positive examples to predict notation errors (negative examples).

[0048] This provides an abundance of negative examples for use as teacher data, reduces the burden of creating them, and makes it possible to realize high-accuracy notation error detection processing using a supervised machine learning method.

[0049] The notation error detection device 1 of this embodiment first calculates the general appearance probability p(x) of an unknown case x that is to be judged as a positive or negative example. Next, when it is unnatural, given this appearance probability p(x), that the case does not appear in the known positive example data D, that is, when the general appearance probability is high enough that the case would naturally be expected to appear in the positive example data D and yet it does not, the device infers that the case x has a high negative example degree, and takes any case x whose negative example degree exceeds a predetermined value as negative example data E. It then performs notation error detection processing by a machine learning method that uses the negative example data E and the positive example data D as teacher signals.

[0050] FIG. 5 shows a configuration example of the notation error detection device 1 according to the second embodiment of the present invention.

[0051] The notation error detection device 1 has a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a feature extraction unit 15, an error detection unit 16, an existence determination unit 21, an appearance probability estimation unit 22, a negative example degree calculation unit 23, a negative example acquisition unit 24, and a positive example data storage unit 25.

[0052] The teacher data storage unit 11, the solution-feature pair extraction unit 12, the machine learning unit 13, the feature extraction unit 15, and the error detection unit 16 are the same as the corresponding means of the notation error detection device 1 described in the first embodiment, so their description is omitted (see FIG. 1).

[0053] The existence determination unit 21 is means for determining whether a case x from the corpus 20, which is a set of Japanese sentences to which no positive or negative information has been attached, exists in the positive example data D stored in the positive example data storage unit 25.

[0054] The appearance probability estimation unit 22 is means for calculating the general appearance probability (frequency) p(x) of the case x when the case x does not exist in the positive example data storage unit 25.

[0055] The negative example degree calculation unit 23 is means for calculating the negative example degree Q(x) of the case x based on the appearance probability p(x).

[0056] The negative example acquisition unit 24 is means that, when the negative example degree Q(x) of the case x received from the negative example degree calculation unit 23 exceeds a predetermined value, treats that case x as negative example data E and stores it in the teacher data storage unit 11 as teacher data (negative example data E) in problem-solution form.

[0057] FIG. 6 shows a flowchart of the processing that acquires the negative example data used as learning data in the second embodiment.

[0058] The existence determination unit 21 of the notation error detection device 1 receives from the corpus 20 a sentence for which it is unknown whether it is a positive or negative example. Starting from the beginning of the sentence and moving one inter-character gap at a time, it treats each gap as the target of a concatenation check, extracts the character strings a of 1 to 5 grams immediately preceding the gap and the character strings b of 1 to 5 grams immediately following it, and generates the cases x = (a, b) from every such pair (step S11). Here, 25 cases (pairs) are generated, one for each of the five lengths of a combined with each of the five lengths of b.
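As an illustration only, the pair generation of step S11 might be sketched in Python as follows. The function name `gap_pairs`, its parameters, and the handling of gaps near the ends of a sentence (where fewer than five characters are available on one side) are assumptions introduced here; the specification does not state how such boundary gaps are treated.

```python
def gap_pairs(sentence: str, gap: int, max_n: int = 5) -> list[tuple[str, str]]:
    """Return the (a, b) pairs for the gap between sentence[gap-1] and sentence[gap].

    a is the 1..max_n-gram ending at the gap, b is the 1..max_n-gram starting at it.
    Near a sentence boundary fewer pairs are returned (an assumption of this sketch).
    """
    pairs = []
    for i in range(1, max_n + 1):
        if gap - i < 0:
            break  # not enough characters before the gap
        a = sentence[gap - i:gap]
        for j in range(1, max_n + 1):
            if gap + j > len(sentence):
                break  # not enough characters after the gap
            b = sentence[gap:gap + j]
            pairs.append((a, b))
    return pairs


sentence = "abcdefghijkl"
pairs = gap_pairs(sentence, 6)
print(len(pairs))  # 25
print(pairs[0])    # ('f', 'g')
```

Calling `gap_pairs` once for each gap position from 1 to `len(sentence) - 1` corresponds to shifting the gap one character at a time through the sentence.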

[0059] It then checks whether the concatenation ab of each of the 25 cases x exists in the positive example data storage unit 25 (step S12), and if the concatenation ab does not exist there, passes that case x to the appearance probability estimation unit 22 (step S13).
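The existence check of steps S12 and S13 can be pictured with a short Python sketch. It assumes, purely for illustration, that the positive example data is held as a list of sentences and that a concatenation ab "exists" when it occurs as a substring of at least one of them; the function name `unseen_cases` is hypothetical.

```python
def unseen_cases(pairs: list[tuple[str, str]],
                 positive_sentences: list[str]) -> list[tuple[str, str]]:
    """Keep only the cases (a, b) whose concatenation ab is absent from the positive data."""
    return [
        (a, b)
        for a, b in pairs
        if not any((a + b) in sentence for sentence in positive_sentences)
    ]


positive = ["abcd", "bcde"]
candidates = [("ab", "c"), ("a", "x"), ("d", "e")]
print(unseen_cases(candidates, positive))  # [('a', 'x')]
```

Only the cases returned by such a filter would be handed on for probability estimation. A real implementation over a large corpus would more likely use an index such as a suffix array or an n-gram table than a linear scan.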

[0060] The appearance probability estimation unit 22 estimates the general appearance probability p(x) of the case x (step S14).

[0061] For example, suppose the positive example data D in the positive example data storage unit 25 consists of binary relations (a, b), and the two terms a and b are assumed to be mutually independent. Then the probability p(x) that the binary relation (a, b) appears is the product p(a) × p(b), where p(a) and p(b) are the appearance probabilities of a and b in the positive example data storage unit 25. In other words, by treating each case x as a binary relation (a, b) and assuming that its terms a and b are independent, the general appearance probability p(x) of each case x is calculated from the probabilities of the individual terms.
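A minimal Python sketch of this independence-based estimate is given below. The specification does not say how p(a) and p(b) themselves are obtained from the positive example data, so the sketch assumes relative frequencies: the number of occurrences of a string divided by the total number of character positions in the corpus. That normaliser and the names `count_occurrences` and `appearance_probability` are assumptions made for illustration.

```python
def count_occurrences(corpus: list[str], s: str) -> int:
    """Count occurrences of s (overlapping ones included) across all sentences."""
    total = 0
    for sentence in corpus:
        start = sentence.find(s)
        while start != -1:
            total += 1
            start = sentence.find(s, start + 1)
    return total


def appearance_probability(corpus: list[str], a: str, b: str) -> float:
    """Estimate p(x) = p(a) * p(b) for the case x = (a, b), assuming a and b are independent."""
    positions = sum(len(sentence) for sentence in corpus)  # assumed normaliser
    if positions == 0:
        return 0.0
    p_a = count_occurrences(corpus, a) / positions
    p_b = count_occurrences(corpus, b) / positions
    return p_a * p_b


corpus = ["abab", "ba"]
print(appearance_probability(corpus, "a", "b"))  # 0.25
```

In the toy corpus both "a" and "b" occur three times among six character positions, so p(a) = p(b) = 0.5 and p(x) = 0.25.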

[0062] The negative example degree calculation unit 23 then uses the appearance probability p(x) of the case x to obtain the probability Q(x) that the case x appears in the positive example data storage unit 25 (step S15).

[0063] Here, assume that the positive example data storage unit 25 holds n items of positive example data D, each independent of the others. The probability that the case x does not appear in a single trial is 1 - p(x). Since this must happen n times in succession, the probability that the case x does not appear in the positive example data D of the positive example data storage unit 25 is (1 - p(x))ⁿ, and the probability that the case x does appear in the positive example data D is Q(x) = 1 - (1 - p(x))ⁿ.
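The behaviour of Q(x) = 1 - (1 - p(x))ⁿ can be checked numerically with a short Python sketch. The function name `negative_example_degree` and the use of `math.log1p` and `math.expm1` for numerical stability are choices made for this illustration, not details taken from the specification.

```python
import math


def negative_example_degree(p_x: float, n: int) -> float:
    """Q(x) = 1 - (1 - p(x))**n: the chance that x appears at least once in n independent trials."""
    if not 0.0 <= p_x <= 1.0:
        raise ValueError("p_x must be a probability between 0 and 1")
    if p_x == 1.0:
        return 1.0
    # log1p and expm1 preserve precision when p_x is very small,
    # which is the usual situation for n-gram probabilities.
    return -math.expm1(n * math.log1p(-p_x))


print(negative_example_degree(0.5, 2))        # 0.75 up to rounding
print(negative_example_degree(1e-6, 10**4))   # about 0.01: absence is unsurprising
print(negative_example_degree(1e-6, 10**7))   # close to 1: absence is suspicious
```

With the same p(x), a small corpus gives a small Q(x), while a large corpus drives Q(x) toward 1.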

[0064] A small probability Q(x) means that the case x is unlikely to appear in the positive example data D of the positive example data storage unit 25. Because the positive example data (corpus) is small, the absence of the case is probabilistically to be expected, and so the case x may still be a positive example.

[0065] Conversely, a large probability Q(x) means that the case x is highly likely to appear in the positive example data D. Probabilistically it ought to appear in the corpus, and the fact that it does not is a contradiction. This contradiction negates either the general appearance probability p(x) or one of the independence assumptions.

[0066] Now introduce the additional assumption that "if the case x is a positive example, the general appearance probability p(x) and the independence assumptions are correct." The contradiction then yields the conclusion that "the case x cannot be a positive example." That is, "the probability Q(x) that the case x appears in the positive example data D" comes to mean "the probability Q(x) that the case x cannot be a positive example." In this sense Q(x) expresses the degree to which the case is a negative example. Q(x) is therefore called the "negative example degree", and the larger Q(x) is for a case x, the greater its negative example degree is taken to be.

[0067] The negative example acquisition unit 24 then takes the highest value of Q(x) as Q_max and the x that gives it as x_max. The larger the value of Q(x_max) for a gap, the more likely the gap is an invalid concatenation. If the value of Q(x_max) is larger than a predetermined value, the gap is saved in the teacher data storage unit 11 as negative example data E (step S16). The negative example data E may be saved in the teacher data storage unit 11 together with its negative example degree Q(x_max).
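The selection in step S16 might look like the following Python sketch. It assumes that the maximum is taken over the unseen cases generated for a single gap, and that each case arrives paired with its already computed Q(x); the function name `select_negative_example` and the threshold value are hypothetical.

```python
from typing import Optional

Case = tuple[str, str]


def select_negative_example(scored_cases: list[tuple[Case, float]],
                            threshold: float) -> Optional[tuple[Case, float]]:
    """Return (x_max, Q_max) for one gap if Q_max exceeds the threshold, otherwise None."""
    if not scored_cases:
        return None  # every pair at this gap was found in the positive data
    x_max, q_max = max(scored_cases, key=lambda item: item[1])
    return (x_max, q_max) if q_max > threshold else None


scored = [(("a", "b"), 0.20), (("c", "d"), 0.90)]
print(select_negative_example(scored, threshold=0.5))   # (('c', 'd'), 0.9)
print(select_negative_example(scored, threshold=0.95))  # None
```

Returning Q_max alongside x_max reflects the option, mentioned above, of storing the negative example degree together with the negative example data E.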

[0068] By carrying out the processing of steps S11 to S15 above for every gap in the sentence, negative example data E can be obtained using the frequency information of the positive example data D in the positive example data storage unit 25, and the positive example data D and the negative example data E can be prepared in the teacher data storage unit 11 as teacher data.

[0069] The subsequent processing is the same as the error detection processing described in the first embodiment, so its description is omitted.

[0070] The present invention has been described above by way of its embodiments, but various modifications are possible within the scope of its gist.

[0071] For example, the appearance probability estimation unit 22 of the notation error detection device 1 may calculate the general appearance probability p(x) of the case x by any method, and is not limited to the method described in the embodiments of the present invention.

[0072] The positive example data D in the teacher data storage unit 11 may be the positive example data D stored in the positive example data storage unit 25, or separately prepared positive example data may be used instead.

[0073]

[Effects of the Invention] As described above, the present invention performs notation error detection processing using a machine learning method in which positive examples and negative examples serve as teacher signals. Because it also uses information from negative examples, the present invention can obtain processing results of markedly higher accuracy than a processing method that uses positive examples alone.

[0074] The present invention also uses the frequency information of the positive examples to extract negative examples from them, and uses the extracted negative examples as teacher signals for the machine learning method. Because it uses information from negative examples extracted automatically from positive examples, the present invention can reduce the burden of generating negative examples in problems such as notation error detection, where positive examples exist but negative examples are difficult to obtain.

[Brief description of drawings]

FIG. 1 is a diagram showing a configuration example of the notation error detection device according to the first embodiment.

FIG. 2 is a diagram showing a configuration example of the teacher data storage unit.

FIG. 3 is a flowchart of the notation error detection processing.

FIG. 4 is a diagram showing an example of features.

FIG. 5 is a diagram showing a configuration example of the notation error detection device according to the second embodiment.

FIG. 6 is a flowchart of the negative example data acquisition processing.

FIG. 7 is a diagram providing a supplementary explanation of a conventional method.

[Explanation of symbols]

1 Notation error detection device
2 Data
3 Estimation result
11 Teacher data storage unit
12 Solution-feature pair extraction unit
13 Machine learning unit
14 Learning result data storage unit
15 Feature extraction unit
16 Error detection unit
20 Corpus
21 Existence determination unit
22 Appearance probability estimation unit
23 Negative example degree calculation unit
24 Negative example acquisition unit
25 Positive example data storage unit

Claims

[Claims]

1. A notation error detection processing method using a supervised machine learning method, being a processing method for detecting notation errors, comprising: a processing step of extracting feature-solution pairs from teacher data including positive example data, which is correct notation, and negative example data, which is incorrect notation, performing machine learning using the extracted feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and a processing step of extracting features from input data and detecting a notation error based on the learning result stored in the learning result data storage unit.

2. A notation error detection processing method using a supervised machine learning method, being a processing method for detecting notation errors, comprising: a processing step of calculating a general appearance probability of an input case when the case does not exist in positive example data, which is correct notation prepared in advance; a processing step of calculating, based on the general appearance probability, the probability that the case appears in the positive example data, and treating the case as negative example data when that probability exceeds a predetermined threshold; a processing step of extracting feature-solution pairs from teacher data including positive example data and the negative example data, performing machine learning using the feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and a processing step of extracting features from input data and detecting a notation error in the data based on the learning result.

3. A notation error detection processing device using a supervised machine learning method, being a processing device for detecting notation errors, comprising: machine learning processing means for extracting feature-solution pairs from teacher data including positive example data, which is correct notation, and negative example data, which is incorrect notation, performing machine learning using the feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and error detection processing means for extracting features from input data and detecting a notation error in the data based on the learning result.

4. A notation error detection processing device using a supervised machine learning method, being a processing device for detecting notation errors, comprising: appearance probability calculation processing means for calculating a general appearance probability of an input case when the case does not exist in positive example data, which is correct notation prepared in advance; negative example acquisition processing means for calculating, based on the general appearance probability, the probability that the case appears in the positive example data, and treating the case as negative example data when that probability exceeds a predetermined threshold; machine learning processing means for extracting feature-solution pairs from teacher data including the positive example data and the negative example data, performing machine learning using the extracted feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and error detection processing means for extracting features from input data and detecting a notation error based on the learning result stored in the learning result data storage unit.

5. A notation error detection processing program using a supervised machine learning method, being a program for causing a computer to execute processing for detecting notation errors, the program causing the computer to execute: a process of extracting feature-solution pairs from teacher data including positive example data, which is correct notation, and negative example data, which is incorrect notation, performing machine learning using the feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and a process of extracting features from input data and detecting a notation error in the data based on the learning result.

6. A notation error detection processing program using a supervised machine learning method, being a program for causing a computer to execute processing for detecting notation errors, the program causing the computer to execute: a process of calculating a general appearance probability of an input case when the case does not exist in positive example data, which is correct notation prepared in advance; a process of calculating, based on the general appearance probability, the probability that the case appears in the positive example data, and treating the case as negative example data when that probability exceeds a predetermined threshold; a process of extracting feature-solution pairs from teacher data including positive example data and the negative example data, performing machine learning using the feature-solution pairs as a teacher signal, and storing the learning result in a learning result data storage unit; and a process of extracting features from input data and detecting a notation error in the data based on the learning result.