JP3692399B2 - Notation error detection processing apparatus using supervised machine learning method, its processing method, and its processing program

Info

Publication number: JP3692399B2
Application number: JP2001393734A
Authority: JP
Inventors: Masaki Murata (村田真樹); Hitoshi Isahara (井佐原均)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2001-12-26
Filing date: 2001-12-26
Publication date: 2005-09-07
Anticipated expiration: 2021-12-26
Also published as: JP2003196636A

Description

【０００１】
【発明の属する技術分野】
本発明は、表記誤り検出処理に関し、特に教師あり機械学習法を用いた表記誤り検出処理方法と、その処理を実現する処理装置と、およびその処理をコンピュータに実行させるためのプログラムとに関する。
【０００２】
【従来の技術】
日本語の場合の単語の表記誤り検出は、英語の場合に比べてはるかに難しいものである。英語の場合は単語でわかち書きされているために、基本的に単語辞書と単語末の変形の規則とを用意しておくことにより、ほぼ高精度に単語のスペルチェックを行なうことができる。これに対して、日本語の場合は単語でわかち書きされていないために、単語の表記誤りに限定した処理であっても、高精度に行なうことが困難である。
【０００３】
また、表記の誤りとしては、単語表記の誤りの他に、助詞の「て」「に」「を」「は」の運用誤りなどの文法的な誤りも存在する。
【０００４】
日本語の表記誤りの検出の主な従来技術として以下のものがある。
【０００５】
単語辞書やひらがな連続を登録した辞書や、連接の条件を記述した辞書にもとづいて表記誤りを検出する従来手法などが、以下の参考文献１〜参考文献３に記載されている。これらの従来手法では、単語辞書やひらがな連続を登録した辞書にないものがあらわれると表記誤りであると判定したり、連接の条件を記述した辞書において満足されない連接の出現が存在すると表記誤りであると判定する。
［参考文献１：納富一宏，日本語文書校正支援ツールｈｓｐの開発，情報処理学会研究発表会（デジタル・ドキュメント），(1997)，pp.9-16 ］
［参考文献２：川原一真他，コーパスから抽出された辞書を用いた表記誤り検出法，情報処理学会第５４回全国大会，(1997)，pp.2-21-2-22］
［参考文献３：白木伸征他，大量の平仮名列登録による日本語スペルチェッカの作成、言語処理学会年次大会，(1997)，pp.445-448］
また、文字単位のｎｇｒａｍを利用した確率モデルにもとづいて各文字列の生起確率を求め、生起確率の低い文字列が出現する箇所を表記誤りと判定する従来手法などが、以下の参考文献４〜参考文献６に記載されている。
［参考文献４：荒木哲郎他，２重マルコフモデルによる日本語文の誤り検出並びに訂正法，情報処理学会自然言語処理研究会 NL97-5，(1997)，pp.29-35］
［参考文献５：松山高明他，ｎ−ｇｒａｍによるｏｃｒ誤り検出の能力検討のための適合率と再現率の推定に関する実験と考察，言語処理学会年次大会(1996), pp.129-132］
［参考文献６：竹内孔一他，統計的言語モデルを用いたＯＣＲ誤り修正システムの構築，情報処理学会論文誌，Vol.40, No.6, (1999)］
上記の従来手法のうち、参考文献５のｎｇｒａｍ確率を利用する手法は、主に光学式文字読み取り装置（Optical Character Reader：ＯＣＲ）の誤り訂正システムにおける表記誤り検出に用いられているものである。ＯＣＲ誤り訂正システムの場合は、前提として表記誤りの出現率が５〜１０％と高く、普通に人がものを書くときに誤る確率より高い。したがって、表記誤りの検出の再現率、適合率は高くなりやすく、比較的容易な問題の設定となる。
【０００６】
また、上記の従来手法の中で最も良さそうに思われる竹内らの方法、すなわち参考文献６に記載されている従来手法（以下、従来手法Ａという。）を、以下で簡単に説明する。
【０００７】
従来手法Ａでは、まず、表記誤りを検出したいテキストを頭から一文字ずつずらしながら3 文字連続を抽出し、抽出した部分のコーパス( 正しい日本語文の集合) での出現確率がＴｐ以下の場合に、その各３文字連続に−１を加えていき、与えられた値がＴｓ以上となった文字を誤りと判定する。例えば、Ｔｐ＝０、Ｔｓ＝−２とする。ここで、Ｔｐ＝０としているために、出現確率をわざわざ求める必要はなく、コーパスにその３文字連続が出現するか否かを調べるということをするだけでよい。Ｔｐ＞０とした場合は、抽出した部分がコーパスに出現するものがあっても誤りと判定するものとなる。しかし、出現確率が低くともコーパスに出現していれば、それは誤りとしなくてよいだろうからＴｐ＞０は適切ではなく、Ｔｐ＝０の設定は良いとする。
【０００８】
従来手法Ａの補足説明として、「負の事零の検出」という日本語表現に対して誤り検出を行なうことを考える。このとき、日本語表現の頭から「負の事」「の事零」といった連続する３文字を切り出し、これらがコーパスにあるかどうかを調べ、切り出した３文字がなければその３文字に−１を与える。この場合「の事零」「事零の」がなかったため、図７に示すようなｔｒｉｇｒａｍによる得点が与えられ、結果として「−２」点となった「事」と「零」の部分が誤りと判定される。この従来手法Ａは、コーパスに高頻度に出現する文字３−ｇｒａｍをうまく組み合わせて誤りを検出する方法となっている。
【０００９】
しかし、結局のところ、従来手法Ａの処理は、コーパスにその表現が存在するか否かを判定するものである。すなわち、従来手法Ａは、辞書にないものがあらわれると誤りとする上記の他の従来手法とよく似たものである。
【００１０】
機械学習法については、以下の参考文献７に述べられているように、正の例のみからの学習は一般的に困難であることが知られている。
［参考文献７：横森貫他，型式言語の学習−正の例からの学習を中心に−，情報処理学会誌，Vol.32, No.3, (1991), pp226-235 ］
さらに、教師信号とする誤った表記データ（負の例）は、正しい表記データ（正の例）に比べて一般的に取得することが困難であると考えられている。
【００１１】
【発明が解決しようとする課題】
従来は、正の例のみを教師信号とする機械学習法を用いた処理方法では高い精度の処理が期待できないこと、および、教師信号とする負の例の取得が困難であることから、文章の表記誤り検出処理において、正の例および負の例の両方を教師信号とした機械学習法を利用した処理方法は実現されていなかった。
【００１２】
本発明の目的は、正の例および負の例を教師信号とする機械学習法を用いて、精度の高い表記誤り検出処理を実現することである。
【００１３】
また、本発明の別の目的は、教師信号とする負の例を効率よく自動生成し、機械学習法の教師信号として用いて、精度の高い表記誤り検出処理を実現することである。
【００１４】
【課題を解決するための手段】
上記の目的を達成するため、本発明は、コンピュータが読み取り可能な記憶装置に格納された文データ中の表記の誤りを機械学習処理を用いて検出する教師あり機械学習法を用いた表記誤り検出処理装置であって、以下の記憶手段および処理手段を備えるものである。
【００１５】
本発明は、１）問題と解との組で構成される教師データとして、問題が正しい表記の文字列であって解が正しい表記を示す正である正の例データの事例と問題が誤った表記の文字列であって解が誤りの表記を示す負である負の例データの事例とが格納された教師データ記憶手段と、２）前記教師データ記憶手段から前記事例を取り出し、前記事例ごとに、前記事例の問題から連接関係に関する所定の情報を素性として抽出し、前記抽出した素性の集合と解との対を生成する解−素性対抽出手段と、３）所定の機械学習アルゴリズムにもとづいて、前記解と素性の集合との対について、どのような素性の集合との場合に解が正または負であるかということを機械学習処理し、学習結果として前記どのような素性の集合との場合に解が正または負であるかということを学習結果データ記憶手段に保存する機械学習手段と、４）前記記憶装置に格納された文データから検出対象の文字列を取り出し、前記解−素性対抽出手段が行う抽出処理と同様の抽出処理によって、前記検出対象の文字列から前記所定の情報を素性として抽出する素性抽出手段と、５）前記学習結果データ記憶手段に学習結果として保存された前記どのような素性の集合との場合に解が正または負であるかということにもとづいて、前記検出対象の文字列の素性の集合の場合の正または負の度合いを推定し、前記推定結果として負の例の度合いが大きい場合に、前記検出対象の文字列を表記の誤りとして検出する誤り検出手段とを備える。
【００１６】
または、本発明は、前記構成をとる場合に、さらに以下の処理手段を備えるものである。
【００１７】
本発明は、６）問題と解との組で構成される教師データとして、問題が正しい表記の文字列であって解が正である正の例データの事例を記憶する正の例データ記憶手段と、７）前記正および負のいずれの解も付与されていない文データを格納するコーパス記憶手段と、８）前記コーパス記憶手段から文データを取り出し、前記取り出した文データから取り出した事例が、前記正の例データ記憶手段に格納された正の例データ内に存在するか否かを判定する正の例存在判定手段と、９）前記文データの事例が前記正の例データ内に存在しない場合に、前記文データの事例が教師データ記憶手段に格納された前記教師データ内で出現する出現確率を所定の式を用いて算出する出現確率推定手段と、１０）前記文データの事例について、前記事例の解が負である傾向を示す負の例度合いを、前記出現確率をもとに算出する負の例度合い算出手段と、１１）前記文データの事例についての負の例度合いが所定の値を超える場合に、前記事例の文データに負の解を付与して負の例データを生成し、前記負の例データを前記教師データ記憶手段に格納する負の例取得手段とを備える。
【００１８】
また、本発明は、コンピュータが読み取り可能な記憶装置に格納された文データ中の表記の誤りを検出する検出処理装置において、機械学習処理を用いて前記文データ中の表記の誤りを検出する処理方法を、所定の処理手段を備えたコンピュータである検出処理装置が行うものである。
【００１９】
また、本発明は、コンピュータが読み取り可能な記憶装置に格納された文データ中の表記の誤りを機械学習処理を用いて検出する処理装置としてコンピュータを機能させるためのプログラムであって、前記教師あり機械学習法を用いた表記誤り検出処理装置の各処理手段としてコンピュータを機能させるためのものである。
【００２０】
本発明の各手段または機能または要素をコンピュータにより実現するプログラムは、コンピュータが読み取り可能な、可搬媒体メモリ、半導体メモリ、ハードディスクなどの適当な記録媒体に格納することができ、これらの記録媒体に記録して提供され、または、通信インタフェースを介して種々の通信網を利用した送受信により提供される。
【００２１】
【発明の実施の形態】
以下に、本発明の第１の実施の形態として、日本語表記誤りを連接により検出する処理を説明する。
【００２２】
図１に、本発明の第１の実施の形態における表記誤り検出装置１の構成例を示す。
【００２３】
表記誤り検出装置１は、教師データ記憶部１１と、解−素性対抽出部１２と、機械学習部１３と、学習結果データ記憶部１４と、素性抽出部１５と、誤り検出部１６とを持つ。
【００２４】
教師データ記憶部１１は、機械学習法を実施する際の教師信号となるデータ（教師データ）を記憶する手段である。教師データ記憶部１１には、教師データとして、正しい表記である事例（正の例）と誤った表記である事例（負の例）とが記憶される。正の例は、例えば正しい文の集合であるコーパス等を利用してもよい。負の例は、誤った表記であって一般的なデータはないため、予め人手により生成したものを用いる。または、後述するような負の例予測処理方法を用いて正の例から生成するようにしてもよい。
【００２５】
解−素性対抽出部１２は、教師データ記憶部１１に記憶されている教師データの各事例ごとに、事例の解と素性の集合との組を抽出する手段である。
【００２６】
機械学習部１３は、解−素性対抽出部１２により抽出された解と素性の集合の組から、どのような素性のときにどのような解になりやすいかを機械学習法により学習する手段である。その学習結果は、学習結果データ記憶部１４に保存される。
【００２７】
素性抽出部１５は、表記誤り検出対象であるデータ２から素性の集合を抽出し、抽出した素性の集合を誤り検出部１６へ渡す手段である。
【００２８】
誤り検出部１６は、学習結果データ記憶部１４の学習結果データを参照して、素性抽出部１５から渡された素性の集合の場合に、どのような解になりやすいか、すなわち表記誤りであるかどうかを推定し、その推定結果３を出力する手段である。
【００２９】
図２に、教師データ記憶部１１のデータ構成例を示す。教師データ記憶部１１には、問題と解との組である教師データが記憶されている。例えば、文の各文字のすき間（＜｜＞で表す。）を問題として、そのすき間の連接の解（正解、誤り）が対応付けられた教師データが記憶される。図２の教師データのうち、
「問題−解：説明した方法で＜｜＞を用いることができる−誤り」
は、負の例データＥの例であり、
「問題−解：説明した方法＜｜＞でを用いることができる−正」
は、正の例データＤの例である。
【００３０】
図３に、表記誤り検出処理の処理フローチャートを示す。表記誤り検出処理前に、正の例データＤおよび負の例データＥが教師データ記憶部１１に記憶されているとする。
【００３１】
まず、解−素性対抽出部１２は、教師データ記憶部１１から、各事例ごとに、解と素性の集合との組を抽出する（ステップＳ１）。素性とは、解析に用いる情報の細かい１単位を意味する。素性として連接の判定対象となる文字のすき間ごとに以下のものを抽出する。
【００３２】
・前接および後接の各１〜５ｇｒａｍの文字列、
・対象（すき間）を含めた１〜５ｇｒａｍの文字列（ただし、対象であるすき間（＜｜＞）は１文字として扱う。）
・前接および後接の単語（単語の抽出は既存の形態素解析処理を行う処理手段（図１には図示しない）などを利用する。）
・前接および後接の単語の品詞
例えば、「問題−解」が、
「説明した方法で＜｜＞を用いることができる−誤り」
である場合には、図４に示すような素性を抽出する。すなわち、以下の素性を抽出する。
【００３３】
素性：前接「した方法で」，前接「た方法で」，前接「方法で」，前接「法で」，前接「で」，後接「を用いるこ」，後接「を用いる」，後接「を用い」，後接「を用」，後接「を」，「た方法で＜｜＞」，「方法で＜｜＞を」，「法で＜｜＞を用」，「で＜｜＞を用い」，「＜｜＞を用いる」，前接「で」，後接「を」，前接「助詞」，後接「助詞」
次に、機械学習部１３は、抽出した解と素性の集合との組から、どのような素性のときにどのような解になりやすいかを機械学習し、その学習結果を学習結果データ記憶部１４に保存する（ステップＳ２）。
【００３４】
機械学習の手法としては、例えば、決定リスト法、最大エントロピー法、サポートベクトルマシン法などを用いる。
【００３５】
決定リスト法は、素性と分類先の組を規則とし、それらをあらかじめ定めた優先順序でリストに蓄えおき、検出する対象となる入力が与えられたときに、リストで優先順位の高いところから入力のデータと規則の素性を比較し、素性が一致した規則の分類先をその入力の分類先とする方法である。
【００３６】
最大エントロピー法は、あらかじめ設定しておいた素性ｆ_j（１≦ｊ≦ｋ）の集合をＦとするとき、所定の条件式を満足しながらエントロピーを意味する式を最大にするときの確率分布ｐ（ａ，ｂ）を求め、その確率分布にしたがって求まる各分類の確率のうち、もっとも大きい確率値を持つ分類を求める分類とする方法である。
【００３７】
サポートベクトルマシン法は、空間を超平面で分割することにより、２つの分類からなるデータを分類する手法である。
【００３８】
決定リスト法および最大エントロピー法については、以下の参考文献８に、サポートベクトルマシン法については、以下の参考文献９および参考文献１０に説明されている。
［参考文献８：村田真樹、内山将夫、内元清貴、馬青、井佐原均、種々の機械学習法を用いた多義解消実験、電子情報通信学会言語理解とコミュニケーション研究会，NCL2001-2, (2001) ]
［参考文献９：Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods,(Cambridge University Press,2000) ］
［参考文献１０：Taku Kudoh, Tinysvm:Support Vector machines,(http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html,2000)］
なお、機械学習部１３では、上記の手法に限定されずに、教師あり機械学習法であればどのような手法でも使用することができる。
【００３９】
その後、素性抽出部１５は、解を求めたいデータ２を入力し（ステップＳ３）、解−素性対抽出部１２での処理とほぼ同様に、データ２から素性の集合を取り出し、それらを誤り検出部１６へ渡す（ステップＳ４）。
【００４０】
誤り検出部１６は、渡された素性の集合の場合にどのような解になりやすいかを学習結果データ記憶部１４の学習結果データをもとに特定し、特定した解すなわち表記誤りかどうかの推定結果３を出力する（ステップＳ５）。
【００４１】
例えば、解析したい問題がすき間＜｜＞の連接である場合に、データ２が「説明した方法で＜｜＞を用いることができる」であれば、「誤り」という推定結果３を出力する。
【００４２】
次に、本発明の第２の実施の形態について説明する。
【００４３】
教師データ記憶部１１の正の例データＤについては、コーパス等を利用できるため比較的容易に取得できる。しかし、負の例データＥは、容易に取得できないため人手により生成するが、かかる生成作業の負担は大きい。
【００４４】
また、教師データは多量であるほうが処理精度が向上するため、できる限り多量の教師データを用意することが望ましい。
【００４５】
そこで、多量な正の例データから負の例データを予測する方法を考える。
【００４６】
正の例から負の例を予測する単純な方法として、既知の正の例のデータに現れなかったものをすべて負の例とするという手法が考えられる。しかし、実際には未出現の正の例の存在が考えられるために、このような単純な方法を用いると、多くの未出現の正の例を負の例であると判定してしまうことになるという問題があり、このような方法で生成した負の例を高精度の処理に適用することができない。
【００４７】
例えば大規模な既存のコーパス（日本語の文の集合）をすべて正しいと仮定すると、その既存のコーパスを正しい文（正の例）と考え、この正の例を用いて、表記誤り（負の例）を予測する方法により、自動的に負の例を生成することができる。
【００４８】
これにより、教師データとする負の例が豊富になり、生成作業の負担を軽減し、かつ、教師データ付きの機械学習法を利用した高精度の表記誤り検出処理を実現できることになる。
【００４９】
本形態における表記誤り検出装置１は、まず、正の例か負の例か判定すべき未知の事例ｘの一般的な出現確率ｐ（ｘ）を算出する。次に、この出現確率ｐ（ｘ）で既知の正の例データＤに出現しないことが不自然である場合に、すなわち、一般的な出現確率が高く当然正の例データＤに出現するであろう状態にも関わらず既知の正の例データＤに出現しない場合には、事例ｘの負の例の度合いが高いと推測し、所定の値より高い負の例の度合いの事例ｘを負の例データＥとする。そして、かかる負の例データＥと正の例データＤとを教師信号とした機械学習法により表記誤り検出処理を行う。
【００５０】
図５に、本発明の第２の実施の形態における表記誤り検出装置１の構成例を示す。
【００５１】
表記誤り検出装置１は、教師データ記憶部１１と、解−素性対抽出部１２と、機械学習部１３と、素性抽出部１５と、誤り検出部１６と、存在判定部２１と、出現確率推定部２２と、負の例度合い算出部２３と、負の例取得部２４と、正の例データ記憶部２５とを持つ。
【００５２】
教師データ記憶部１１と、解−素性対抽出部１２と、機械学習部１３と、素性抽出部１５と、誤り検出部１６とは、第１の実施の形態で説明した表記誤り検出装置１の各手段と同一の手段であるので説明を省略する（図１参照）。
【００５３】
存在判定部２１は、正または負の情報が付与されていない日本語文の集合であるコーパス２０の事例ｘが、正の例データ記憶部２５に記憶されている正の例データＤに存在するかどうかを判定する手段である。
【００５４】
出現確率推定部２２は、事例ｘが正の例データ記憶部２５に存在しない場合に、事例ｘの一般的な出現確率（頻度）ｐ（ｘ）を算出する手段である。
【００５５】
負の例度合い算出部２３は、出現確率ｐ（ｘ）をもとに事例ｘの負の例度合いＱ（ｘ）を算出する手段である。
【００５６】
負の例取得部２４は、負の例度合い算出部２３から受け取った事例ｘの負の例度合いＱ（ｘ）が所定の値を超える場合に、その事例ｘを負の例データＥとし、事例ｘを問題−解の構想の教師データ（負の例データＥ）として教師データ記憶部１１に記憶する手段である。
【００５７】
図６に、第２の実施の形態において学習データとなる負の例データの取得処理の処理フローチャートを示す。
【００５８】
表記誤り検出装置１の存在判定部２１は、コーパス２０から正の例か負の例かが未知である文を入力し、文の頭から、文字のすき間を１つずつずらしながら、各すき間を連接チェックの対象として、そのすき間に前接する１〜５ｇｒａｍの文字列ａと、後接する１〜５ｇｒａｍの文字列ｂを取り出し、この任意のペアである事例ｘ＝（ａ、ｂ）を生成する（ステップＳ１１）。ここでは、２５個の事例（ペア）が生成されることになる。
【００５９】
そして、事例ｘの２５個の連接ａｂが正の例データ記憶部２５にあるかどうかを調べ（ステップＳ１２）、連接ａｂが正の例データ記憶部２５に存在しなければ、その事例ｘを出現確率推定部２２へ渡す（ステップＳ１３）。
【００６０】
出現確率推定部２２は、事例ｘの一般的な出現確率ｐ（ｘ）を推定する（ステップＳ１４）。
【００６１】
例えば、正の例データ記憶部２５の正の例データＤは二項関係（ａ，ｂ）からなり、二項のａとｂとがお互いに独立であると仮定すると、二項関係（ａ，ｂ）の出現する確率はｐ（ｘ）は、ａ、ｂの正の例データ記憶部２５での出現確率をｐ（ａ）、ｐ（ｂ）とするとき、その積ｐ（ａ）×ｐ（ｂ）となる。すなわち、各事例ｘを二項関係（ａ，ｂ）とし、その各項ａ、ｂを独立と仮定することで、各事例ｘの一般的な出現確率ｐ（ｘ）を、各項ａ、ｂの確率により計算する。
【００６２】
そして、負の例度合い算出部２３は、事例ｘの出現確率ｐ（ｘ）を使って、事例ｘが正の例データ記憶部２５に出現する確率Ｑ（ｘ）を求める（ステップＳ１５）。
【００６３】
このとき、正の例データ記憶部２５の正の例データＤがｎ個でありそれぞれが独立であることを仮定すると、１回試行して事例ｘが出現しない確率は１−ｐ（ｘ）であり、これがｎ回連続して起こるということから、事例ｘが正の例データ記憶部２５の正の例データＤに出現しない確率は（１−ｐ（ｘ））ⁿとなり、事例ｘが同じく正の例データＤに出現する確率Ｑ（ｘ）＝１−（１−ｐ（ｘ））ⁿとなる。
【００６４】
ところで、「確率Ｑ（ｘ）が小さい」というのは、確率的に事例ｘが正の例データ記憶部２５の正の例データＤに出現する確率が低いということであり、正の例データ（コーパス）が小さいために確率的に出現しないということが保証されたことを意味するため、「事例ｘは正の例でありうる。」という意味になる。
【００６５】
逆に、「確率Ｑ（ｘ）が大きい」というのは、確率的に事例ｘが正の例データＤに出現する確率が高いということであり、確率的には同コーパスに当然出現すべきということになり、それなのに実際は出現しなかったということで矛盾が生じることになる。この矛盾により、一般的な出現確率ｐ（ｘ）か種々の独立の仮定が否定されることになる。
【００６６】
ここで、「事例ｘが正の例である場合は、一般的な出現確率ｐ（ｘ）および種々の独立の仮定が正しい。」と新たに仮定すると、この矛盾により「事例ｘは正の例でありえない。」が導出されることになる。すなわち、「事例ｘが正の例データＤに出現する確率Ｑ（ｘ）」は、「事例ｘが正の例でありえない確率Ｑ（ｘ）」を意味することになる。そういう意味で、Ｑ（ｘ）は負の例の度合いを意味するものとなる。よって、このＱ（ｘ）を「負の例度合い」とし、事例ｘのＱ（ｘ）が大きいほど事例ｘの負の例の度合いが大きいとする。
【００６７】
そして、負の例取得部２４は、最もＱ（ｘ）の値が高いときのその値をＱ_max、またｘをｘ_maxとし、Ｑ（ｘ_max）の値が大きいすき間ほど、妥当でない連接の可能性が高いとして、Ｑ（ｘ_max）の値が、所定の値より大きい場合には、そのすき間を負の例データＥとして教師データ記憶部１１へ保存する（ステップＳ１６）。なお、負の例データＥとその負の例の度合いＱ（ｘ_max）とを教師データ記憶部１１に保存してもよい。
【００６８】
以上のステップＳ１１〜ステップＳ１５の処理を、文の全てのすき間について行っていくことにより、正の例データ記憶部２５の正の例データＤの頻度情報を用いて負の例データＥを取得することができ、正の例データＤおよび負の例データＥを教師データとして教師データ記憶部１１に用意することができる。
【００６９】
以降の処理は、第１の実施の形態で説明した誤り検出処理と同様であるので、説明を省略する。
【００７０】
以上、本発明をその実施の形態により説明したが、本発明はその主旨の範囲において種々の変形が可能である。
【００７１】
例えば、表記誤り検出装置１の出現確率推定部２２は、事例ｘの一般的な出現確率ｐ（ｘ）を、何らかの方法で算出すればよく、本発明の実施の形態で説明した方法に限られるものではない。
【００７２】
また、教師データ記憶部１１の正の例データＤは、正の例データ記憶部２５に記憶されている正の例データＤを使用することもでき、また、別に用意した正の例データを使用することもできる。
【００７３】
【発明の効果】
以上説明したように、本発明は、正の例と負の例とを教師信号とする機械学習法を用いて表記誤り検出処理を行う。負の例の情報も用いる本発明は、正の例だけを用いた処理方法に比べて、格段に高い精度の処理結果を得ることができる。
【００７４】
また、本発明は、正の例の頻度情報を用いて、正の例から負の例を抽出する処理を行い、抽出した負の例を機械学習法の教師信号とする。正の例から自動的に抽出される負の例の情報を用いる本発明は、表記誤り検出のように正の例が存在するが負の例の取得が困難な問題において、負の例を生成する処理負担を軽減することができる。
【図面の簡単な説明】
【図１】第１の実施の形態における表記誤り検出装置の構成例を示す図である。
【図２】教師データ記憶部の構成例を示す図である。
【図３】表記誤り検出処理の処理フローチャート図である。
【図４】素性の例を示す図である。
【図５】第２の実施の形態における表記誤り検出装置の構成例を示す図である。
【図６】負の例データ取得処理の処理フローチャート図である。
【図７】従来手法を補足的に説明するための図である。
【符号の説明】
１表記誤り検出装置
２データ
３推定結果
１１教師データ記憶部
１２解−素性対抽出部
１３機械学習部
１４学習結果データ記憶部
１５素性抽出部
１６誤り検出部
２０コーパス
２１存在判定部
２２出現確率推定部
２３負の例度合い算出部
２４負の例取得部
２５正の例データ記憶部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to notation error detection processing, and in particular, to a notation error detection processing method using a supervised machine learning method, a processing device that realizes the processing, and a program for causing a computer to execute the processing.
[0002]
[Prior art]
Detecting word notation errors in Japanese is far more difficult than in English. Because English text is segmented into words by spaces, spelling can be checked with fairly high accuracy simply by preparing a word dictionary and rules for word-final inflection. Japanese text, in contrast, is not segmented into words, so high accuracy is difficult to achieve even when processing is limited to word notation errors.
[0003]
In addition to word notation errors, notation errors also include grammatical errors such as misuse of the particles "te" (て), "ni" (に), "wo" (を), and "wa" (は).
[0004]
The following are the main conventional techniques for detecting Japanese notation errors.
[0005]
References 1 to 3 below describe conventional methods that detect notation errors based on a word dictionary, a dictionary of registered hiragana sequences, or a dictionary describing concatenation conditions. These methods judge that a notation error exists when a string appears that is not in the word dictionary or the hiragana-sequence dictionary, or when a concatenation occurs that does not satisfy the conditions described in the concatenation dictionary.
[Reference 1: Kazuhiro Notomi, Development of hsp, a Japanese document proofreading support tool, IPSJ SIG Technical Meeting (Digital Documents), (1997), pp.9-16]
[Reference 2: Kazuma Kawahara et al., Typographical error detection method using a dictionary extracted from corpus, Information Processing Society of Japan 54th National Convention, (1997), pp.2-21-2-22]
[Reference 3: Nobuki Shiraki et al., Creation of a Japanese spell checker by registering a large number of hiragana strings, Annual Conference of the Association for Natural Language Processing, (1997), pp.445-448]
In addition, References 4 to 6 below describe conventional methods that compute the occurrence probability of each character string from a probabilistic model based on character n-grams and judge positions where low-probability strings appear to be notation errors.
[Reference 4: Tetsuro Araki et al., Japanese sentence error detection and correction method using a second-order Markov model, IPSJ SIG on Natural Language Processing, NL97-5, (1997), pp.29-35]
[Reference 5: Takaaki Matsuyama et al., Experiments and discussion on estimating precision and recall for examining the capability of OCR error detection by n-gram, Annual Meeting of the Association for Natural Language Processing, (1996), pp.129-132]
[Reference 6: Koichi Takeuchi et al., Construction of OCR error correction system using statistical language model, IPSJ Transactions, Vol.40, No.6, (1999)]
Among the above conventional methods, the n-gram probability method of Reference 5 is used mainly for notation error detection in error correction systems for optical character readers (OCR). In OCR error correction, the error rate is assumed to be as high as 5 to 10%, which is higher than the rate at which people ordinarily make mistakes when writing. Recall and precision of notation error detection therefore tend to be high, and the problem setting is relatively easy.
[0006]
The method of Takeuchi et al., which appears to be the best of the above conventional methods, that is, the conventional method described in Reference 6 (hereinafter, conventional method A), is briefly described below.
[0007]
In conventional method A, character trigrams are first extracted from the text to be checked by sliding one character at a time from the beginning. When the occurrence probability of an extracted trigram in a corpus (a set of correct Japanese sentences) is Tp or less, −1 is added to each of its three characters, and a character whose accumulated value reaches Ts is judged to be an error. For example, let Tp = 0 and Ts = −2. Because Tp = 0, there is no need to actually compute the occurrence probability; it is enough to check whether the trigram appears in the corpus at all. With Tp > 0, some trigrams would be judged erroneous even though they appear in the corpus. However, a string that appears in the corpus, even with low probability, need not be treated as an error, so Tp > 0 is not appropriate and the setting Tp = 0 is considered sound.
[0008]
As a supplementary explanation of conventional method A, consider detecting errors in the Japanese expression "負の事零の検出". Consecutive three-character strings such as "負の事" and "の事零" are cut out from the beginning of the expression and looked up in the corpus; if a trigram is absent from the corpus, −1 is given to each of its three characters. In this case "の事零" and "事零の" were absent, so the trigram scores shown in FIG. 7 are assigned, and the characters "事" and "零", which received a score of −2, are judged to be errors. Conventional method A thus detects errors by skillfully combining character 3-grams that appear frequently in the corpus.
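The trigram scoring of conventional method A can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of Reference 6: the function names `trigram_set` and `detect_errors` and the two-sentence toy corpus are hypothetical, Tp = 0 is modelled as a plain membership test, and a character is flagged when its accumulated score is at or below Ts (an assumed reading of the threshold that matches the example, where characters scoring −2 are flagged).

```python
def trigram_set(corpus):
    """Collect every character trigram that occurs in the corpus sentences."""
    return {s[i:i + 3] for s in corpus for i in range(len(s) - 2)}


def detect_errors(text, known, ts=-2):
    """Score each character; with Tp = 0 a trigram is penalised iff it is unseen."""
    scores = [0] * len(text)
    for i in range(len(text) - 2):
        if text[i:i + 3] not in known:
            # Add -1 to each of the three characters of the unseen trigram.
            for j in range(i, i + 3):
                scores[j] -= 1
    # Assumption: a character is an error once its score has fallen to ts.
    flagged = [i for i, s in enumerate(scores) if s <= ts]
    return scores, flagged


# Toy corpus (hypothetical) chosen so that only "の事零" and "事零の" are unseen.
known = trigram_set(["負の事例の検出", "零の検出"])
text = "負の事零の検出"
scores, flagged = detect_errors(text, known)
print(scores)                      # per-character scores
print([text[i] for i in flagged])  # characters judged to be errors
```

With this toy corpus the two unseen trigrams overlap on "事" and "零", so only those two characters accumulate −2.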
[0009]
In the end, however, conventional method A merely determines whether an expression exists in the corpus. In other words, it closely resembles the other conventional methods described above, which treat anything not found in the dictionary as an error.
[0010]
As for the machine learning method, as described in Reference Document 7 below, it is known that learning from only positive examples is generally difficult.
[Reference 7: Takashi Yokomori et al., Learning of formal languages: focusing on learning from positive examples, Journal of Information Processing Society of Japan, Vol.32, No.3, (1991), pp.226-235]
Furthermore, incorrect notation data (negative examples) to serve as a teacher signal is generally considered harder to obtain than correct notation data (positive examples).
[0011]
[Problems to be solved by the invention]
Conventionally, because a processing method based on machine learning that uses only positive examples as the teacher signal cannot be expected to achieve high accuracy, and because negative examples to serve as the teacher signal are difficult to obtain, no processing method had been realized that applies machine learning with both positive and negative examples as teacher signals to the detection of notation errors in text.
[0012]
An object of the present invention is to realize a high-precision notation error detection process using a machine learning method using a positive example and a negative example as a teacher signal.
[0013]
Another object of the present invention is to efficiently and automatically generate a negative example as a teacher signal and use it as a teacher signal for a machine learning method to realize a high-precision notation error detection process.
[0014]
[Means for Solving the Problems]
To achieve the above objects, the present invention is a notation error detection processing apparatus using a supervised machine learning method, which uses machine learning processing to detect notation errors in sentence data stored in a computer-readable storage device, and which comprises the following storage means and processing means.
[0015]
The present invention comprises: 1) teacher data storage means storing, as teacher data composed of problem-solution pairs, cases of positive example data, in which the problem is a correctly written character string and the solution is "positive", indicating correct notation, and cases of negative example data, in which the problem is an incorrectly written character string and the solution is "negative", indicating erroneous notation; 2) solution-feature pair extraction means that retrieves the cases from the teacher data storage means and, for each case, extracts predetermined information on concatenation relations from the problem of the case as features and generates a pair of the extracted feature set and the solution; 3) machine learning means that, based on a predetermined machine learning algorithm, learns from the pairs of solutions and feature sets what kind of feature set leads to a positive or a negative solution, and stores that relationship as the learning result in learning result data storage means; 4) feature extraction means that retrieves a character string to be checked from the sentence data stored in the storage device and extracts the predetermined information from it as features by the same extraction process as that performed by the solution-feature pair extraction means; and 5) error detection means that, based on the relationship stored as the learning result in the learning result data storage means, estimates the degree to which the feature set of the character string to be checked is positive or negative and, when the estimated degree of being a negative example is large, detects the character string as a notation error.
[0016]
Alternatively, in addition to the above configuration, the present invention further comprises the following processing means.
[0017]
The present invention further comprises: 6) positive example data storage means that stores, as teacher data composed of problem-solution pairs, cases of positive example data in which the problem is a correctly written character string and the solution is positive; 7) corpus storage means that stores sentence data to which neither a positive nor a negative solution has been assigned; 8) positive example existence determination means that retrieves sentence data from the corpus storage means and determines whether each case taken from the retrieved sentence data exists in the positive example data stored in the positive example data storage means; 9) appearance probability estimation means that, when a case from the sentence data does not exist in the positive example data, calculates with a predetermined formula the probability that the case appears in the teacher data stored in the teacher data storage means; 10) negative example degree calculation means that calculates, from the appearance probability, a negative example degree indicating how strongly the solution of the case tends to be negative; and 11) negative example acquisition means that, when the negative example degree of a case exceeds a predetermined value, assigns a negative solution to the sentence data of the case to generate negative example data and stores the negative example data in the teacher data storage means.
[0018]
The present invention is also a processing method in which a detection processing apparatus, namely a computer equipped with predetermined processing means that detects notation errors in sentence data stored in a computer-readable storage device, detects the notation errors in the sentence data using machine learning processing.
[0019]
The present invention is also a program for causing a computer to function as a processing apparatus that uses machine learning processing to detect notation errors in sentence data stored in a computer-readable storage device; the program causes the computer to function as each processing means of the above notation error detection processing apparatus using the supervised machine learning method.
[0020]
A program that implements each means, function, or element of the present invention on a computer can be stored on a suitable computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk. The program is provided recorded on such a medium, or is provided by transmission and reception over various communication networks via a communication interface.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
In the following, as a first embodiment of the present invention, a process for detecting Japanese notation errors by concatenation will be described.
[0022]
FIG. 1 shows a configuration example of a notation error detection device 1 in the first exemplary embodiment of the present invention.
[0023]
The notation error detection apparatus 1 includes a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result data storage unit 14, a feature extraction unit 15, and an error detection unit 16.
[0024]
The teacher data storage unit 11 is means for storing the data (teacher data) that serves as the teacher signal when the machine learning method is executed. It stores, as teacher data, cases with correct notation (positive examples) and cases with incorrect notation (negative examples). For the positive examples, a corpus, that is, a set of correct sentences, may be used, for example. Because negative examples are erroneous notations for which no general-purpose data exists, examples created manually in advance are used. Alternatively, they may be generated from positive examples using the negative example prediction method described later.
[0025]
The solution-feature pair extraction unit 12 is a unit that extracts a combination of a case solution and a feature set for each case of the teacher data stored in the teacher data storage unit 11.
[0026]
The machine learning unit 13 is means for learning, by a machine learning method, which solution is likely for which features, from the pairs of solutions and feature sets extracted by the solution-feature pair extraction unit 12. The learning result is stored in the learning result data storage unit 14.
[0027]
The feature extraction unit 15 is means for extracting a set of features from the data 2 to be checked for notation errors and passing the extracted feature set to the error detection unit 16.
[0028]
The error detection unit 16 is means for referring to the learning result data in the learning result data storage unit 14, estimating which solution is likely for the feature set passed from the feature extraction unit 15, that is, whether there is a notation error, and outputting the estimation result 3.
[0029]
FIG. 2 shows an example of the data structure of the teacher data storage unit 11. The teacher data storage unit 11 stores teacher data, each item being a pair of a problem and a solution. For example, each gap between adjacent characters of a sentence (denoted by <|>) is treated as a problem, and the teacher data associates it with the solution for the concatenation at that gap (positive or error). Of the teacher data in FIG. 2,
"Problem-Solution: 説明した方法で<|>を用いることができる - error"
is an example of negative example data E, and
"Problem-Solution: 説明した方法<|>でを用いることができる - positive"
is an example of positive example data D.
[0030]
FIG. 3 shows a process flowchart of the notation error detection process. It is assumed that positive example data D and negative example data E are stored in the teacher data storage unit 11 before the notation error detection process.
[0031]
First, the solution-feature pair extraction unit 12 extracts, for each case in the teacher data storage unit 11, a pair of a solution and a feature set (step S1). A feature is one fine-grained unit of information used in the analysis. The following are extracted as features for each inter-character gap whose concatenation is to be judged.
[0032]
- The 1- to 5-gram character strings immediately preceding and immediately following the gap
- The 1- to 5-gram character strings that include the target gap (the gap <|> is treated as one character)
- The preceding and following words (words are extracted using, for example, existing processing means that performs morphological analysis, not shown in FIG. 1)
- The parts of speech of the preceding and following words
For example, when the "problem-solution" pair is
"説明した方法で<|>を用いることができる - error"
the features shown in FIG. 4 are extracted. That is, the following features are extracted.
[0033]
Features: preceding "した方法で", preceding "た方法で", preceding "方法で", preceding "法で", preceding "で", following "を用いるこ", following "を用いる", following "を用い", following "を用", following "を", "た方法で<|>", "方法で<|>を", "法で<|>を用", "で<|>を用い", "<|>を用いる", preceding word "で", following word "を", preceding part of speech "particle", following part of speech "particle"
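The character-level part of this feature extraction can be sketched as follows. It is a minimal Python sketch under stated assumptions: the function name `gap_features` and the `prev:` / `next:` / `span:` prefixes are hypothetical labels introduced here, the gap-spanning strings are generated for every length from 2 to 5 (the example above lists only the 5-character ones), and the word and part-of-speech features are omitted because they require a morphological analyzer.

```python
GAP = "<|>"  # the gap marker, counted as one character


def gap_features(left, right, max_n=5):
    """Character n-gram features for the gap between `left` and `right`."""
    feats = []
    # 1- to max_n-gram strings immediately before and after the gap.
    for n in range(1, max_n + 1):
        if len(left) >= n:
            feats.append("prev:" + left[-n:])
        if len(right) >= n:
            feats.append("next:" + right[:n])
    # Strings of total length n that contain the gap: k characters from the
    # left context, the gap itself, and n - 1 - k characters from the right.
    for n in range(2, max_n + 1):
        for k in range(n):
            j = n - 1 - k
            if k <= len(left) and j <= len(right):
                before = left[-k:] if k else ""
                feats.append("span:" + before + GAP + right[:j])
    return feats


feats = gap_features("説明した方法で", "を用いることができる")
for f in feats:
    print(f)
```

Calling the function once per gap of a sentence yields one feature set per problem, which can then be paired with that gap's solution.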
Next, the machine learning unit 13 learns, from the extracted pairs of solutions and feature sets, which solution is likely for which features, and stores the learning result in the learning result data storage unit 14 (step S2).
[0034]
As a machine learning method, for example, a decision list method, a maximum entropy method, a support vector machine method, or the like is used.
[0035]
In the decision list method, pairs of a feature and a classification are treated as rules and stored in a list in a predetermined priority order. When an input to be checked is given, the input data is compared with the feature of each rule in descending order of priority, and the classification of the first rule whose feature matches is taken as the classification of the input.
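A decision list of this kind can be sketched in a few lines. The Python below is an illustrative sketch, not the procedure of Reference 8: the names `learn_decision_list` and `classify` are hypothetical, and ordering the rules by the estimated probability of the majority label (ties broken by the amount of evidence, then by feature name) is an assumed priority scheme, since the text only says the order is predetermined.

```python
from collections import Counter, defaultdict


def learn_decision_list(cases):
    """cases: list of (feature_set, label). Returns rules in priority order."""
    counts = defaultdict(Counter)
    for feats, label in cases:
        for f in feats:
            counts[f][label] += 1
    rules = []
    for f, c in counts.items():
        label, hits = c.most_common(1)[0]   # majority label for this feature
        total = sum(c.values())
        rules.append((hits / total, total, f, label))
    # Assumed priority: higher confidence first, then more evidence.
    rules.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [(f, label) for _, _, f, label in rules]


def classify(rules, feats, default="positive"):
    """Return the label of the first rule whose feature occurs in the input."""
    for f, label in rules:
        if f in feats:
            return label
    return default


cases = [
    ({"prev:で", "next:を"}, "error"),
    ({"prev:法", "next:で"}, "positive"),
    ({"prev:法", "next:を"}, "positive"),
]
rules = learn_decision_list(cases)
print(classify(rules, {"prev:で", "next:を"}))
```

Only the single highest-priority matching rule decides the outcome, which is what distinguishes a decision list from methods that combine all features.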
[0036]
In the maximum entropy method, letting F be the set of preset features f_j (1 ≤ j ≤ k), the probability distribution p(a, b) that maximizes an expression representing entropy while satisfying predetermined constraint expressions is obtained, and among the probabilities of the classifications obtained from that distribution, the classification with the largest probability is taken as the result.
[0037]
The support vector machine method is a method of classifying data composed of two classifications by dividing a space by a hyperplane.
[0038]
The decision list method and the maximum entropy method are described in Reference Document 8 below, and the support vector machine method is described in Reference Document 9 and Reference Document 10 below.
[Reference 8: Masaki Murata, Masao Uchiyama, Kiyotaka Uchimoto, Qing Ma, Hitoshi Isahara, Word sense disambiguation experiments using various machine learning methods, IEICE Technical Group on Language Understanding and Communication, NCL2001-2, (2001)]
[Reference 9: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000)]
[Reference 10: Taku Kudoh, Tinysvm: Support Vector machines, (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000)]
The machine learning unit 13 is not limited to the above method, and any method can be used as long as it is a supervised machine learning method.
[0039]
Thereafter, the feature extraction unit 15 inputs the data 2 for which a solution is to be obtained (step S3), takes out a set of features from the data 2 and performs error detection in the same manner as the processing in the solution-feature pair extraction unit 12 It passes to the part 16 (step S4).
[0040]
Based on the learning result data in the learning result data storage unit 14, the error detection unit 16 identifies which solution is most likely for the set of features passed to it, and outputs the identified solution, that is, the estimation result 3 indicating whether or not there is a notation error (step S5).
[0041]
For example, when the problem to be analyzed is the connection at the gap <|> and the data 2 is “can use <|> by the described method”, the estimation result 3 “error” is output.
[0042]
Next, a second embodiment of the present invention will be described.
[0043]
Positive example data D in the teacher data storage unit 11 can be acquired relatively easily because a corpus or the like can be used. Negative example data E, however, cannot be acquired easily and has to be created manually, and the burden of that work is large.
[0044]
Moreover, since processing accuracy improves when the amount of teacher data is large, it is desirable to prepare as much teacher data as possible.
[0045]
Therefore, a method for predicting negative example data from a large amount of positive example data is considered.
[0046]
As a simple method of predicting negative examples from positive examples, one could treat all data that does not appear in the known positive example data as negative examples. However, since there may in fact be positive examples that have simply not yet appeared, such a simple method would judge many unseen positive examples to be negative examples. Negative examples generated in this way cannot be used for high-precision processing.
[0047]
For example, if a large existing corpus (a set of Japanese sentences) is assumed to be entirely correct, the corpus can be regarded as correct sentences (positive examples), and notation errors (negative examples) can be generated automatically by a method of predicting negative examples.
[0048]
As a result, negative examples of teacher data are abundant, the burden of generation work is reduced, and high-precision notation error detection processing using a machine learning method with teacher data can be realized.
[0049]
The notation error detection apparatus 1 in this embodiment first calculates the general appearance probability p(x) of an unknown case x that is to be judged a positive or negative example. Next, it considers whether the absence of the case x from the known positive example data D is unnatural. If the general appearance probability is high, so that the case x would naturally be expected to appear in the positive example data D, and yet it does not appear there, the degree to which the case x is a negative example is taken to be high. Cases x whose negative example degree exceeds a predetermined value are treated as negative example data E. Notation error detection processing is then performed by a machine learning method that uses this negative example data E together with the positive example data D as teacher signals.
[0050]
FIG. 5 shows a configuration example of the notation error detection device 1 in the second exemplary embodiment of the present invention.
[0051]
The notation error detection apparatus 1 includes a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a feature extraction unit 15, an error detection unit 16, an existence determination unit 21, an appearance probability estimation unit 22, a negative example degree calculation unit 23, a negative example acquisition unit 24, and a positive example data storage unit 25.
[0052]
The teacher data storage unit 11, the solution-feature pair extraction unit 12, the machine learning unit 13, the feature extraction unit 15, and the error detection unit 16 are the same means as the corresponding means of the notation error detection device 1 described in the first embodiment, so their description is omitted (see FIG. 1).
[0053]
The existence determination unit 21 is means for determining whether a case x from the corpus 20, which is a set of Japanese sentences to which no positive or negative example information has been assigned, exists in the positive example data D stored in the positive example data storage unit 25.
[0054]
The appearance probability estimation unit 22 is means for calculating the general appearance probability (frequency) p(x) of the case x when the case x does not exist in the positive example data storage unit 25.
[0055]
The negative example degree calculation unit 23 is means for calculating the negative example degree Q(x) of the case x based on the appearance probability p(x).
[0056]
The negative example acquisition unit 24 is means for treating the case x as negative example data E when the negative example degree Q(x) of the case x received from the negative example degree calculation unit 23 exceeds a predetermined value, and for storing the case x in the teacher data storage unit 11 as teacher data (negative example data E) in problem-solution form.
[0057]
FIG. 6 shows a process flowchart of an acquisition process of negative example data that becomes learning data in the second embodiment.
[0058]
The existence determination unit 21 of the notation error detection apparatus 1 inputs from the corpus 20 a sentence for which it is unknown whether it is a positive or negative example. Moving one character gap at a time from the head of the sentence, it takes each gap as the target of a connection check, extracts the character strings a of 1 to 5 grams immediately before the gap and the character strings b of 1 to 5 grams immediately after the gap, and generates the cases x = (a, b) formed by every pairing of them (step S11). Here, 25 cases (pairs) are generated.
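The case generation of step S11 can be sketched as below. The function name and the handling of gaps near the ends of a sentence (where fewer than five characters are available, so fewer than 25 pairs result) are assumptions; the text does not say how such gaps are treated.

```python
from typing import List, Tuple


def cases_at_gap(sentence: str, gap: int, max_n: int = 5) -> List[Tuple[str, str]]:
    """Pair every 1..max_n-gram ending at the gap with every 1..max_n-gram
    starting at the gap. `gap` is the index of the first character after it."""
    lefts = [sentence[gap - i:gap] for i in range(1, max_n + 1) if gap - i >= 0]
    rights = [sentence[gap:gap + j] for j in range(1, max_n + 1)
              if gap + j <= len(sentence)]
    return [(a, b) for a in lefts for b in rights]


# Placeholder text; any character string is handled the same way.
sentence = "abcdefghijkl"
cases = cases_at_gap(sentence, 6)
print(len(cases))          # 5 left strings x 5 right strings
print(cases[0], cases[-1])
```

Iterating `gap` over `range(1, len(sentence))` visits every character gap in the sentence in order from its head.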
[0059]
Then, it is checked whether the concatenation ab of each of the 25 cases x exists in the positive example data storage unit 25 (step S12). If the concatenation ab does not exist in the positive example data storage unit 25, the case x is passed to the appearance probability estimation unit 22 (step S13).
[0060]
The appearance probability estimation unit 22 estimates the general appearance probability p(x) of the case x (step S14).
[0061]
For example, suppose that the positive example data D in the positive example data storage unit 25 consists of binary relations (a, b), and that the two terms a and b are independent of each other. Then the appearance probability p(x) of the binary relation (a, b) is the product p(a) × p(b), where p(a) and p(b) are the appearance probabilities of a and b in the positive example data storage unit 25. That is, treating each case x as a binary relation (a, b) whose terms are independent, the general appearance probability p(x) of each case x is calculated from the probabilities of its terms a and b.
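A minimal sketch of this estimate follows. It assumes, purely for illustration, that the positive example data is available as a list of (a, b) pairs and that p(a) and p(b) are relative frequencies of the left and right terms among those pairs; the class name and this counting scheme are not taken from the patent.

```python
from collections import Counter
from typing import Iterable, Tuple


class IndependenceModel:
    """Estimates p(x) = p(a) * p(b) from stored positive pairs (a, b)."""

    def __init__(self, positive_pairs: Iterable[Tuple[str, str]]) -> None:
        pairs = list(positive_pairs)
        self.n = len(pairs)                         # number of stored pairs
        self.left = Counter(a for a, _ in pairs)    # counts of left terms
        self.right = Counter(b for _, b in pairs)   # counts of right terms

    def p(self, a: str, b: str) -> float:
        """General appearance probability of (a, b) under independence."""
        if self.n == 0:
            return 0.0
        return (self.left[a] / self.n) * (self.right[b] / self.n)


# Toy data (hypothetical): four stored positive pairs.
model = IndependenceModel([("a", "x"), ("a", "y"), ("b", "x"), ("c", "z")])
print(model.p("a", "x"))   # (2/4) * (2/4)
print(model.p("b", "y"))   # (1/4) * (1/4), although ("b", "y") was never stored
```

The second call shows the point of the independence assumption: a pair that never occurred can still receive a non-zero general appearance probability.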
[0062]
Then, the negative example degree calculation unit 23 uses the appearance probability p(x) of the case x to obtain the probability Q(x) that the case x appears in the positive example data storage unit 25 (step S15).
[0063]
At this time, suppose that the positive example data storage unit 25 holds n items of positive example data D and that they are independent. The probability that the case x does not appear in one trial is 1 − p(x). Since this must happen n times in succession, the probability that the case x does not appear in the positive example data D of the positive example data storage unit 25 is (1 − p(x))ⁿ, and the probability that the case x appears in the positive example data D is Q(x) = 1 − (1 − p(x))ⁿ.
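The formula can be checked numerically with a short sketch. The function name and the sample values of p(x) and n are illustrative assumptions only.

```python
def negative_example_degree(p_x: float, n: int) -> float:
    """Q(x) = 1 - (1 - p(x))**n: the probability that a case with general
    appearance probability p(x) appears at least once in n independent items."""
    if not 0.0 <= p_x <= 1.0:
        raise ValueError("p_x must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p_x) ** n


# With n = 1000 items, a case with p(x) = 0.001 would be expected to appear
# with probability of roughly 0.63, while one with p(x) = 0.000001 would
# appear with probability of only about 0.001.
print(negative_example_degree(0.001, 1000))
print(negative_example_degree(0.000001, 1000))
```

Q(x) grows with both p(x) and n, so the same unseen case looks more suspicious as the positive example data gets larger.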
[0064]
Incidentally, “the probability Q(x) is small” means that the case x has a low probability of appearing in the positive example data D of the positive example data storage unit 25. In other words, because the corpus is small, it is probabilistically to be expected that the case x does not appear, and this means that “the case x may still be a positive example”.
[0065]
Conversely, “the probability Q(x) is large” means that the case x has a high probability of appearing in the positive example data D, so that it should by rights have appeared in the corpus. Since it in fact did not appear, a contradiction arises. This contradiction negates either the general appearance probability p(x) or the various independence assumptions.
[0066]
Here, we newly assume that “if the case x is a positive example, the general appearance probability p(x) and the various independence assumptions are correct”. Under this assumption, the contradiction leads to the conclusion that “the case x cannot be a positive example”. That is, “the probability Q(x) that the case x appears in the positive example data D” means “the probability Q(x) that the case x cannot be a positive example”. In that sense, Q(x) represents a negative example degree. Therefore, Q(x) is called the “negative example degree”, and the larger Q(x) is for a case x, the greater the degree to which the case x is a negative example.
[0067]
Then, the negative example acquisition unit 24 finds the largest value of Q(x), denoted Q_max, and the case x that attains it, denoted x_max. It judges that the larger Q(x_max) is for a gap, the more likely the gap is an invalid connection, and when Q(x_max) is larger than a predetermined value, the gap is stored in the teacher data storage unit 11 as negative example data E (step S16). The negative example degree Q(x_max) may be stored in the teacher data storage unit 11 together with the negative example data E.
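The selection in step S16 can be sketched as follows. The dictionary of precomputed Q(x) values, the function name, and the threshold values are hypothetical; the text only speaks of “a predetermined value”.

```python
from typing import Dict, Optional, Tuple

Case = Tuple[str, str]


def judge_gap(degrees: Dict[Case, float],
              threshold: float) -> Optional[Tuple[Case, float]]:
    """Given Q(x) for the cases at one gap that were absent from the positive
    example data, return (x_max, Q(x_max)) if the gap should be stored as
    negative example data, or None otherwise."""
    if not degrees:
        return None                      # every case was found: nothing to judge
    x_max = max(degrees, key=lambda x: degrees[x])
    q_max = degrees[x_max]
    return (x_max, q_max) if q_max > threshold else None


# Hypothetical Q(x) values for three unseen cases at the same gap.
degrees = {("a", "b"): 0.2, ("ca", "b"): 0.95, ("a", "bd"): 0.5}
print(judge_gap(degrees, threshold=0.9))
print(judge_gap(degrees, threshold=0.99))
```

Returning Q(x_max) alongside x_max makes it possible to store the negative example degree with the negative example data, as the paragraph allows.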
[0068]
By performing the processes of steps S11 to S15 for all the gaps in the sentence, the negative example data E is acquired using the frequency information of the positive example data D in the positive example data storage unit 25. The positive example data D and the negative example data E can be prepared in the teacher data storage unit 11 as teacher data.
[0069]
Since the subsequent processing is the same as the error detection processing described in the first embodiment, description thereof is omitted.
[0070]
Although the present invention has been described above by way of its embodiments, various modifications are possible within the scope of its gist.
[0071]
For example, the appearance probability estimation unit 22 of the notation error detection device 1 may calculate the general appearance probability p(x) of the case x by any method, and is not limited to the method described in the embodiments of the present invention.
[0072]
Moreover, the positive example data D stored in the positive example data storage unit 25 can be used as the positive example data D in the teacher data storage unit 11, or separately prepared positive example data can be used instead.
[0073]
[Effect of the invention]
As described above, the present invention performs notation error detection processing using a machine learning method in which a positive example and a negative example are used as teacher signals. The present invention, which also uses information on negative examples, can obtain processing results with much higher accuracy than a processing method using only positive examples.
[0074]
In addition, the present invention extracts negative examples from positive examples using the frequency information of the positive examples, and uses the extracted negative examples as teacher signals for the machine learning method. Because it uses negative example information extracted automatically from positive examples, the present invention can reduce the burden of generating negative examples in cases such as notation error detection, where positive examples exist but negative examples are difficult to obtain.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of a notation error detection device according to a first embodiment.
FIG. 2 is a diagram illustrating a configuration example of a teacher data storage unit.
FIG. 3 is a process flowchart of a notation error detection process.
FIG. 4 is a diagram illustrating an example of features.
FIG. 5 is a diagram illustrating a configuration example of a notation error detection device according to a second embodiment.
FIG. 6 is a process flowchart of a negative example data acquisition process.
FIG. 7 is a diagram for supplementarily explaining a conventional method.
[Explanation of symbols]
1 Notation error detection device
2 Data
3 Estimation result
11 Teacher data storage unit
12 Solution-feature pair extraction unit
13 Machine learning unit
14 Learning result data storage unit
15 Feature extraction unit
16 Error detection unit
20 Corpus
21 Existence determination unit
22 Appearance probability estimation unit
23 Negative example degree calculation unit
24 Negative example acquisition unit
25 Positive example data storage unit

Claims

A notation error detection processing apparatus using a supervised machine learning method, which detects notation errors in sentence data stored in a computer-readable storage device using machine learning processing, the apparatus comprising:
teacher data storage means storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution indicates correct notation, and cases of negative example data in which the problem is an incorrectly written character string and the solution indicates erroneous notation;
solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting, for each case, predetermined information on connection relations from the problem of the case as features, and generating a pair of the set of extracted features and the solution;
machine learning means for performing machine learning processing, based on a predetermined machine learning algorithm, on the pairs of a solution and a set of features to learn for what kind of feature set the solution is a positive example or a negative example, and storing in learning result data storage means, as a learning result, for what kind of feature set the solution is a positive example or a negative example;
feature extraction means for retrieving a detection target character string from the sentence data stored in the storage device, and extracting the predetermined information as features from the detection target character string by an extraction process similar to that performed by the solution-feature pair extraction means; and
error detection means for estimating, based on the learning result stored in the learning result data storage means as to for what kind of feature set the solution is a positive example or a negative example, the degree to which the solution is a positive example or a negative example for the set of features of the detection target character string, and detecting the detection target character string as a notation error when, as the estimation result, the degree of being a negative example is large.

The notation error detection processing apparatus using the supervised machine learning method according to claim 1, further comprising:
positive example data storage means for storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution is a positive example;
corpus storage means for storing sentence data to which neither a positive example solution nor a negative example solution has been assigned;
positive example existence determination means for retrieving sentence data from the corpus storage means and determining whether a case extracted from the retrieved sentence data exists in the positive example data stored in the positive example data storage means;
appearance probability estimation means for calculating, using a predetermined formula, when the case of the sentence data does not exist in the positive example data, an appearance probability with which the case of the sentence data appears in the teacher data stored in the teacher data storage means;
negative example degree calculation means for calculating, for the case of the sentence data, a negative example degree indicating the tendency of the solution of the case to be a negative example; and
negative example acquisition means for generating, when the negative example degree of the case of the sentence data exceeds a predetermined value, negative example data by assigning a negative example solution to the sentence data of the case, and storing the negative example data in the teacher data storage means.

A notation error detection processing method using a supervised machine learning method, for detecting notation errors in sentence data stored in a computer-readable storage device using machine learning processing, the method being performed by a detection processing device that is a computer comprising teacher data acquisition means, solution-feature pair extraction means, machine learning means, feature extraction means, and error detection means, the method comprising:
a teacher data acquisition processing step in which the teacher data acquisition means accesses teacher data storage means storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution indicates correct notation, and cases of negative example data in which the problem is an incorrectly written character string and the solution indicates erroneous notation, and retrieves the teacher data from the teacher data storage means;
a solution-feature pair extraction processing step in which the solution-feature pair extraction means retrieves the cases from the teacher data storage means, extracts, for each case, predetermined information on connection relations from the problem of the case as features, and generates a pair of the set of features extracted from the case and the solution;
a machine learning processing step in which the machine learning means performs machine learning processing, based on a predetermined machine learning algorithm, on the pairs of a solution and a set of features to learn for what kind of feature set the solution is a positive example or a negative example, and stores in learning result data storage means, as a learning result, for what kind of feature set the solution is a positive example or a negative example;
a feature extraction processing step in which the feature extraction means retrieves a detection target character string from the sentence data stored in the storage device, and extracts the predetermined information as features from the detection target character string by an extraction process similar to that of the solution-feature pair extraction processing step; and
an error detection processing step in which the error detection means estimates, based on the learning result stored in the learning result data storage means as to for what kind of feature set the solution is a positive example or a negative example, the degree to which the solution is a positive example or a negative example for the set of features of the detection target character string, and detects the detection target character string as a notation error when, as the estimation result, the degree of being a negative example is large.

The notation error detection processing method using the supervised machine learning method according to claim 3, wherein
the detection processing device further comprises positive example data reference means, sentence data retrieval means, positive example existence determination means, appearance probability estimation means, negative example degree calculation means, and negative example acquisition means, and the method further comprises:
a processing step in which the positive example data reference means accesses positive example data storage means storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution is a positive example, and refers to the positive example data;
a sentence data retrieval processing step in which the sentence data retrieval means accesses corpus storage means storing sentence data to which neither a positive example solution nor a negative example solution has been assigned, and retrieves sentence data from the corpus storage means;
a positive example existence determination processing step in which the positive example existence determination means determines whether a case extracted from the retrieved sentence data exists in the positive example data stored in the positive example data storage means;
an appearance probability estimation processing step in which the appearance probability estimation means calculates, using a predetermined formula, when the case of the sentence data does not exist in the positive example data, an appearance probability with which the case of the sentence data appears in the teacher data stored in the teacher data storage means;
a negative example degree calculation processing step in which the negative example degree calculation means calculates, for the case of the sentence data and based on the appearance probability, a negative example degree indicating the tendency of the solution of the case to be a negative example; and
a negative example acquisition processing step in which the negative example acquisition means generates, when the negative example degree of the case of the sentence data exceeds a predetermined value, negative example data by assigning a negative example solution to the sentence data of the case, and stores the negative example data in the teacher data storage means.

A notation error detection processing program using a supervised machine learning method, for causing a computer to function as a processing device that detects notation errors in sentence data stored in a computer-readable storage device using machine learning processing, the processing device comprising:
teacher data storage means storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution indicates correct notation, and cases of negative example data in which the problem is an incorrectly written character string and the solution indicates erroneous notation;
solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting, for each case, predetermined information on connection relations from the problem of the case as features, and generating a pair of the set of extracted features and the solution;
machine learning means for performing machine learning processing, based on a predetermined machine learning algorithm, on the pairs of a solution and a set of features to learn for what kind of feature set the solution is a positive example or a negative example, and storing in learning result data storage means, as a learning result, for what kind of feature set the solution is a positive example or a negative example;
feature extraction means for retrieving a detection target character string from the sentence data stored in the storage device, and extracting the predetermined information as features from the detection target character string by an extraction process similar to that performed by the solution-feature pair extraction means; and
error detection means for estimating, based on the learning result stored in the learning result data storage means as to for what kind of feature set the solution is a positive example or a negative example, the degree to which the solution is a positive example or a negative example for the set of features of the detection target character string, and detecting the detection target character string as a notation error when, as the estimation result, the degree of being a negative example is large.


The notation error detection processing program using the supervised machine learning method according to claim 5, wherein the program causes the computer to function as a processing device further comprising:
positive example data storage means for storing, as teacher data each consisting of a pair of a problem and a solution, cases of positive example data in which the problem is a correctly written character string and the solution is a positive example;
corpus storage means for storing sentence data to which neither a positive example solution nor a negative example solution has been assigned;
positive example existence determination means for retrieving sentence data from the corpus storage means and determining whether a case extracted from the retrieved sentence data exists in the positive example data stored in the positive example data storage means;
appearance probability estimation means for calculating, using a predetermined formula, when the case of the sentence data does not exist in the positive example data, an appearance probability with which the case of the sentence data appears in the teacher data stored in the teacher data storage means;
negative example degree calculation means for calculating, for the case of the sentence data, a negative example degree indicating the tendency of the solution of the case to be a negative example; and
negative example acquisition means for generating, when the negative example degree of the case of the sentence data exceeds a predetermined value, negative example data by assigning a negative example solution to the sentence data of the case, and storing the negative example data in the teacher data storage means.
