JP2004246546A

JP2004246546A - Image processing method, program used for execution of method, and image processing apparatus

Info

Publication number: JP2004246546A
Application number: JP2003034758A
Authority: JP
Inventors: Yoshihisa Oguro; 慶久大黒
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-02-13
Filing date: 2003-02-13
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To simplify the procedure for determining the language category of a cut-out character line and to prevent an increase in calculation load.
SOLUTION: The procedure of FIG. 8 is applied to each character line cut out of a document image read from an original, and the line is determined to be either a Western (European and American) line or an Asian line. A regression line (the baseline) is calculated from the end points of the character circumscribed rectangles in the line, the height from the baseline to the start point of each rectangle is obtained (skew compensation), and the frequencies of the heights are tallied. Entries at or below half of the estimated line height are removed from the tally and are not used for language determination. The remaining tally is sorted by frequency, and it is judged whether the conditions characteristic of a Western character line are satisfied: the height with the highest frequency (lowercase) < the height with the second-highest frequency (uppercase), and the third-highest frequency ≪ the second-highest frequency, that is, the frequencies concentrate at the first (lowercase) and second (uppercase) heights. The line is then determined to be a Western or an Asian character line accordingly.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ＯＣＲ（光学的文字読み取り装置）や文字認識等に利用される画像処理に関し、より特定すると、画像入力された文書に使用されている言語を判定（分類）するための処理手順を備えた画像処理方法、該方法の実行に用いるプログラム及び画像処理装置に関する。
【０００２】
【従来の技術】
文書原稿を対象にした画像処理として、従来から原稿をスキャナーで読み取った画像情報を基に、その文書に記された文字のＯＣＲや文字認識処理が行われている。
この文字認識処理では、通常、文書画像から一つの文字が存在すると判断される画像領域を切り出し、切り出した画像の形状特徴から文字を認識するという処理手順を踏む。また、こうした処理を行う際に、効率的な処理により高い認識精度を得るといったことから、切り出された所定の文字領域（例えば、文字行）の言語の種類を判定・分類し、言語の分類（例えば、日本語と英語）に適応した処理方法を用いるようにしている。
文字認識処理に付随して行われるこのような言語の判定方法について提案した従来例として、下記特許文献１乃至３を挙げることができる。
特許文献１，２には、日本語行とアルファベット（英語）行とを分類する方法として、いずれも、行方向と垂直の方向に画像を走査し、白／黒の反転回数或いは黒のラン（連続した黒画素）の本数を計数し、それを文字の複雑さを表す指標とし、より複雑な日本語を区別するための判定基準に利用するものである。
また、特許文献３には、文字領域を日本語と英語の領域とに判定する方法として、領域内の文字高さに着目した方法が示され、ここでは、各文字の外接矩形における矩形高さのばらつき、即ち、英文では、大文字と小文字の高さに集中して矩形高さのばらつきが生じるという特徴を判定基準に利用するものである。なお、英字に特徴的な文字外接矩形における矩形高さのばらつきに着目した判定基準を補助的に用いて英字の判定精度を向上させるという方法については、下記特許文献２にも示されている。
【０００３】
【特許文献１】
特開平５−１０８８７６号公報
【特許文献２】
特開平９−２３１３１７号公報
【特許文献３】
特開平６−１５００５５号公報
【０００４】
【発明が解決しようとする課題】
しかしながら、特許文献１，２に示されている方法は、行切り出しを行った後、行全体にわたって行方向一定間隔に、行方向と垂直に画像を走査し、白／黒の反転回数を計数する必要があり、処理の負荷が増大する。
また、特許文献３に示されている方法は、外接矩形における矩形高さに着目しており、この矩形高さを求めるために、外接矩形の切り出しで得られる座標を基にさらに演算を行う必要があり、この矩形高さを算出する追加の演算により、処理負荷が増大する。
本発明は、文書原稿から読み取った画像情報を基にその文書に記された文字のＯＣＲや文字認識処理を行う際に、処理対象の文字領域における言語の種類を判定・分類するために用いられた上記従来技術の問題点に鑑みてなされたものであり、その目的は、切り出された所定の文字領域（例えば、文字行）の言語の種類を判定・分類するための処理手順を簡略化し、計算負荷の増大を招くことのない処理手順とすることにある。
また、本発明は、上記技術課題
（目的）を解決するために、行切り出しにより得られる文字行内に存在する矩形の始点に着目し、その高さの頻度を集計するという方法を用いるようにしたが、この方法を採用した場合に誤差の要因となるスキュー（原稿の傾き）に対応する手順を用意して、判定・分類精度を保つことを可能にすることを更なる目的とする。
【０００５】
【課題を解決するための手段】
請求項１の発明は、処理対象となる文書画像から文字画像の外接矩形情報を生成するステップと、前記外接矩形情報に基づいて文字行を切り出すステップと、前記文字行内の文字外接矩形を対象にして外接矩形の始点（文字最前部・文字頂部の座標）の文字行内における高さに関する情報を集計する始点高さ情報集計ステップと、集計した高さ情報の分布状態を言語種類の属性を示す所定の基準分布状態と対比することにより文字行の属する言語を判定する言語判定ステップを実行することを特徴とする画像処理方法である。
請求項２の発明は、請求項１に記載された画像処理方法において、前記始点高さ情報集計ステップは、文字行内の文字外接矩形の内、始点の高さが雑音除去用として設定された所定値以上の矩形を集計の対象として外接矩形の始点の文字行内における高さに関する情報を集計するようにしたことを特徴とする方法である。
【０００６】
請求項３の発明は、請求項１又は２に記載された画像処理方法において、前記文字行内の文字外接矩形の終点（文字最後部・文字最下部の座標）を結ぶ直線を求めるスキュー直線算出ステップと、前記始点高さ情報集計ステップにおける文字行内における高さをスキュー直線からの高さとして算出するステップを実行するようにしたことを特徴とする方法である。
請求項４の発明は、請求項３に記載された画像処理方法において、前記スキュー直線算出ステップは、文字行内の文字外接矩形の内、終点の高さが雑音除去用として設定された所定値以下の矩形を対象として外接矩形の終点を結ぶ直線を求めるようにしたことを特徴とする方法である。
【０００７】
請求項５の発明は、請求項１乃至４の何れかに記載された画像処理方法において、前記言語判定ステップを実行することにより判定された各文字行の判定結果を処理対象とした文書単位で集計し、集計結果を所定の規則に従い一言語又は不明を判定するステップを実行するようにしたことを特徴とする方法。
【０００８】
請求項６の発明は、請求項１乃至５の何れかに記載された画像処理方法において、前記文字行を切り出すステップにより切り出された文字行内の文字外接矩形の数に基づいて当該行の言語を判定するためのステップを実行するか否かを決定するステップを実行するようにしたことを特徴とする方法である。
請求項７の発明は、請求項１乃至６の何れかに記載された画像処理方法において、前記文字行を切り出すステップにより切り出された文字行の縦横比に基づいて当該行の言語を判定するためのステップを実行するか否かを決定するステップを実行するようにしたことを特徴とする画像処理方法である。
請求項８の発明は、請求項１乃至７の何れかに記載された画像処理方法において、前記文字行を切り出すステップにより切り出された文字行の方向に基づいて当該行の言語を判定するためのステップを実行するか否かを決定するステップを実行するようにしたことを特徴とする方法である。
【０００９】
請求項９の発明は、請求項１乃至８のいずれかに記載された画像処理方法の各処理ステップをコンピュータに実行させるためのプログラムである。
請求項１０の発明は、請求項９に記載されたプログラムを搭載したコンピュータを備え、該コンピュータを前記プログラムに基づき対象画像を処理する手段として機能させるようにしたことを特徴とする画像処理装置である。
請求項１１の発明は、請求項９に記載されたプログラムを記録した記録媒体である。
【００１０】
【発明の実施の形態】
本発明を添付する図面とともに示す以下の実施形態に基づき説明する。
本発明は、文書原稿から読み取った画像情報を基にその文書に記された文字の言語の種類を判定・分類する画像処理方法に係わるものである。文書に記された文字の言語種類を判定するねらいは、その判定結果を、文字の認識処理を行う際に効率的な処理を可能とし、かつ高い認識精度を得るために用いることにある。このために、文字認識処理が踏む、文書画像から文字が存在すると判断される文字領域を切り出し、切り出した文字領域の画像の形状特徴から文字を認識するという処理手順の中間過程に用いることが可能な形態で、言語が持つ特徴に着目した言語種類の判定処理を行う。ここでは、文書画像から文字領域として切り出した文字外接矩形に表れる言語の特徴をとらえることにより、この判定処理を行うが、本発明では、この処理を文字行の切り出し処理で既に求めている文字行内の文字外接矩形を対象にして外接矩形の始点（文字最前部・文字頂部の座標）の文字行内における高さに関する情報を用いて、当該行の属する言語を判定するという方法による。この方法を用いることにより、切り出された文字行の言語の種類を判定するための処理手順を簡略化し、計算負荷の増大を招くことをなくし、処理の高速化を図る。
以下に本発明の実施形態を示すが、下記実施形態において言語種類の判定処理の対象とする言語と、その言語が記された文書画像について説明する。
図１は、処理対象となる文書画像の一例を示す。図１に示すような原稿から読み取った文書画像をもとにアジア系言語（日本語、中国語、韓国語）と欧米系言語（英語、仏語、独語、伊語、西語など）とに２分する場合を考える。なお、以下に例示する実施形態では、図１に示すような主にアジア系言語として日本語、欧米系言語として英語を例にとるが、特にことわらない限り、本発明は例示する実施形態に限定されるものではなく、いわゆる印欧（インドヨーロッパ）語族と、それ以外とを大別する方法を示すものであり、特定の言語対に限定されない。
【００１１】
「実施形態１」
以下に示す「実施形態１」は、文字外接矩形の始点（後述する図４（Ａ）、参照）の文字行内における高さを用いて、当該行の属する言語を判定する方法の基本形態を示す。
本発明では、文字外接矩形に表れる言語の特徴をとらえるので、言語判定の前段において、文字外接矩形を生成（抽出）する処理を行う。この処理は既存の処理方法である、原稿画像中の黒ランの外接矩形を求める方法を適用することにより実施可能である。なお「ラン」は、連続画素データが同一値をとる場合に、この連続画素のかたまりを指す概念で、符号化の単位として扱われることでも知られている。この処理によって対象画像（図１）から求められた黒画素の外接矩形を図２に示す。
また、求めた黒ランの外接矩形を近隣の矩形と統合していくことによって、行に成長させ、文字行を切り出す、既存の処理を適用する。この処理により、対象画像（図１）から求められた文字行を図３に示す。
【００１２】
上記の文字外接矩形生成処理により得られた外接矩形に表れる言語の特徴をとらえるが、本発明では、文字外接矩形の配置、つまり文字行に属する外接矩形（行内矩形）の配置に表れる特徴をとらえる。文字行内の文字外接矩形の配置は、外接矩形の座標を定義することにより表現できる。図４（Ａ）は、矩形座標の定義を図示するもので、同図における矩形の左上の隅を始点、右下の隅を終点と称し、それぞれの座標を（Ｘｓ，Ｙｓ）、（Ｘｅ，Ｙｅ）とする。矩形はこの２点の座標で定義される。ここでは、始点のＸｓ，Ｙｓ座標は、それぞれ矩形により外接される文字の最前部、頂部の座標を表し、終点のＸｅ，Ｙｅ座標は、それぞれ矩形により外接される文字の最後部、最下部の座標を表している。
図４（Ｂ），（Ｃ）は、それぞれ欧米系文字行とアジア系文字行の、行内矩形の配置例を示す。図４（Ｂ）に示すように、欧米系文字行は、大文字と小文字とが混在していることに加え、アポストロフィー、アクサンテギュ、ウムラウトなど、記号類の有無が存在するので、行内矩形の始点の高さは、図４（Ｂ）のａの位置とｂの位置との２個所に集中することになるので、この点を欧米系文字の配置特徴としてとらえ、この高さ情報を欧米系言語の判定に用いるようにする。
一方、アジア系文字行は図４（Ｃ）のように、漢字、ひらがな、カタカナ、ハングルなど、文字の構造が複雑であり、行内矩形の始点の高さは、欧米系文字行で見られるような、２カ所への明確な集中はない。
従って、両者を区別するには、注目行において、行内矩形の始点の高さについてその出現頻度を集計するという方法により、文字行における外接矩形の配置に表れる言語の特徴をより精度良く検出することができる。行内矩形は、２点の座標（Ｘｓ，Ｙｓ）、（Ｘｅ，Ｙｅ）の形で，行切り出し処理の過程で既に求まっているので、追加の特徴抽出処理を行う必要がないので処理負担を増加することがなく、都合がよい。
【００１３】
ここで、本実施形態の文字行が属する言語の判定処理手順について示す。
図５は，本実施形態の処理手順を示すフローチャートである。
図５を参照すると、先ず、スキャナー、デジタル・スチル・カメラなどの画像入力機器によって、処理対象の文書画像原稿の読み取り、２値化を含めた画像処理等の画像入力処理を行う（ｓｔｅｐ１）。この入力処理において、原稿文書画像の黒ランの生成処理を行う。
次いで、生成された文書画像の黒ランに基づいて、黒ランの外接矩形を求める（ｓｔｅｐ２）。ここで求められる黒ランの外接矩形には、文字以外の図表等によるものも含まれている。
そこで、求めた黒ランの外接矩形から文字と見なせる矩形を抽出する処理を行い、抽出した文字と見なせる矩形同士で近隣の矩形と統合する処理を行い、文字行の切り出しを行う（ｓｔｅｐ３）。
次いで，行内矩形の配置に表れる言語の特徴を抽出するために、切り出された文字行内に存在する対象外接矩形の始点の高さ（図４（Ａ）参照）情報を取得し、欧米系文字の配置特徴として表れる２箇所の集中位置（図４（Ｂ）参照）における出現頻度を集計して、予め定めておいた出現頻度よりも高くなった場合に欧米系文字行であると判定し，それ以外を非欧米系文字行或いはアジア系文字行であると判定する（ｓｔｅｐ４）。
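[0014]
"Embodiment 2"
Embodiment 1 above showed the basic form of the method for determining the language to which a character line belongs. Applying Embodiment 1 as it is, however, can lead to misjudgment in some cases, so this embodiment adds a processing step that compensates for them.
On a table-of-contents page, for example, the blank part of a line may be filled with dots regardless of the language. FIG. 6 shows such a case, in which the dots sit at the midpoint of the character line height. Here the most frequent position of the circumscribed rectangle start points (see FIG. 4(A)) is decided by the dots alone, and the language-specific features described in Embodiment 1 do not appear.
The most frequent positions differ between languages only within the range a-b in FIG. 4(B). If rectangles whose start point height lies outside this range are excluded from the tally, the language can be identified correctly even in a case like FIG. 6.
The range a-b in FIG. 4(B) is the range at or above about 2/3 of the in-line rectangle height, so when the most frequent value is obtained from the tally of rectangle start point positions, start points below 2/3 of the line height can be excluded.
The threshold of 2/3 of the line height is only an example; needless to say, it can be adjusted so that only the range a-b in FIG. 4(B), where the language features can be clearly distinguished, is tallied.
The step that performs this compensation for table-of-contents pages and the like is carried out within the line language determination step (step 4) of the procedure for determining the language of a character line shown in Embodiment 1 (FIG. 5).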
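The exclusion rule of Embodiment 2 can be sketched as a small filter. This is an illustration only: the function name `filter_upper_starts`, the parameter `min_fraction`, and the sample heights are assumptions, and heights are taken to be measured upward from the bottom of the line.

```python
def filter_upper_starts(start_heights, line_height, min_fraction=2 / 3):
    """Keep only start-point heights at or above min_fraction of the line height.

    The 2/3 default mirrors the example threshold in the text; it is adjustable.
    """
    cutoff = line_height * min_fraction
    return [h for h in start_heights if h >= cutoff]


# Hypothetical table-of-contents line: a few letters followed by leader dots
# whose top edges sit at the middle of a 30-pixel-high line.
heights = [30, 21, 21, 22, 30] + [15] * 20
kept = filter_upper_starts(heights, line_height=30)
```

Without the filter the 20 dots would dominate the tally; with it only the five letter rectangles remain.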
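[0015]
"Embodiment 3"
Skew can prevent a correct determination, so this embodiment applies skew correction to avoid misjudgment. The form shown here can be applied to Embodiments 1 and 2.
When an input device reads the document original to be processed, the original may be tilted. FIG. 7 shows an example of an input document image with skew. As FIG. 7 shows, skew tilts the character lines as well. With an extreme tilt, lines cannot be cut out, because the procedure is given conditions under which line segmentation does not succeed in that case; with a slight tilt, lines can still be cut out by using the blank space between lines.
In the method of Embodiment 1, however, attention is paid to the frequency of the start point heights of the in-line rectangles, so even a slight tilt strongly affects the tally. That is, the distances from position e in FIG. 7 (the baseline that serves as the reference for the start point heights of the in-line rectangles) to the start points of the in-line rectangles are spread evenly over the line height, and the clear concentration of frequency at two positions that characterizes Western character lines cannot be observed.
Therefore, a baseline like broken line d in FIG. 7 is obtained, and the height from it to the start point of each in-line rectangle is calculated. A baseline like broken line d can be obtained as a straight line connecting the end points of the in-line rectangles (see the definition of rectangle coordinates in FIG. 4(A)).
Specifically, the regression line of the distribution of the end point coordinates (Xi, Yi) of the in-line rectangles is obtained. The method of obtaining the regression line of Y on X is described in detail in statistics textbooks (for example, Guttman and S. S. Wilks, 「工科系のための統計概論」, published by 培風社), but in brief it is as follows.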
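The regression line of Y on X is expressed in the form
Y = AX + B
where A is called the regression coefficient of Y on X. A is obtained from
A = {NΣXiYi − (ΣXi)(ΣYi)} / {NΣXi^2 − (ΣXi)^2}
and B is then obtained from
B = ΣYi/N − AΣXi/N
where N is the number of end points used.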
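The two formulas translate directly into code. The following is a minimal sketch, assuming the end points are given as `(x, y)` pairs; the function name `regression_line` and the sample points are illustrative and not part of the source.

```python
def regression_line(points):
    """Least-squares line Y = A*X + B through (x, y) points.

    A = (N*sum(xy) - sum(x)*sum(y)) / (N*sum(x^2) - sum(x)^2)
    B = sum(y)/N - A*sum(x)/N
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("need at least two points with different X values")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = sum_y / n - a * sum_x / n
    return a, b


# Hypothetical end points lying exactly on Y = 0.1*X + 50.
ends = [(0, 50.0), (10, 51.0), (20, 52.0), (30, 53.0)]
a, b = regression_line(ends)
```

The guard on the denominator is an addition for this sketch: the formula is undefined when all end points share one X value.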
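[0016]
Because the above method obtains the baseline from the end point coordinates of the in-line rectangles, it may use end point data that does not correctly reflect the true baseline. Examples are isolated marks in the upper part of the line such as apostrophes, acute accents, and umlauts, whose end points lie far from the true baseline. If the regression line is computed with these included, it deviates considerably from the true baseline, and the height positions of the rectangles can no longer be obtained correctly.
To avoid this, the end points used for the regression line are limited to those at about 1/2 of the line height or lower. This allows the baseline of the character line to be obtained accurately. The threshold of 1/2 of the line height is only an example; needless to say, it can be adjusted so that only the lower part of the line is used.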
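[0017]
The language determination procedure of this embodiment, which enables skew correction, is described next.
FIG. 8 is a flowchart showing the processing procedure of this embodiment. The flow in FIG. 8 can be implemented as a replacement for the line language determination step (step 4) in the character line language determination procedure (FIG. 5) used in Embodiments 1 and 2.
Referring to FIG. 8, for each line obtained by the preceding character line segmentation (step 3), the start point heights of the in-line rectangles are tallied, the maximum height of the in-line rectangles is obtained and compared with the actual line height, and the line height is estimated from the result (step 4-1). If (maximum in-line rectangle height) × A > (actual line height), where A is, for example, 1.2, the maximum in-line rectangle height is taken as the line height. If the inequality does not hold, the actual line height (that is, the line segmentation result) is used as the line height. This estimate is a countermeasure for skewed lines and for lines made up only of small in-line rectangles.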
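The comparison in step 4-1 can be written as a short helper. This sketch follows the inequality exactly as stated in the text; the function name `estimate_line_height` and the parameter name `factor` (standing for the multiplier A, example value 1.2) are assumptions.

```python
def estimate_line_height(rect_heights, actual_line_height, factor=1.2):
    """Return the line height used by the later steps (step 4-1).

    If max(rect_heights) * factor exceeds the segmented line height, the
    tallest in-line rectangle is taken as the line height; otherwise the
    segmented line height is kept.
    """
    max_rect_height = max(rect_heights)
    if max_rect_height * factor > actual_line_height:
        return max_rect_height
    return actual_line_height


# Hypothetical values: 30 * 1.2 = 36 > 33, so the tallest rectangle wins.
tall_case = estimate_line_height([20, 28, 30], actual_line_height=33)
# 12 * 1.2 = 14.4 is not greater than 40, so the segmented height is kept.
small_case = estimate_line_height([10, 12], actual_line_height=40)
```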
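Next, the regression line of the end points Ye of the in-line rectangles is obtained as the baseline that serves as the reference for in-line rectangle heights (step 4-2). Only end points Ye located at or below half of the line height (that is, the line height estimated in step 4-1) are used.
The height of the start point Ys of each in-line rectangle from the baseline calculated in the previous step is obtained, and the frequencies are tallied (step 4-3). The tally divides the heights into predetermined ranges and counts the occurrences in each range.
Next, from the tally of the previous step, entries whose start point height Ys is at or below half of the line height (that is, the line height estimated in step 4-1) are removed, because they are not used in the language determination of the next step (step 4-4).
Language determination is then performed on the remaining tally data (step 4-5). The tally data is sorted by frequency, and it is judged whether both of the following conditions hold:
(height with the highest frequency, i.e. in-line rectangle start point height) < (height with the second-highest frequency)
that is, the Western-text feature that the most frequent (lowercase) height is lower than the second most frequent (uppercase) height; and
(third-highest frequency) ≪ (second-highest frequency) (a large difference in frequency)
that is, the frequencies concentrate at the heights ranked first (lowercase) and second (uppercase). If both hold, the line is determined to be a Western character line; otherwise it is determined to be a non-Western or Asian character line.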
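Steps 4-2 to 4-5 can be sketched together as follows. This is an illustration under stated assumptions, not the patented implementation: rectangles are `(xs, ys, xe, ye)` tuples with y growing downward, the baseline is evaluated at each rectangle's Xs, `bin_size` stands for the "predetermined height ranges", and "≪" is approximated by the hypothetical factor `gap_ratio`. The helper `make_line` only generates test data.

```python
from collections import Counter


def fit_line(points):
    """Least-squares line y = a*x + b through (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, sy / n - a * sx / n


def classify_line(rects, line_top, line_height, bin_size=2, gap_ratio=3.0):
    """Return "western" or "asian" for one character line."""
    half = line_height / 2
    line_bottom = line_top + line_height
    # Step 4-2: baseline from end points lying in the lower half of the line.
    ends = [(xe, ye) for _, _, xe, ye in rects if line_bottom - ye <= half]
    a, b = fit_line(ends)
    # Steps 4-3 and 4-4: tally start heights above the baseline per bin,
    # skipping heights at or below half the line height.
    hist = Counter()
    for xs, ys, _, _ in rects:
        height = (a * xs + b) - ys
        if height > half:
            hist[int(height // bin_size)] += 1
    # Step 4-5: rank bins by frequency and test the two conditions.
    ranked = hist.most_common(3)
    if len(ranked) < 2:
        return "asian"
    (h1, _), (h2, c2) = ranked[0], ranked[1]
    c3 = ranked[2][1] if len(ranked) > 2 else 0
    return "western" if h1 < h2 and c2 >= gap_ratio * c3 else "asian"


def make_line(heights, slope=0.0, base=100.0):
    """Hypothetical data: one 10-pixel-wide rectangle every 12 pixels."""
    rects = []
    for i, h in enumerate(heights):
        xs, xe = 12 * i, 12 * i + 10
        ye = base + slope * xe
        rects.append((xs, ye - h, xe, ye))
    return rects


def classify(rects):
    line_top = min(r[1] for r in rects)
    line_height = max(r[3] for r in rects) - line_top
    return classify_line(rects, line_top, line_height)


skewed_western = classify(make_line([30, 20, 20, 20, 20, 30, 20, 20, 20, 20], slope=0.05))
asian_like = classify(make_line([30, 29, 30, 28, 30, 26, 30, 30, 24, 30]))
```

In the skewed example the start heights collapse into two bins (lowercase and uppercase) once they are measured from the fitted baseline, which is the effect the skew correction is meant to achieve.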
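[0018]
"Embodiment 4"
The embodiments above illustrated forms that can be applied to each cut-out character line. This embodiment concerns a form that determines a single language per document original to be processed.
In general, a document mostly uses one language per original, and apart from loanwords and proper nouns it is rare for several languages to be mixed within one sheet. The embodiments above described language determination methods applicable per character line; if the per-line results are tallied over an entire original, the language to which the original belongs can be decided.
Criteria such as the following can be applied to decide the language of the original from the tally:
・decide by majority vote
・decide by majority vote, choosing the majority if the difference is at least a set value and "unknown" if the vote is close
・decide by the language of the longest line (short lines are not considered)
・decide by majority vote over only the lines that have the most frequent line height (so that only body text lines are considered)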
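The second criterion in the list (majority vote with an "unknown" outcome for close votes) might look like the sketch below. The function name `classify_document`, the labels, and the `min_margin` value are assumptions chosen for illustration.

```python
from collections import Counter


def classify_document(line_results, min_margin=0.2):
    """Majority vote over per-line results; "unknown" when the vote is close.

    min_margin is the lead the winner needs over the runner-up, expressed
    as a fraction of all counted lines (an illustrative setting).
    """
    votes = Counter(r for r in line_results if r in ("western", "asian"))
    total = sum(votes.values())
    if total == 0:
        return "unknown"
    (winner, top), *rest = votes.most_common()
    runner_up = rest[0][1] if rest else 0
    return winner if (top - runner_up) / total >= min_margin else "unknown"


clear_vote = classify_document(["western"] * 8 + ["asian"] * 2)
close_vote = classify_document(["western"] * 5 + ["asian"] * 5)
```

Setting `min_margin=0` reduces this to the plain majority vote of the first criterion, apart from exact ties.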
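The language determination procedure of this embodiment is described next.
FIG. 9 is a flowchart showing the processing procedure of this embodiment. The flow in FIG. 9 can be implemented by adding it after the line language determination step (step 4, step 4-5) of the character line language determination procedures (FIG. 5, FIG. 8) used in Embodiments 1 to 3. Because the character line language determination in Embodiments 1 to 3 is applied per character line, this embodiment requires that the line language determination step (step 4, step 4-5) of FIG. 5 or FIG. 8 be applied to every character line of the original to obtain the language determination results.
Referring to FIG. 9, the preceding language determination step (step 4, step 4-5) determines the language of each character line, and the per-line results are then tallied over the entire original (step 5). That is, the determination result of each character line is stored in association with other information about that line (for example, the rectangle information of the character line and the character circumscribed rectangle information within the line).
Next, the language to which the original belongs is determined from the tally of the previous step (step 6). Using criteria such as those above, this determination either assigns one language to the original or yields the result "unknown".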
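[0019]
"Embodiment 5"
With the methods of the embodiments above, misjudgment is possible depending on the in-line rectangles of the character line concerned. Because the language is identified from the frequency distribution of in-line rectangle heights, a line with few in-line rectangles does not yield a tally on which a confident judgment can be based, and a judgment based on it carries a high risk of error.
This embodiment adds a processing step to suppress such results. A predetermined value is set as the lower limit on the number of in-line rectangles; if the number of in-line rectangles in a line to be judged falls below this limit, the line is excluded from determination and classified as an undeterminable line, and this result is managed as information, for example by reporting it to the user or storing it as management information.
Furthermore, when a per-original determination is made from the per-line results (see Embodiment 4), a predetermined upper limit is set on the number of undeterminable lines or on their share of all lines; if the number of undeterminable lines exceeds this limit, the original is treated as an undeterminable original. The aim is for originals that contain almost no character lines, for example, to be classified as undeterminable. Here too, the result is managed as information, for example by reporting it to the user or storing it as management information.
The additional step for avoiding misjudgment can be implemented within the line language determination step (step 4, step 4-5) of the character line language determination procedures (FIG. 5, FIG. 8) shown in Embodiment 1, and within the original language determination step (step 6) of the original language determination procedure (FIG. 9) shown in Embodiment 4.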
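The two gates of Embodiment 5 can be sketched as below. The limits `min_rects` and `max_unjudgeable_share` are placeholder values, since the text only says they are predetermined, and the function names are hypothetical.

```python
def line_is_judgeable(rect_count, min_rects=10):
    """A line is judged only if it has at least min_rects in-line rectangles."""
    return rect_count >= min_rects


def document_is_judgeable(rect_counts, min_rects=10, max_unjudgeable_share=0.5):
    """An original is undeterminable when too many of its lines are.

    rect_counts holds the number of in-line rectangles of every line.
    An original without any lines is treated as undeterminable.
    """
    if not rect_counts:
        return False
    unjudgeable = sum(1 for n in rect_counts if not line_is_judgeable(n, min_rects))
    return unjudgeable / len(rect_counts) <= max_unjudgeable_share


mostly_text = document_is_judgeable([30, 25, 3, 40])   # 1 of 4 lines too short
mostly_noise = document_is_judgeable([2, 3, 1, 40])    # 3 of 4 lines too short
```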
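[0020]
"Embodiment 6"
Besides the case that calls for the method of Embodiment 5, there are other cases in which misjudgment is possible depending on the in-line rectangles of the character line concerned. Even if the number of in-line rectangles exceeds the specified value, this language determination method does not work effectively when the number of characters is small.
This embodiment adds a processing step to suppress determination results in such cases. A threshold is set on the aspect ratio of the character line rectangle, and short lines are excluded from determination.
In general, the circumscribed rectangle of a character is rarely an extreme oblong and is often close to a square. The more characters a line contains, the more elongated its circumscribed rectangle becomes and the larger its aspect ratio. Checking the aspect ratio of a line alone therefore predicts, to some extent, how many characters it contains.
The following criteria are therefore set:
Character direction horizontal: (horizontal length of the line) / (vertical length of the line) > threshold
Character direction vertical: (vertical length of the line) / (horizontal length of the line) > threshold
Lines that do not satisfy them are excluded from language identification and classified as undeterminable lines, and this result is managed as information, for example by reporting it to the user or storing it as management information.
Furthermore, when a per-original determination is made from the per-line results, undeterminable originals are identified and that information is managed in the same way as in Embodiment 5.
The additional step for avoiding misjudgment can be implemented within the line language determination step (step 4, step 4-5) of the character line language determination procedures (FIG. 5, FIG. 8) shown in Embodiment 1, and within the original language determination step (step 6) of the original language determination procedure (FIG. 9) shown in Embodiment 4.
Adding such a step avoids attempting language identification on lines that are hard to judge accurately, which makes processing more efficient.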
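The aspect ratio criterion of Embodiment 6 reduces to one comparison per line. In this sketch the function name `line_long_enough`, the direction labels, and the threshold value 5.0 are assumptions; the text leaves the threshold open.

```python
def line_long_enough(width, height, direction, threshold=5.0):
    """Apply the aspect ratio test to one character line rectangle.

    direction is "horizontal" or "vertical" (the character direction).
    """
    if direction == "horizontal":
        ratio = width / height
    elif direction == "vertical":
        ratio = height / width
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return ratio > threshold


long_horizontal = line_long_enough(300, 30, "horizontal")   # ratio 10
short_horizontal = line_long_enough(60, 30, "horizontal")   # ratio 2
long_vertical = line_long_enough(30, 300, "vertical")       # ratio 10
```

Lines for which the function returns `False` would be classified as undeterminable rather than passed to the height-based determination.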
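[0021]
"Embodiment 7"
When the methods of the embodiments above are applied, the determination based on in-line rectangle heights can be omitted for some originals. If such originals can be recognized by a simple method, processing becomes more efficient.
When Western and Asian lines are to be distinguished, Western character lines are written only horizontally, whereas Asian character lines can be written either vertically or horizontally. Using this property simplifies language identification.
If the character line of interest is horizontal, it may be either Western or Asian, so detailed language identification is needed.
If the line of interest is vertical, however, it can only be an Asian character line. Therefore, before the detailed language processing, the line direction is examined for a simple language identification, and when the result is an Asian character line, the determination based on in-line rectangle heights is skipped, which makes processing more efficient.
The additional step that skips the determination based on in-line rectangle heights can be implemented within the line language determination step (step 4, step 4-5) of the character line language determination procedures (FIG. 5, FIG. 8) shown in Embodiment 1.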
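The shortcut of Embodiment 7 can be sketched as a wrapper that consults the line direction before running the height-based determination. The names `classify_with_direction` and `detailed_classifier` are hypothetical; the detailed classifier is passed in as a callable so that it is only evaluated when needed.

```python
def classify_with_direction(direction, detailed_classifier):
    """Return "asian" for vertical lines without running the detailed check.

    detailed_classifier is any zero-argument callable that performs the
    height-based determination and returns "western" or "asian".
    """
    if direction == "vertical":
        return "asian"
    return detailed_classifier()


calls = []


def detailed():
    """Stand-in for the height-based determination; records each call."""
    calls.append(1)
    return "western"


vertical_result = classify_with_direction("vertical", detailed)
horizontal_result = classify_with_direction("horizontal", detailed)
```

After both calls, `calls` holds a single entry, showing that the detailed determination ran only for the horizontal line.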
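[0022]
"Embodiment 8"
This embodiment shows an image processing apparatus according to the present invention. It illustrates an apparatus built on a general-purpose processing device (computer) as the means for executing the processing steps of the methods for determining the language of a character line and of an original shown in Embodiments 1 to 7.
FIG. 10 illustrates the configuration of the image processing apparatus of this embodiment. As shown in FIG. 10, this example is implemented with a general-purpose processing device (computer) and comprises a CPU 1, a memory 2, a hard disk drive 3, an input device 4 such as a scanner, keyboard, or mouse, a CD-ROM drive 5, a display 6, a flexible disk drive 7, a communication device 8, and so on, connected by a bus.
A program (software) that implements the character recognition and image processing functions of the present invention and carries out the processing procedures described for the language determination methods of the above embodiments is recorded on part of the storage media (not shown) used by the memory 2, the hard disk drive 3, the CD-ROM drive 5, and the flexible disk drive 7, which serve as storage means.
The document image of the original to be processed is input through the input device 4, such as a scanner, and stored, for example, on the hard disk 3. The CPU 1 reads the program that implements the above processing functions and methods from the recording medium of the storage means, executes the processing on the target document image according to the program, and outputs the results to the display 6 or the like.
As shown in FIG. 11, the image processing apparatus of the present invention may also be connected by the communication device 8 to external devices 11 to 13 through a communication line 20 such as the Internet, so that part of the functionality resides on the network.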
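[0023]
[Effects of the Invention]
(1) Effects corresponding to the invention of claim 1
Information on the height within the character line of the start point of each circumscribed rectangle (the coordinates of the top and foremost part of the character) is tallied, and the language of the character line is determined by comparing the distribution of the tallied heights with a predetermined reference distribution indicating an attribute of a language type. This simplifies the procedure for determining the language of a character line, avoids increasing the calculation load, and speeds up processing. When applied to character recognition, the method uses features obtained during line segmentation instead of extracting new line features, so language identification is fast; the identification result provides a basis for automatically selecting the document processing best suited to the language, which makes this an effective technique for recognition processing.
(2) Effects corresponding to the invention of claim 2
In-line rectangle height data that does not express the features of Western languages (that is, noise) is removed, so the accuracy of language determination can be raised.
(3) Effects corresponding to the inventions of claims 3 and 4
The baseline of the character string is calculated from the end point data of the in-line rectangles, which is available without any new data extraction, and the start point heights of the circumscribed rectangles are skew-corrected with it, so correct language determination is possible. Data that would act as noise is excluded when the skew line (baseline) is calculated, so accuracy can be improved.
[0024]
(4) Effects corresponding to the invention of claim 5
The determination results of the character lines are tallied per processed document (original) so that one determination result is obtained for the whole original, which makes processing that uses the determination result more efficient.
(5) Effects corresponding to the inventions of claims 6 and 7
A procedure checks whether the conditions for a correct determination are present, so misjudgment can be avoided, the reliability of determination improves, and processing that uses the determination result does not lose efficiency.
(6) Effects corresponding to the invention of claim 8
Asian character lines can be identified simply from the direction of the character line, so the determination based on in-line rectangle heights can be skipped according to this result, which makes processing more efficient.
[0025]
(7) Effects corresponding to the inventions of claims 9 to 11
Installing a program for executing the steps of the image processing method of any one of claims 1 to 8 on a general-purpose processing device (computer) makes it easy to realize effects (1) to (6) above. Providing the program on a recording medium also improves convenience.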
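[Brief description of the drawings]
FIG. 1 shows an example of a document image to be processed.
FIG. 2 shows the result of creating circumscribed rectangles of the black runs that can be regarded as characters in the example document image (FIG. 1).
FIG. 3 shows the character line rectangles and character circumscribed rectangles obtained by the merging process for the example document image (FIG. 1).
FIG. 4 (A) shows the definition of rectangle coordinates; (B) and (C) show examples of the arrangement of in-line rectangles for a Western character line and an Asian character line, respectively.
FIG. 5 is a flowchart showing the basic procedure for determining the language to which a character line belongs.
FIG. 6 shows dots (located at the midpoint of the character line height) as an example requiring compensation.
FIG. 7 shows an example of an input document image with skew.
FIG. 8 is a flowchart showing the language determination procedure that enables skew correction.
FIG. 9 is a flowchart showing the language determination procedure that includes the step of tallying determination results over the entire document (original).
FIG. 10 shows the configuration of an image processing apparatus according to an embodiment of the present invention.
FIG. 11 shows another configuration of an image processing apparatus according to an embodiment of the present invention.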
[Explanation of reference signs]
1: CPU, 2: memory,
3: hard disk drive, 4: input device,
5: CD-ROM drive, 6: display (display device),
7: FD drive, 8: communication device.
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to image processing used for OCR (optical character readers), character recognition, and the like. More specifically, it relates to an image processing method that includes a processing procedure for determining (classifying) the language used in a document input as an image, to a program used to execute the method, and to an image processing apparatus.
[0002]
[Prior art]
Conventionally, as image processing for document originals, OCR and character recognition of the characters written in a document have been performed based on image information obtained by reading the original with a scanner.
In character recognition, the usual procedure is to cut out from the document image an image region judged to contain one character and to recognize the character from the shape features of the cut-out image. To obtain high recognition accuracy through efficient processing, the language of a cut-out character region (for example, a character line) is determined and classified, and a processing method adapted to the language class (for example, Japanese or English) is used.
Patent Documents 1 to 3 below can be cited as conventional examples proposing such language determination methods performed in association with character recognition.
Patent Documents 1 and 2 both classify Japanese lines and alphabetic (English) lines by scanning the image in the direction perpendicular to the line direction and counting the number of white/black inversions or the number of black runs (consecutive black pixels). The count serves as an index of character complexity and is used as a criterion for identifying the more complex Japanese text.
Patent Document 3 discloses a method that determines whether a character region is Japanese or English by focusing on character heights within the region. It uses as a criterion the variation in the heights of the circumscribed rectangles of the characters, namely the feature that in English text the rectangle heights concentrate at the uppercase and lowercase heights. Patent Document 2 below also discloses a method that improves the accuracy of identifying alphabetic text by supplementarily using a criterion based on the variation in circumscribed rectangle height that is characteristic of alphabetic characters.
[0003]
[Patent Document 1]
JP-A-5-108876
[Patent Document 2]
JP-A-9-231317
[Patent Document 3]
JP-A-6-150055
[0004]
[Problems to be solved by the invention]
However, the methods of Patent Documents 1 and 2 require that, after line segmentation, the image be scanned perpendicular to the line direction at regular intervals along the entire line and the number of white/black inversions be counted, which increases the processing load.
The method of Patent Document 3 focuses on the height of each circumscribed rectangle. Obtaining this height requires a further calculation based on the coordinates obtained when the circumscribed rectangles are cut out, and this additional calculation increases the processing load.
The present invention has been made in view of the above problems of the conventional techniques used to determine and classify the language of a character region to be processed when OCR or character recognition is performed on the characters written in a document, based on image information read from the document original. Its object is to simplify the processing procedure for determining and classifying the language of a cut-out character region (for example, a character line) so that the procedure does not increase the calculation load.
To solve the above technical problem (object), the present invention uses a method that focuses on the start points of the rectangles in a character line obtained by line segmentation and tallies the frequency of their heights. A further object is to provide a procedure that handles skew (tilt of the original), which is a source of error when this method is adopted, so that determination and classification accuracy can be maintained.
[0005]
[Means for Solving the Problems]
The invention of claim 1 is an image processing method characterized by executing: a step of generating circumscribed rectangle information of character images from a document image to be processed; a step of cutting out character lines based on the circumscribed rectangle information; a start point height information tallying step of tallying, for the character circumscribed rectangles in a character line, information on the height within the character line of the start point of each circumscribed rectangle (the coordinates of the foremost part and the top of the character); and a language determination step of determining the language to which the character line belongs by comparing the distribution of the tallied height information with a predetermined reference distribution indicating an attribute of a language type.
The invention of claim 2 is the image processing method of claim 1, characterized in that the start point height information tallying step tallies the information on the height within the character line of the start points of the circumscribed rectangles only for those character circumscribed rectangles in the character line whose start point height is equal to or greater than a predetermined value set for noise removal.
[0006]
The invention of claim 3 is the image processing method of claim 1 or 2, characterized by executing a skew line calculation step of obtaining a straight line connecting the end points (the coordinates of the rearmost part and the bottom of the character) of the character circumscribed rectangles in the character line, and a step of calculating the height within the character line used in the start point height information tallying step as the height from the skew line.
The invention of claim 4 is the image processing method of claim 3, characterized in that the skew line calculation step obtains the straight line connecting the end points of the circumscribed rectangles only for those character circumscribed rectangles in the character line whose end point height is equal to or less than a predetermined value set for noise removal.
[0007]
The invention of claim 5 is the image processing method of any one of claims 1 to 4, characterized by executing a step of tallying, per processed document, the determination results of the character lines obtained by the language determination step, and determining from the tallied result, according to a predetermined rule, either a single language or "unknown".
[0008]
The invention of claim 6 is the image processing method of any one of claims 1 to 5, characterized by executing a step of deciding, based on the number of character circumscribed rectangles in a character line cut out by the character line cutting step, whether to execute the step for determining the language of that line.
The invention of claim 7 is the image processing method of any one of claims 1 to 6, characterized by executing a step of deciding, based on the aspect ratio of a character line cut out by the character line cutting step, whether to execute the step for determining the language of that line.
The invention of claim 8 is the image processing method of any one of claims 1 to 7, characterized by executing a step of deciding, based on the direction of a character line cut out by the character line cutting step, whether to execute the step for determining the language of that line.
[0009]
The invention of claim 9 is a program for causing a computer to execute each processing step of the image processing method of any one of claims 1 to 8.
The invention of claim 10 is an image processing apparatus comprising a computer on which the program of claim 9 is installed, the computer being made to function, based on the program, as means for processing a target image.
The invention of claim 11 is a recording medium on which the program of claim 9 is recorded.
[0010]
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention will be described based on the following embodiments shown in the accompanying drawings.
The present invention relates to an image processing method that determines and classifies the language of the characters written in a document based on image information read from the document original. The aim of determining the language is to use the result to make character recognition efficient and to obtain high recognition accuracy. To that end, the language determination, which focuses on features of the language, is performed in a form that can be used in the middle of the character recognition procedure, in which character regions judged to contain characters are cut out of the document image and characters are recognized from the shape features of the cut-out regions. The determination captures features of the language that appear in the character circumscribed rectangles cut out of the document image as character regions. In the present invention, the language of a line is determined using information on the height within the character line of the start points (the coordinates of the foremost part and the top of the character) of the character circumscribed rectangles that have already been obtained by the character line cutting process. This simplifies the procedure for determining the language of a cut-out character line, avoids increasing the calculation load, and speeds up processing.
Embodiments of the present invention are described below. First, the languages targeted by the language determination in the following embodiments and the document images in which they are written are described.
FIG. 1 shows an example of a document image to be processed. Consider the case where, based on a document image read from an original such as that in FIG. 1, text is divided into two classes: Asian languages (Japanese, Chinese, Korean) and Western languages (English, French, German, Italian, Spanish, and so on). The embodiments below mainly take Japanese as the Asian language and English as the Western language, as in FIG. 1. Unless otherwise noted, however, the present invention is not limited to the illustrated embodiments; it presents a method of broadly separating the so-called Indo-European language family from other languages and is not limited to a specific language pair.
[0011]
"Embodiment 1"
“Embodiment 1” described below shows the basic form of a method that determines the language to which a line belongs using the height within the character line of the start point of each character circumscribed rectangle (see FIG. 4(A), described later).
Because the present invention captures features of the language that appear in the character circumscribed rectangles, character circumscribed rectangles are generated (extracted) before language determination. This can be done with an existing method that obtains the circumscribed rectangles of the black runs in the original image. A “run” is a group of consecutive pixels that share the same value, and is also known as a unit of encoding. FIG. 2 shows the circumscribed rectangles of black pixels obtained from the target image (FIG. 1) by this processing.
An existing process is then applied that merges the obtained circumscribed rectangles of black runs with neighboring rectangles, growing them into lines, and cuts out character lines. FIG. 3 shows the character lines obtained from the target image (FIG. 1) by this processing.
[0012]
Features of the language are captured from the circumscribed rectangles obtained by the character circumscribed rectangle generation described above. Specifically, the present invention captures features that appear in the arrangement of the character circumscribed rectangles, that is, of the circumscribed rectangles belonging to a character line (in-line rectangles). The arrangement of the character circumscribed rectangles in a character line can be expressed by defining rectangle coordinates. FIG. 4(A) illustrates the definition: the upper left corner of the rectangle is called the start point and the lower right corner the end point, with coordinates (Xs, Ys) and (Xe, Ye), respectively. A rectangle is defined by the coordinates of these two points. Xs and Ys of the start point are the coordinates of the foremost part and the top of the character enclosed by the rectangle, and Xe and Ye of the end point are the coordinates of its rearmost part and bottom.
FIGS. 4(B) and 4(C) show examples of the arrangement of in-line rectangles in a Western character line and an Asian character line, respectively. As FIG. 4(B) shows, a Western character line mixes uppercase and lowercase letters and may or may not contain marks such as apostrophes, acute accents, and umlauts, so the start point heights of the in-line rectangles concentrate at two positions, a and b in FIG. 4(B). This is taken as the arrangement feature of Western characters, and the height information is used to identify Western languages.
In an Asian character line, by contrast, characters such as kanji, hiragana, katakana, and hangul have complex structures, as shown in FIG. 4(C), and the start point heights of the in-line rectangles show no clear concentration at two positions like that seen in Western character lines.
To distinguish the two, the occurrence frequency of the start point heights of the in-line rectangles is tallied for the line of interest; this detects the language features that appear in the arrangement of the circumscribed rectangles in a character line with good accuracy. The in-line rectangles have already been obtained during line segmentation in the form of the two points (Xs, Ys) and (Xe, Ye), so no additional feature extraction is needed and the processing load does not increase, which is convenient.
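The tally of start point heights can be sketched as a histogram over the rectangles of one line. This is a minimal illustration: the `Rect` type, the function name `start_height_histogram`, the bin width, and the sample coordinates are assumptions, and the y axis is assumed to grow downward as is usual for image coordinates.

```python
from collections import Counter
from typing import NamedTuple


class Rect(NamedTuple):
    xs: int  # start point X (left edge)
    ys: int  # start point Y (top edge)
    xe: int  # end point X (right edge)
    ye: int  # end point Y (bottom edge)


def start_height_histogram(rects, line_bottom, bin_size=2):
    """Count how often each start-point height occurs in one character line.

    Heights are measured upward from line_bottom and grouped into bins of
    bin_size pixels; the returned Counter maps bin index to frequency.
    """
    hist = Counter()
    for r in rects:
        height = line_bottom - r.ys  # a higher top edge gives a larger height
        hist[height // bin_size] += 1
    return hist


# Hypothetical line with its bottom at y=100: two capitals and three lowercase letters.
line = [Rect(0, 60, 10, 100), Rect(12, 72, 20, 100), Rect(22, 72, 30, 100),
        Rect(32, 73, 40, 100), Rect(42, 61, 50, 100)]
hist = start_height_histogram(line, line_bottom=100)
```

Only the Ys values already stored for each rectangle are read, which reflects the point made above that no extra feature extraction is required.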
[0013]
Here, a description will be given of the procedure for determining the language to which the character line belongs in the present embodiment.
FIG. 5 is a flowchart illustrating a processing procedure according to the present embodiment.
Referring to FIG. 5, first, an image input device such as a scanner or a digital still camera reads the document to be processed, and image input processing, including image processing such as binarization, is performed (step 1). In this input processing, black run generation processing is performed on the document image.
Next, a circumscribed rectangle of the black run is obtained based on the black run of the generated document image (step 2). The circumscribed rectangle of the black run obtained here includes a figure and the like other than characters.
Therefore, a process of extracting a rectangle that can be regarded as a character from the obtained circumscribed rectangle of the black run is performed, a process of integrating the extracted rectangles with neighboring rectangles is performed, and a character line is cut out (step 3).
Next, in order to extract the features of the language appearing in the arrangement of the in-line rectangles, information on the starting point height (see FIG. 4 (A)) of each circumscribed rectangle in the cut-out character line is acquired, and the appearance frequency at the two concentrated positions (see FIG. 4B) that appear as an arrangement feature of Western characters is totaled. If this appearance frequency is higher than a predetermined appearance frequency, the line is determined to be a Western character line; otherwise, it is determined to be a non-Western character line or an Asian character line (step 4).
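The determination in step 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure of the embodiment itself: the function names, the section width `bin_size`, and the threshold `min_share` are hypothetical, rectangles are (Xs, Ys, Xe, Ye) tuples in image coordinates with Y increasing downward, heights are measured from an assumed line bottom `line_bottom`, and the two concentrated positions are taken to be the two most frequent sections.

```python
from collections import Counter

def start_height_histogram(rects, line_bottom, bin_size=2):
    """Count starting point heights, grouped into sections of bin_size pixels."""
    counts = Counter()
    for xs, ys, xe, ye in rects:
        height = line_bottom - ys  # assumption: Y grows downward
        counts[height // bin_size] += 1
    return counts

def is_western_line(rects, line_bottom, bin_size=2, min_share=0.8):
    """Return True if the two most frequent sections hold at least min_share
    of all starting points (hypothetical criterion for illustration)."""
    hist = start_height_histogram(rects, line_bottom, bin_size)
    total = sum(hist.values())
    if total == 0:
        return False
    top_two = sum(count for _, count in hist.most_common(2))
    return top_two / total >= min_share
```

A real implementation would need to choose `bin_size` and `min_share` empirically for the scanning resolution in use.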
[0014]
"Embodiment 2"
The first embodiment showed the basic form of the method of determining the language to which a character line belongs. However, there may be cases in which applying the first embodiment as it is results in an erroneous determination, so in this embodiment a compensation processing step is added so that such cases do not occur.
For example, in a table of contents page, a blank portion of a line may be filled with dots regardless of the language. FIG. 6 shows this example, where the dot is located at the midpoint of the height of the character line. In such a case, the most frequent position of the starting point of the circumscribed rectangle (see FIG. 4A) is determined only by the dots, and in this case, the language-specific features described in the first embodiment do not appear.
The most frequent positions differ by language in the range a-b in FIG. 4B. Therefore, if starting point heights of circumscribed rectangles outside this range are excluded from the aggregation, the language can be identified correctly even in a case such as that of FIG. 6.
The range a-b in FIG. 4 (B) corresponds to about 2/3 or more of the line height, so starting points at heights below 2/3 may be excluded.
Note that the threshold of 2/3 or more of the line height is merely an example, and it goes without saying that it can be adjusted so that only the range a-b in FIG. 4 (B) is targeted.
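The exclusion described above amounts to a simple filter on the starting point heights before they are counted. The sketch below assumes heights are already measured upward from the bottom of the line; the function name and the parameter `min_ratio` are hypothetical.

```python
def filter_start_heights(start_heights, line_height, min_ratio=2 / 3):
    """Keep only starting point heights at or above min_ratio of the line height.

    Leader dots at mid-height (as in a table of contents) fall below the
    threshold and are dropped before frequency counting.
    """
    threshold = line_height * min_ratio
    return [h for h in start_heights if h >= threshold]
```

With a line height of 30, for instance, dots at height 15 are discarded while lowercase and uppercase tops near heights 21 and 29 are kept.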
The step of performing the compensation process on the table of contents page and the like is performed in the language determination step (step 4) of the line in the language determination process procedure (FIG. 5) to which the character line belongs according to the first embodiment.
[0015]
"Embodiment 3"
This embodiment addresses the possibility that a correct determination cannot be made because of skew; skew correction is performed to avoid erroneous determination. Here, a form applicable to the first and second embodiments will be described.
When a document to be processed is read by an input device, the document may be inclined. FIG. 7 shows an example of an input document image in which skew has occurred. As shown in FIG. 7, when skew occurs, the character lines are also tilted. An extreme inclination can be handled by setting a condition under which line segmentation processing does not succeed, but with a slight inclination the lines can still be cut out.
However, because the method according to the first embodiment focuses on the appearance frequency of the starting point height of the in-line rectangles, even a slight inclination greatly affects the aggregation result. That is, the distance from the position e shown in FIG. 7 (the base line serving as the reference line for determining the starting point height of the in-line rectangle) to the starting point of each in-line rectangle becomes uniformly distributed over the line height, and the clear concentration of frequency at two positions that characterizes Western character lines can no longer be observed.
Therefore, a base line such as a broken line d shown in FIG. 7 is obtained, and a height from the base line to the starting point of the in-line rectangle is obtained. In order to obtain a base line such as a broken line d, a straight line connecting the end points of the in-line rectangles (see the definition of the rectangular coordinates shown in FIG. 4A) may be obtained.
Specifically, a regression line of the distribution of the end point coordinates (Xi, Yi) of the in-line rectangles may be obtained. The method of finding the regression line of Y on X is detailed in statistics textbooks (for example, Guttman and S. S. Wilks, "Introduction to Statistics for Engineering", published by Baifukan) and is as follows.
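The equation for the regression line of Y with respect to X is
$$Y = AX + B$$
where A is the regression coefficient of Y on X. With N the number of end points, A is found from
$$A = \frac{N \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{N \sum X_i^2 - \left(\sum X_i\right)^2}$$
and B is then found from
$$B = \frac{\sum Y_i}{N} - A\,\frac{\sum X_i}{N}$$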
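The formulas for A and B above translate directly into a few sums over the end point coordinates. The following Python sketch is illustrative only; the function name `fit_baseline` and the error handling for a degenerate input are assumptions that do not appear in the description.

```python
def fit_baseline(points):
    """Least-squares regression line Y = A*X + B through (Xi, Yi) end points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        # Fewer than two distinct X values: the slope is undefined.
        raise ValueError("need at least two end points with distinct X")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = sum_y / n - a * sum_x / n
    return a, b
```

The returned pair (A, B) gives the base line from which the starting point heights are then measured.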
[0016]
Further, in the method described above, since the base line is obtained from the end point coordinates of the in-line rectangles, end point data that does not correctly reflect the base line to be obtained may be used. Examples are isolated marks in the upper part of the line such as the apostrophe, acute accent, and umlaut, whose end point coordinates differ significantly from the true base line. When such points are included, the error of the regression line becomes large, and the height position of the rectangles cannot be obtained correctly.
To avoid this, the end points of the in-line rectangles used to obtain the regression line are limited to those located at approximately 1/2 or less of the line height. This makes it possible to determine the base line of the character line accurately. Note that the threshold of 1/2 or less of the line height is merely an example, and it goes without saying that the threshold can be adjusted so as to target only the lower part of the line.
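The restriction on end points can be sketched as a filter applied before the regression. The sketch assumes (Xs, Ys, Xe, Ye) tuples in image coordinates with Y increasing downward and a line rectangle given by `line_top` and `line_bottom`; these names, the function name, and `max_ratio` are hypothetical.

```python
def baseline_candidates(rects, line_top, line_bottom, max_ratio=0.5):
    """Return end points (Xe, Ye) lying in the lower part of the line.

    An end point qualifies when its height above the line bottom is at most
    max_ratio of the line height, which drops apostrophes, accents and
    similar marks located near the top of the line.
    """
    limit = (line_bottom - line_top) * max_ratio
    return [(xe, ye) for xs, ys, xe, ye in rects if (line_bottom - ye) <= limit]
```

Only the points returned here would then be passed to the regression line calculation.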
[0017]
Here, a description will be given of a language determination processing procedure that enables skew correction according to the present embodiment.
FIG. 8 is a flowchart illustrating the processing procedure according to the present embodiment. The processing flow shown in FIG. 8 is shown as a form that can be implemented by replacing the line language determination step (step 4) in the character line language determination procedure used in the first and second embodiments (FIG. 5).
Referring to FIG. 8, for each line obtained by the character line segmentation in the previous stage (step 3), the starting point heights of the in-line rectangles are calculated, the maximum height of the in-line rectangles is obtained and compared with the actual line height, and the line height is estimated according to the result (step 4-1). Here, if (maximum height of the in-line rectangles) × A < (actual line height), where A is a coefficient (for example, 1.2), the maximum height of the in-line rectangles is regarded as the line height. If this inequality does not hold, the actual line height (= line segmentation result) is taken as the line height. The line height is estimated in this way as a countermeasure against skewed lines and against lines whose in-line rectangles are all small.
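Step 4-1 can be sketched as below. The direction of the comparison is an assumption made for this sketch: the maximum rectangle height is adopted when, multiplied by the coefficient, it is still smaller than the actual line height, as would happen for a skewed line whose bounding rectangle is inflated. The function name and the parameter name `factor` (the coefficient A, example value 1.2) are hypothetical.

```python
def estimate_line_height(rect_heights, actual_line_height, factor=1.2):
    """Estimate the line height used by the later threshold tests.

    Assumption: if max rectangle height * factor < actual line height,
    the segmented line is taller than its characters (e.g. due to skew),
    so the maximum rectangle height is used instead.
    """
    max_rect_height = max(rect_heights)
    if max_rect_height * factor < actual_line_height:
        return max_rect_height
    return actual_line_height
```

For rectangle heights of 20, 28 and 30, a segmented line height of 60 would be replaced by 30, while a segmented height of 32 would be kept.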
Next, a regression line of the end point Ye of the in-line rectangle is obtained as a baseline serving as a reference line for the height of the in-line rectangle (step 4-2). At that time, the position of the end point Ye is limited to a half or less of the line height (that is, the line height estimated in step 4-1).
Next, the height of the starting point Ys of each in-line rectangle from the baseline calculated in the previous step is obtained, and the frequencies are totaled (step 4-3). Here, a counting method is adopted in which the height is divided into sections of a predetermined width and the appearance frequency is counted for each section.
Next, from the frequency aggregation results of the previous step, entries for which the height of the starting point Ys of the in-line rectangle is less than half the line height (that is, the line height estimated in step 4-1) are removed, since they are not used in the language determination performed in the next step (step 4-4).
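Step 4-3 can be sketched as follows, given a baseline Y = A·X + B. The sketch assumes image coordinates with Y increasing downward, evaluates the baseline at the starting point's X coordinate, and uses a hypothetical section width `bin_size`; none of these details is fixed by the description.

```python
from collections import Counter

def tally_start_heights(rects, a, b, bin_size=2):
    """Count starting point heights above the baseline Y = a*X + b,
    grouped into sections of bin_size pixels."""
    counts = Counter()
    for xs, ys, xe, ye in rects:
        baseline_y = a * xs + b      # baseline evaluated at the start X
        height = baseline_y - ys     # positive when the start is above it
        counts[int(height // bin_size)] += 1
    return counts
```

Because the height is taken relative to the tilted baseline, characters of equal size fall into the same section even when the line is skewed.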
Thereafter, language determination processing is performed based on the remaining frequency aggregation data (step 4-5). Here, the frequency aggregation data is sorted in order of frequency, and it is determined whether the following two conditions are satisfied:
(height with the 1st-ranked frequency) < (height with the 2nd-ranked frequency), where height is the starting point height of the in-line rectangle; that is, the feature of Western characters that the 1st-ranked height is the lowercase height and the 2nd-ranked height is the uppercase height; and
(3rd-ranked frequency) ≪ (2nd-ranked frequency), that is, the frequency difference is large, so that the frequencies are concentrated at the 1st-ranked (lowercase) and 2nd-ranked (uppercase) heights.
If these conditions are satisfied, the line is determined to be a Western character line; otherwise, it is determined to be a non-Western character line or an Asian character line.
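The two conditions of step 4-5 can be sketched as follows. The description does not quantify how much smaller the 3rd-ranked frequency must be, so the factor `dominance` below is a hypothetical stand-in for "much smaller", and the function name is likewise an assumption. The input maps each height section to its count, after the removal in step 4-4.

```python
def is_western(freq, dominance=3):
    """Apply the ranked-frequency test to a {height: count} mapping."""
    ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return False  # two distinct heights are needed for the test
    (h1, _), (h2, c2) = ranked[0], ranked[1]
    c3 = ranked[2][1] if len(ranked) > 2 else 0
    lowercase_below_uppercase = h1 < h2        # 1st-ranked height < 2nd-ranked
    concentrated = c3 * dominance <= c2        # 3rd-ranked count << 2nd-ranked
    return lowercase_below_uppercase and concentrated
```

A line with 30 starts at height 20, 10 at height 30 and 2 elsewhere passes both tests, whereas a line whose counts are spread nearly evenly over many heights fails the second one.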
[0018]
"Embodiment 4"
Each of the above embodiments illustrated a form applicable to individual cut-out character lines. This embodiment, by contrast, relates to a form in which one language is determined for each document to be processed.
In general, a document is usually written in a single language, and it is rare for several languages to be mixed within one document, apart from foreign words and proper nouns. The above embodiments described a language determination method applicable to each character line; if the per-line determination results are totaled over one document, the language to which the document belongs can be determined.
As criteria for determining the language to which the document belongs, the following, among others, can be applied:
・ Decide by simple majority vote
・ Decide by majority vote when the difference is larger than a set value; if the difference is small, the result is unknown
・ Decide using only long lines (short lines are not considered)
・ Decide by majority vote using only lines whose height equals the mode of the line heights (in order to target only body text lines)
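The second criterion above, majority vote with a minimum margin, can be sketched as follows. The labels "western", "asian" and "unknown", the function name, and the parameter `min_margin` are hypothetical; per-line results with any other label (for example, undeterminable lines) are simply ignored in this sketch.

```python
from collections import Counter

def decide_document_language(line_results, min_margin=0):
    """Majority vote over per-line results with a required margin.

    Returns "unknown" when the vote difference does not exceed min_margin.
    """
    counts = Counter(r for r in line_results if r in ("western", "asian"))
    western, asian = counts["western"], counts["asian"]
    if abs(western - asian) <= min_margin:
        return "unknown"
    return "western" if western > asian else "asian"
```

With `min_margin=0` this reduces to a simple majority vote in which only an exact tie yields "unknown".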
Here, the language determination processing procedure of the present embodiment will be described.
FIG. 9 is a flowchart illustrating the processing procedure according to the present embodiment. The processing flow shown in FIG. 9 is shown as a form that can be implemented by adding it after the line language determination step (step 4 and step 4-5) in the character line language determination procedures (FIGS. 5 and 8) used in the first to third embodiments. However, since the character line language determination in the first to third embodiments is applied to individual character lines, in the processing of the present embodiment the line language determination step (step 4 and step 4-5) of FIGS. 5 and 8 must be performed on all the character lines to acquire their language determination results.
Referring to FIG. 9, the language determination of each character line is performed in the preceding language determination step (step 4 and step 4-5), and the per-line determination results are received and totaled over the entire document (step 5). At this time, the determination result of each character line is stored in association with other information related to that character line (for example, the rectangle information of the character line and the circumscribed rectangle information of the characters in the line).
Next, a process for determining the language to which the document belongs is performed based on the total result of the previous step (step 6). In this determination processing, one language is determined for the document based on the above-described determination criteria, or a determination result that the document is unknown is derived.
[0019]
"Embodiment 5"
With the methods described in the above embodiments, an erroneous determination may be made depending on the in-line rectangles of the character line to which they are applied. That is, since the language is identified from the frequency distribution of the heights of the in-line rectangles, when the number of in-line rectangles is small, a frequency aggregation result that allows a confident determination cannot be obtained, and there is a high risk of an erroneous determination being made on the basis of it.
In the present embodiment, a processing step for suppressing such determination results is added. In this processing step, a predetermined value is set as the lower limit on the number of in-line rectangles, and if the number of in-line rectangles in the line to be determined falls below this lower limit, the determination is not performed. The line is classified as an undeterminable line, and information management is performed such that this result is reported to the user and stored as management information.
Further, when the determination is made per document based on the determination results of the individual lines (see Embodiment 4), a predetermined value is set as the upper limit on the number of undeterminable lines or on their ratio to the total number of lines. If the undeterminable lines exceed this set value, the document is determined to be an undeterminable document. The aim is to classify documents that contain hardly any character lines as undeterminable. In this case too, information management is performed such that the result is reported to the user and stored as management information.
The additional step for avoiding such erroneous determination can be implemented in the line language determination step (step 4, step 4-5) of the character line language determination procedure according to the first embodiment (FIGS. 5 and 8) and in the document language determination step (step 6) of the document language determination procedure described in the fourth embodiment (FIG. 9).
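The two checks of this embodiment can be sketched as follows, using the ratio form of the document-level limit. The function names and the default values of `min_rects` and `max_undetermined_ratio` are hypothetical; the description only says that predetermined values are set.

```python
def line_is_determinable(rects, min_rects=10):
    """A line is judged only if it has at least min_rects in-line rectangles."""
    return len(rects) >= min_rects

def document_is_determinable(line_flags, max_undetermined_ratio=0.5):
    """A document is judged only if undeterminable lines do not exceed the
    allowed share of all lines. line_flags holds one boolean per line."""
    if not line_flags:
        return False  # no character lines at all
    undetermined = sum(1 for ok in line_flags if not ok)
    return undetermined / len(line_flags) <= max_undetermined_ratio
```

A caller would record the "undeterminable" outcome alongside the line or document information rather than attempting the height-based determination.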
[0020]
"Embodiment 6"
Besides the case handled by the method of the fifth embodiment, there are other cases in which an erroneous determination may be made depending on the in-line rectangles of the character line. That is, even if the number of in-line rectangles exceeds the specified value, this language determination method does not function effectively when the number of characters is small.
In the present embodiment, a processing step for suppressing the occurrence of the determination result in such a case is added. In this processing step, a threshold value is set for the aspect ratio of the character line rectangle, and short lines are excluded from the determination.
Generally, the circumscribed rectangle of a character is rarely extremely elongated and is often close to a square. When a line contains many characters, the circumscribed rectangle of the line becomes elongated, and its aspect ratio increases with the number of characters. That is, the number of characters in a line can be predicted to some extent merely by examining the aspect ratio of the line.
Therefore, the following condition is applied:
Character direction horizontal: (horizontal length of line) / (vertical length of line) > threshold
Character direction vertical: (vertical length of line) / (horizontal length of line) > threshold
A line that does not satisfy this condition is excluded from language identification processing and classified as an undeterminable line, and information management is performed such that this result is reported to the user and stored as management information.
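The aspect ratio condition can be sketched as follows. The function name, the direction labels, and the default `threshold` are hypothetical, and the guard against non-positive lengths is an addition for robustness that the description does not mention.

```python
def line_is_long_enough(width, height, direction, threshold=5.0):
    """Check the aspect ratio of a line rectangle against a threshold.

    direction is "horizontal" or "vertical" (the character direction).
    """
    if width <= 0 or height <= 0:
        return False  # degenerate rectangle: treat as undeterminable
    if direction == "horizontal":
        ratio = width / height
    elif direction == "vertical":
        ratio = height / width
    else:
        raise ValueError("direction must be 'horizontal' or 'vertical'")
    return ratio > threshold
```

Lines for which this returns False would be classified as undeterminable instead of being passed to the height-based determination.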
Further, when the determination is made per document from the per-line results, undeterminable documents are identified and the corresponding information management is performed, as in the fifth embodiment.
The additional step for avoiding such erroneous determination can be implemented in the line language determination step (step 4, step 4-5) of the character line language determination procedure according to the first embodiment (FIGS. 5 and 8) and in the document language determination step (step 6) of the document language determination procedure described in the fourth embodiment (FIG. 9).
By adding such a step, it is possible to avoid trying the language identification processing for a line for which it is difficult to make an accurate determination, and it is possible to increase the processing efficiency.
[0021]
"Embodiment 7"
When applying the method described in each of the above embodiments, it may be possible to omit the determination process based on the height of the in-line rectangle depending on the document. If such a manuscript whose processing can be omitted can be known by a simple method, the efficiency of the processing can be improved.
When distinguishing between European and American lines and Asian lines, European and American character lines are written only horizontally, whereas Asian character lines may be written either vertically or horizontally. Using this fact, the determination can be made easily.
If the target character line is written horizontally, it may be either a Western or an Asian line, so detailed language identification processing is necessary.
However, if the line of interest is vertical, it can only be an Asian character line. Therefore, before the detailed language identification processing, the language is first identified simply from the line direction, and if the result is an Asian character line, the determination processing based on the height of the in-line rectangles is omitted accordingly, which increases processing efficiency.
The additional step for omitting the determination processing based on the height of the in-line rectangles can be implemented in the line language determination step (step 4 and step 4-5) of the character line language determination procedure according to the first embodiment (FIGS. 5 and 8).
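The shortcut based on line direction can be sketched as follows. The function name, the labels, and the idea of passing the detailed height-based determination as a callable `detailed_check` are assumptions made for illustration.

```python
def determine_line_language(direction, detailed_check):
    """Skip the height-based determination for vertically written lines.

    detailed_check is a zero-argument callable that runs the detailed
    determination and returns its result; it is invoked only for
    horizontally written lines.
    """
    if direction == "vertical":
        return "asian"
    return detailed_check()
```

Deferring the detailed determination behind a callable ensures that its cost is paid only when the line direction alone cannot settle the question.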
[0022]
"Embodiment 8"
This embodiment shows an image processing apparatus according to the present invention. It illustrates an apparatus in which the means for executing the processing steps of the character line language determination method and the document language determination method described in the first to seventh embodiments are configured using a general-purpose processing device (computer).
FIG. 10 illustrates the configuration of the image processing apparatus according to the present embodiment. As shown in FIG. 10, this example is implemented by a general-purpose processing device (computer) and includes a CPU 1, a memory 2, a hard disk drive 3, an input device 4 (a scanner, a keyboard, a mouse, and the like), a CD-ROM drive 5, a display 6, a flexible disk drive 7, a communication device 8, and the like, which are connected by a bus.
A program (software) that realizes the character recognition and image processing functions according to the present invention and the processing procedures of the language determination methods shown in the above embodiments is recorded on one of the storage media (not shown) used by the storage means, namely the memory 2, the hard disk drive 3, the CD-ROM drive 5, and the flexible disk drive 7.
A document image to be processed is input by the input device 4, such as a scanner, and stored in, for example, the hard disk drive 3. The CPU 1 reads the program for realizing the above-described processing functions and processing methods from the recording medium of the storage means, executes the processing according to the program on the target document image, and outputs the processing result to the display 6 or the like.
As shown in FIG. 11, the image processing apparatus according to the present invention may also be implemented in a form in which the communication device 8 connects it to external apparatuses 11 to 13 via a communication line 20 such as the Internet, so that part of its functions resides on the network.
[0023]
[Effects of the invention]
(1) Effects corresponding to the first aspect of the invention
Information on the starting point height of the circumscribed rectangles in a character line (the starting point being the coordinates of the leading edge and top of a character) is totaled, and the language to which the character line belongs is determined by comparing the distribution of the totaled height information with a predetermined reference distribution that indicates the attribute of a language type. The processing procedure for determining the language type of a character line is therefore simplified, and processing can be made faster without increasing the calculation load. In addition, when the present invention is applied to character recognition processing, language identification can be realized at high speed by using features obtained in the course of line segmentation processing rather than by extracting new line features. Based on the result, a criterion for automatically selecting the document processing best suited to the language is obtained, providing an effective means for recognition processing.
(2) Effects corresponding to the second aspect of the invention
Since the height data of the in-line rectangle that does not represent the features of the European and American languages (which is noise) is deleted, the language determination accuracy can be improved.
(3) Effects corresponding to the inventions of claims 3 and 4
The base line of the character line is calculated from the end point data of the in-line rectangles, which is obtained without any new data extraction processing, and the starting point heights of the circumscribed rectangles can be corrected for skew, so that a correct determination can be made. Further, since noise data is excluded from the calculation of the skew line (base line), accuracy can be improved.
[0024]
(4) Effects corresponding to the invention of claim 5
The determination results for the individual character lines are totaled for each document to be processed, so that a single determination result can be obtained for the entire document.
(5) Effects corresponding to the inventions of claims 6 and 7
Since a procedure is used that checks whether the conditions for obtaining a correct determination result are present, erroneous determinations can be avoided and the reliability of the determination is improved, and processing that uses the determination result suffers no loss of efficiency.
(6) Effects corresponding to the invention of claim 8
Since Asian character lines can be determined easily from the direction of the character line, processing efficiency can be improved by omitting, according to that result, the determination processing based on the height of the in-line rectangles.
[0025]
(7) Effects corresponding to the inventions of claims 9 to 11
By installing a program for executing each step of the image processing method according to any one of claims 1 to 8 in a general-purpose processing device (computer), the effects of (1) to (6) can be realized easily. Providing the program on a recording medium improves convenience.
[Brief description of the drawings]
FIG. 1 shows an example of a document image to be processed.
FIG. 2 shows a result of creating a circumscribed rectangle of a black run that can be regarded as a character in the example of a document image (FIG. 1).
FIG. 3 shows a rectangle of a character line and a character circumscribed rectangle obtained as a result of the integration process in an example of a document image (FIG. 1).
FIG. 4 (A) shows the definition of rectangular coordinates, and FIGS. 4 (B) and 4 (C) show examples of arrangement of in-line rectangles of European and American character lines and Asian character lines, respectively.
FIG. 5 is a flowchart illustrating a basic procedure for determining a language to which a character line belongs.
FIG. 6 shows a dot (located at the midpoint of the height of a character line) as an example requiring compensation.
FIG. 7 shows an example of an input document image when skew occurs.
FIG. 8 is a flowchart illustrating a procedure for determining a language in which skew correction is enabled.
FIG. 9 is a flowchart illustrating a language determination processing procedure for performing a step of totalizing determination results over the entire document (document).
FIG. 10 shows a configuration of an image processing apparatus according to an embodiment of the present invention.
FIG. 11 shows another configuration of the image processing apparatus according to the embodiment of the present invention.
[Explanation of symbols]
1: CPU, 2: Memory,
3: Hard disk drive, 4: Input device,
5: CD-ROM drive, 6: Display (display device),
7: FD drive, 8: Communication device.

Claims

1. An image processing method that executes: a step of generating circumscribed rectangle information of character images from a document image to be processed and cutting out character lines based on the circumscribed rectangle information; a starting point height information totaling step of totaling information on the height, within the character line, of the starting point (the coordinates of the leading edge and top of a character) of each character circumscribed rectangle in the character line; and a language determination step of determining the language to which the character line belongs by comparing the distribution of the totaled height information with a predetermined reference distribution indicating an attribute of a language type.

2. The image processing method according to claim 1, wherein the starting point height information totaling step totals the information on the height in the character line only for those character circumscribed rectangles in the character line whose starting point height is equal to or greater than a predetermined value set for noise removal.

3. The image processing method according to claim 1, further executing a skew straight line calculating step of obtaining a straight line connecting the end points (the coordinates of the trailing edge and bottom of a character) of the character circumscribed rectangles in the character line, wherein the starting point height information totaling step calculates the height in the character line as the height from the skew straight line.

4. The image processing method according to claim 3, wherein the skew straight line calculating step obtains a straight line connecting the end points of only those character circumscribed rectangles in the character line whose end point height is equal to or less than a predetermined value set for noise removal.

5. The image processing method according to claim 1, further executing a step of totaling, for each document to be processed, the determination results of the character lines determined by the language determination step, and determining one language, or unknown, for the document from the totaled result according to a predetermined rule.

6. The image processing method according to claim 1, further executing a step of determining, based on the number of character circumscribed rectangles in the character line cut out by the step of cutting out the character line, whether or not to execute the step of determining the language of that line.

7. The image processing method according to claim 1, further executing a step of determining, based on the aspect ratio of the character line cut out by the step of cutting out the character line, whether or not to execute the step of determining the language of that line.

8. The image processing method according to claim 1, further executing a step of determining, based on the direction of the character line cut out by the step of cutting out the character line, whether or not to execute the step of determining the language of that line.

9. A program for causing a computer to execute each processing step of the image processing method according to claim 1.

10. An image processing apparatus comprising a computer on which the program according to claim 9 is installed, the computer being caused to function as means for processing a target image based on the program.

11. A recording medium on which the program according to claim 9 is recorded.