JP3835652B2

JP3835652B2 - Method for determining Japanese / English of document image and recording medium

Info

Publication number: JP3835652B2
Application number: JP12510398A
Authority: JP
Inventors: 亨水納; 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-09-10
Filing date: 1998-05-07
Publication date: 2006-10-18
Anticipated expiration: 2018-05-07
Also published as: JPH11191135A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書画像中の各文字領域に対して日本語領域であるのか英語領域であるのかを判定する文書画像の日本語英語判定方法および記録媒体に関する。
【０００２】
【従来の技術】
文書画像に対して文字認識処理を施す場合に、適切な言語を選択する必要がある。すなわち、英文ＯＣＲで日本語を認識しようとしてもアルファベットや数字以外は認識不可能であるし、また逆に日本語ＯＣＲで英文を認識しようとすると、文字切り出しや言語処理のうえで英文ＯＣＲを使用した場合よりも認識率が低くなってしまう。
【０００３】
従って、文字認識処理を施す前に、言語識別を行う必要が生じる。従来から文書中の文字種を識別する種々の手法が提案されている。例えば、２値化された文字行の縦方向または横方向の黒白反転回数を計数し、その分布を基に文字種の識別を行う文書認識装置がある（特開平５−１０８８７６号公報を参照）。
【０００４】
また、読み取った単語を認識させ、その認識結果と辞書との適合率を基に認識文字の言語種類を判別する文書認識装置もある（特開平６−１５００６１号公報を参照）。
【０００５】
【発明が解決しようとする課題】
上記した前者の装置では、文字種を識別する特徴として黒白反転回数を用いているが、この特徴はフォントや文書内容（かな、漢字、数字などの比率）による変動が大きく、このために識別の精度が低くなるという問題がある。
【０００６】
これに対して、後者の装置では、一度、文字認識を行っているので、ＯＣＲの性能がよければかなりの確率で字種が判明することになり、精度よく日英判別を行うことが可能となる。しかし、ＯＣＲは処理に多くの時間を要するという問題がある。
【０００７】
本発明は上記した事情を考慮してなされたもので、
本発明の目的は、精度よくかつ高速に日本語と英語の識別を行うと共に、識別する範囲についても各文字領域毎に、またページ単位毎に両者を識別できる文書画像の日本語英語判別方法および記録媒体を提供することにある。
【０００８】
【課題を解決するための手段】
前記目的を達成するために、請求項１記載の発明では、文書画像中の各文字領域が日本語領域であるか英語領域であるかを判定する文書画像の日本語英語判定方法であって、前記各文字領域から行を切り出し、行内の矩形の最大高さに対する行内の各矩形の高さの割合が高い場合の矩形の頻度数（以下、第１の頻度数）と、行内の矩形の最大高さに対する行内の各矩形の高さの割合が低い場合の矩形の頻度数（以下、第２の頻度数）とを算出し、前記第１の頻度数／第２の頻度数が所定の第１の閾値を超えるとき前記各文字領域が日本語領域であると判定し、前記第１の頻度数／第２の頻度数が所定の第２の閾値未満のとき前記各文字領域が英語領域であると判定し、それ以外のときは不明領域と判定し、前記不明領域については、予め算出された日本語の特性値に近いとき日本語領域であると判定し、予め算出された英語の特性値に近いとき英語領域であると判定することを特徴としている。
【０００９】
請求項２記載の発明では、請求項１記載の文書画像の日本語英語判定方法をコンピュータに実現させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体であることを特徴としている。
【００１０】
【発明の実施の形態】
以下、本発明の一実施例を図面を用いて具体的に説明する。
〈実施例１〉
図１は、本発明の実施例１の構成を示す。図において、１０１は、文書画像を入力する画像入力手段、１０２は、入力文書画像を縮小する画像縮小手段、１０３は、文書画像から連結成分を抽出する連結成分抽出手段、１０４は、抽出した連結成分を分類し、統合することによって文字領域を生成する領域生成手段、１０５は、文字領域単位またはページ単位で日本語と英語を判別する日英判別手段、１０６は、全体を制御する制御部、１０７は、入力された文書画像データや連結成分データ、領域データなど各種データを記憶するデータ記憶部、１０８は、データ通信路、１０９は、ネットワーク、回線などを介してホストなどに接続するデータ通信手段である。
【００１１】
図２は、本発明の実施例１の全体の処理フローチャートを示す。以下、図２を参照しながら、本発明の処理動作を説明する。
まず、画像入力手段１０１は、文書を読み取ることによって文書画像を得る（ステップ２０１）。この画像入力手段は、例えばスキャナ、ファックスなどであり、またデータ通信手段１０９を介してネットワーク経由で別の機器から画像を得るようにしてもよい。
【００１２】
次に、画像縮小手段１０２は、入力された文書画像を縮小する（ステップ２０２）。この処理は、例えば入力文書画像を１／８程度にＯＲ縮小する処理である。すなわち、８×８画素を１画素に縮小するもので、６４画素中に１つでも黒画素があれば縮小画素は黒画素とする処理である。
【００１３】
連結成分抽出手段１０３は、縮小画像から黒画素連結成分を抽出する（ステップ２０３）。領域生成手段１０４は、抽出した連結成分を分類し、統合して文字領域を生成する（ステップ２０４）。この領域生成方法として、例えば特開平６−２００９２号公報に記載された公知の方法を用いればよい。このとき、各文字領域を構成する連結成分の情報はデータ記憶部１０７に格納、保持する。
【００１４】
続いて、生成した文字領域について、日英判別手段１０５は日本語か英語かの判定を行う（ステップ２０５）。
【００１５】
ステップ２０２において画像をＯＲ縮小することにより、近傍の黒画素どうしが融合する。ここで英文においては単語間にはスペースが存在し、単語内の文字間は非常に狭いという特徴がある。一方、日本語においては、句読点の前後以外では文字間隔は大きくは変わらない。
【００１６】
図３は、英文、日本語文の画像例と、その外接矩形を示す。英文画像３０１を縮小し、連結成分を抽出した結果を外接矩形で表現したものが外接矩形３０２である（なお、縮小処理しているので外接矩形３０２は、本来画像３０１より小さくなるべきだが、ここでは同じサイズで表現している）。英文画像では、単語毎に融合して連結成分が構成される。
【００１７】
日本語画像３０３と３０５の例について、同様に縮小して連結成分を抽出し、その外接矩形で表現すると、それぞれ外接矩形３０４、３０６のようになる。
【００１８】
英文の場合は、単語を構成する文字の数がある程度一定であるので、縦横比が２倍から６、７倍程度となる外接矩形が多くなる特徴がある。一方、日本語の場合は、外接矩形３０４に示すように英文では現れにくい長い矩形が生じたり、逆に外接矩形３０６のように細かい矩形が多く生じる特徴がある。
【００１９】
そこで、上記した連結成分矩形を「短」、「中」、「長」の３種類に分類し、これを各文字領域について集計する。図４は、実施例１の日英判定の処理フローチャートを示す。図４の処理は各文字領域毎に行われる。矩形の分類は、行方向が横の場合には例えば、幅／高さが２以下で「短」、幅／高さが２から６で「中」、それ以上で「長」とする（ステップ４０１）。そして、文字領域中におけるこの分類結果を集計し（ステップ４０２）、文字領域毎に日本語か英語かを判定する（ステップ４０３）。ここで、「短」矩形の数をＳＣＮＴ、「中」矩形の数をＮＣＮＴ、「長」矩形の数をＬＣＮＴとすると、日英の判定は図８（ステップ４０３の詳細フローチャート）に示すように行われる。
【００２０】
まず、ＬＣＮＴ／（ＮＣＮＴ＋ＳＣＮＴ）＞Ｔｈ１が成り立つかどうか調べる（ステップ８０１）。Ｔｈ１は予め定めたしきい値であり、例えば０．３程度とする。この条件式が成り立てば、長矩形が十分に多いということであり、当該文字領域は日本語領域であると判定する（ステップ８０４）。
【００２１】
次に、ステップ８０１でＮｏと判定されたとき、ＮＣＮＴ／（ＬＣＮＴ＋ＳＣＮＴ）＜Ｔｈ２が成り立つかどうかを調べる（ステップ８０２）。Ｔｈ２も予め定めたしきい値であり、例えば３とする。この条件式が成り立てば、中矩形が少ないということであり、当該文字領域は日本語領域であると判定する（ステップ８０４）。いずれの条件も満たさない場合は、英語領域と判定される（ステップ８０３）。
【００２２】
〈実施例２〉
上記した実施例１では、文字領域単位で日英の判定を行っている。この場合、文字領域によっては文字数が非常に少ない場合がある。そのような場合は、矩形の数が十分に得られないので矩形数の比率で日英判定を行うことが難しくなる可能性がある。実施例２は、矩形の数が十分でない場合を考慮した実施例である。
【００２３】
図５は、実施例２の処理フローチャートを示す。日英判別手段１０５は、集計された領域内の矩形の数が十分であるか否か（つまり所定の閾値Ｔｈ以上あるか否か）を調べ（ステップ５０１）、十分でない場合には、前掲した特開平６−１５００６１号公報に記載されているＯＣＲを利用した日英判別を行う（ステップ５０３）。この場合は、文字の数が少ないのでＯＣＲ処理を施しても処理時間の増大は少なくてすむ。そして、矩形の数が十分である場合には実施例１で説明した矩形長による日英の識別を行う（ステップ５０２）。
【００２４】
〈実施例３〉
次に、ページ単位で日英識別を行う実施例３について説明する。図６、７は、実施例３に係るステップ２０５の詳細フローチャートを示す。図６に示す方法は、「短」、「中」、「長」矩形の数の集計を文字領域毎でなくページ全体について行い（ステップ６０１、６０２）、その結果を使用してページ単位に日英の判定を行う（ステップ６０３）。この日英の判定方法は、図８の処理フローチャートに従って行う。このときのしきい値Ｔｈ１，Ｔｈ２は文字領域単位の処理の場合と異なるしきい値としてもよい。
【００２５】
図７に示す方法は、各文字領域毎に日英の判別を行い（ステップ７０２）、その結果を基に当該ページの日英判定を行う（ステップ７０３）。具体的には、日本語領域と判定された領域の数をＪｎ、英語領域と判定された領域の数をＥｎとして、Ｊｎ＞Ｅｎなら日本語ページ、Ｅｎ＞Ｊｎなら英語ページと判定する。Ｊｎ＝Ｅｎの場合はリジェクトし、あるいは日英の何れかに判定してもよい。
【００２６】
〈実施例４〉
上記した実施例とは異なる特徴を利用した日英識別方法について説明する。図９は、実施例４の構成を示す。実施例１と異なる点は、行切り出し部９０２と、ブロック抽出部９０３と、ブロック内文字種判別部９０４を設けている点である。他の構成要素は実施例１のものと同様である、図１０は、実施例４の処理フローチャートを示す。
【００２７】
まず、行切り出し部９０２は、文書画像の文字領域から行の切り出しを行う（ステップ１００１、１００２）。領域生成処理として、特開平６−２００９２号公報記載の技術を使用した場合には、領域を抽出した段階で行情報が得られているので、これを用いればよく、また電子通信学会論文「周辺密度分布、線密度、外接矩形特徴を利用した文書画像の領域分割」（秋山他、１９８６年８月、Ｖｏｌ．Ｊ６９−ＤＮｏ．８）に記載されている射影を用いる方法を用いてもよい。
【００２８】
次に、ブロック抽出部９０３は、単語相当のブロックを抽出する（ステップ１００３）。このブロック抽出方法として、本出願人が先に特願平８−３４７８１号で提案した方法を用いればよい。すなわち、ブロック抽出部１１１は、行データ内部の外接矩形を検出し、その外接矩形をブロックデータにまとめる。このブロックデータにまとめる方法は、次の通りである。文字矩形の間隔（まだ一つの矩形が一文字とは確定されていない。従って、漢字の場合、偏とつくりに分離したものがそれぞれ一つの矩形となる場合も多い）のヒストグラムを求める。図１８は、抽出された文字矩形と、矩形間の距離を示す。図１９は、矩形間隔のヒストグラムを示す。
【００２９】
このヒストグラムにおいて、最も距離の短いピークは、漢字の偏とつくりの間隔や、プロポーショナル英字の同一単語内の文字間距離に現れる傾向がある。これらを統合しても異なる文字種がブロックに入ることは少ないので、それらを統合することでブロックデータを形成する。この処理を行うことによってプロポーショナルの単語や一文字が分離する（つまり偏とつくりからなる）漢字が一つに統合されることになる。
【００３０】
また、最も距離の長いピークは、単語間の距離、句読点と次の文字との距離に現れることが多い。これらは（特に単語間の距離は）文字種が変わる場合の境目に用いられることが多く、同一ブロックになることを避けたい。そこで、最も距離の長いピーク値以上の距離の文字矩形については、同一ブロックにしないように処理する。
【００３１】
さらに、対象矩形の両隣の矩形との距離（Ａ，Ｂ）を測定し、その差（Ａ−Ｂ）が所定の閾値以上のとき、長い方の距離の矩形同志は統合せず、短い方の距離の矩形を統合するように処理する。図２０は、矩形間の間隔の差が大きい位置で矩形の統合を行わない場合を説明する図である。図２０では、差が所定の閾値以上大きい位置で矩形の統合を行わないので、３つのブロックが形成される。このような処理を行うことによって、プロポーショナルの英文などで、単語間の距離が絶対的に近くても、文字間距離とは差があるはずであるので、一つの単語だけをまとめて統合できる。また、プロポーショナルフォントであっても日本語の漢字部分は比較的等間隔に配置されるので、日本語文をまとめる場合にも都合がよい。
【００３２】
上記したブロック抽出方法を用いることによって、英文の場合、日本語文書と違って単語と単語の間は半角相当のスペースで区切られるために、他の文字種と混合してブロックデータとなることが避けられる。
【００３３】
続いて、ブロック内文字種判別部９０４は、ブロック毎の日英判別を行う（ステップ１００４）。これも前掲した出願の方法を用いればよい。つまり、ブロック内文字種判別部９０４は、上記処理によってブロック化されたまとまりが、日本語であるか、英数字であるかという文字種の判定を行う。ブロック内は同一文字種として判断する。この文字種の判定は次のように行う。すなわち、ブロック内の矩形の幅に対して、該矩形の垂直方向の黒ランの数または白黒反転回数が所定の閾値以上のとき日本語文字と識別し、抽出されたブロック内の矩形の垂直方向座標値を基に英字を識別する。図２１（ａ）、（ｂ）は、日本語と英字の場合の垂直方向ランの数の具体例を示す。英数字ではノイズがない理想的な場合、最大で“ｇ”の文字で４つのランができる（図２１（ｂ））。従って、５つ以上のランがカウントされる場合は日本語とする。図２１（ａ）に示す文字「像」の場合、垂直方向のランの数は、文字の下の数字で示すように変化する。
【００３４】
日英判別手段９０５は、ブロック毎の判別結果を集計して当該領域の日英判別を行う（ステップ１００５）。ここで、日本語と判定されたブロックの数をＪＣＮＴ、英語と判定されたブロックの数をＥＣＮＴ、不定と判定されたブロックの数をＮＣＮＴとする。図１１は、ステップ１００５の詳細のフローチャートである。ＪＣＮＴ＊Ｔｈ３＞ＥＣＮＴのときは日本語と判定し（ステップ１１０１、１１０５）、そうではなく、ＥＣＮＴ＞ＪＣＮＴのときは英語と判定する（ステップ１１０２、１１０４）。それ以外の場合はリジェクトとする（ステップ１１０３）。しきい値Ｔｈ３は、例えば２とする。
【００３５】
〈実施例５〉
上記した実施例４では、文字領域単位で日英の判定を行っている。この場合、文字領域によっては文字数が非常に少ない場合がある。そのような場合は、矩形の数が十分に得られないのでブロックの判別結果数の比率で日英判定を行うことが難しくなる可能性がある。実施例５は、ブロックの数が十分でない場合の実施例である。
【００３６】
図１２は、実施例５の処理フローチャートを示す。日英判別手段１０５は、集計された文字領域内のブロックの数が十分であるか否か（つまり所定の閾値Ｔｈ以上あるか否か）を調べ（ステップ１２０１）、十分でない場合には、前掲した特開平６−１５００６１号公報に記載されているＯＣＲを利用した日英判別を行う（ステップ１２０３）。この場合は、文字の数が少ないのでＯＣＲ処理を施しても処理時間の増大は少なくてすむ。そして、ブロックの数が十分である場合には実施例４で説明したブロック毎の判別結果による日英の識別を行う（ステップ１２０２）。
【００３７】
〈実施例６〉
実施例６は、実施例４の文字領域毎の日英判別を、ページ単位の日英判別に変更したものである。実施例６の処理フローチャートは、図６、７を用いる。
【００３８】
図６の処理においては、ＪＣＮＴ、ＥＣＮＴ、ＮＣＮＴの集計を文字領域毎でなくページ全体について行い、その結果を使用して、前述した図１１の処理方法によって日英の判定を行う。このときＴｈ３は文字領域単位の場合とは異なってもよい。
【００３９】
図７の処理においては、まず、各文字領域毎に判別し、その結果から当該ページの日英判定を行う。具体的には、日本語領域と判定された領域の数をＪｎ、英語領域と判定された領域の数をＥｎとして、Ｊｎ＞Ｅｎなら日本語ページ、Ｅｎ＞Ｊｎなら英語ページと判定する。Ｊｎ＝Ｅｎの場合はリジェクトとしてもいいし、日英の何れかにしてもよい。
【００４０】
〈実施例７〉
実施例７では、文字領域毎またはページ単位で日英判別を行う際に、図１３に示すように矩形長を利用する日英判別処理（ステップ１３０１）と、ブロック毎の判別結果を利用する日英判別処理（ステップ１３０２）によって、それぞれ日英の判別を行う。そして、それぞれの判別結果から最終的に日英に判別を行う（ステップ１３０３）。
【００４１】
両者共に日本語または英語と判定された場合には、最終結果はそのまま日本語または英語と判定すればよい。何れかがリジェクトと判定された場合には、リジェクトでない方の判定結果を最終結果とする。
【００４２】
両者の判定結果が、一方が日本語で、他方が英語で、その結果が一致しない場合には、以下のいずれかの判定をする。
（１）リジェクトとする。
（２）両者の確信度を算出し、値の大きな方の結果を採用する。
矩形長を利用する判別方法の確信度としては、例えば
ＬＣＮＴ／（ＮＣＮＴ＋ＳＣＮＴ）＞Ｔｈ１で、Ｔｈ１＝０．３の場合にはＬＣＮＴ／（ＮＣＮＴ＋ＳＣＮＴ）＊２．５の値（ただし上限を１とする）
ＮＣＮＴ／（ＬＣＮＴ＋ＳＣＮＴ）＜Ｔｈ２で、Ｔｈ２＝３の場合には（ＬＣＮＴ＋ＳＣＮＴ）／ＮＣＮＴ＊２．５の値（ただし上限を１とする）
ＮＣＮＴ／（ＬＣＮＴ＋ＳＣＮＴ）＞Ｔｈ２で、Ｔｈ２＝３の場合にはＮＣＮＴ／（ＬＣＮＴ＋ＳＣＮＴ）＊０．３３の値（ただし上限を１とする）
とする。
【００４３】
ブロック毎の判別結果を利用する判別方法の確信度としては、例えば
ＪＣＮＴ＊Ｔｈ３＞ＥＣＮＴで、Ｔｈ３＝２の場合には、ＪＣＮＴ／（ＥＣＮＴ＊３）の値（ただし上限を１とする）
ＥＣＮＴ＞ＪＣＮＴの場合には、ＥＣＮＴ／ＪＣＮＴ＊０．７の値（ただし上限を１とする）
とする。
【００４４】
〈実施例８〉
図１４は、実施例８の構成を示す。また、図１５は、実施例８の処理フローチャートを示す。この実施例では、入力された文書のページ全体について、日英判別部１４１２は、前述した実施例３、６の方法を用いて、そのページが日本語であるか英語であるかの日英識別処理を行い（ステップ１５０１、１５０２）、その判別結果に基づいて選択部１４０３は英文文書認識部１４０４または日本語文書認識部１４０５を選択し、選択された言語の文書認識処理を行い（ステップ１５０４、１５０５）、その認識結果をディスプレイなどの出力部に出力する（ステップ１５０６）。
【００４５】
なお、日本語と英語とではその属性が異なることから、領域分割処理やフォント識別処理なども切り替えた方がよい場合がある。そこで、本実施例の文書認識部は、文字認識処理だけではなく、上記した領域分割処理やフォント識別処理も含まれている。
【００４６】
〈実施例９〉
図１６は、実施例９の構成を示し、図１７は、実施例９の処理フローチャートを示す。実施例８と異なる点は、日英識別を文字領域毎に行う点である。そのために、領域分割部１６０２は、入力文書を文字領域に分割する（ステップ１７０１、１７０２）。ここで、領域分割部では、日英両方に適応できる領域分割方法を使用する。分割処理された後、日英判別部１６０３は文字領域毎に、例えば前述した実施例１の方法を用いて日英識別処理を行い（ステップ１７０４）、その判別結果に基づいて選択部１６０４は英文文書認識部１６０５または日本語文書認識部１６０６を選択し、選択された言語の文書認識処理を行い（ステップ１７０５、１７０６）、その認識結果をディスプレイなどの出力部１６０７に出力する（ステップ１７０７）。なお、実施例９の文書認識部では、文書認識処理の他にフォント識別処理も行う。
【００４７】
〈実施例１０〉
前述した各実施例は、黒画素連結成分や矩形長を特徴量として日本語と英語を判定している。しかし、黒画素連結成分を用いる判定方法は処理時間がかかり、また矩形長を利用する方法はリジェクトの発生が高くなることもある。なお、外接矩形の上辺、下辺の行内での相対位置の頻度分布のピーク位置を基に和文か英文かを識別する方法もあるが（特公平７−２１８１７号公報を参照）、傾きがある文書が入力された場合には、頻度分布が大きく変化し、識別精度が低下してしまうという問題点がある。
【００４８】
そこで、本実施例では、行高さに対する、行内の外接矩形の高さのヒストグラムを用いて日本語と英語を識別することにより、文書画像の領域毎に精度よくかつ高速に日本語と英語を識別するものである。そして、上記した日英識別方法でも判別不可能な領域に対しては、別の方法を用いて日英識別を行う。
【００４９】
図２２は、実施例１０の構成を示す。また、図２３は、実施例１０の全体の処理フローチャートである。まず、画像入力手段２２０１は、文書を読み取ることによって文書画像を得る（ステップ２３０１）。この画像入力手段は、例えばスキャナ、ファックスなどであり、またデータ通信手段２２０７を介してネットワーク経由で別の機器から画像を得るようにしてもよい。
【００５０】
次に、領域生成手段２２０２は、文字領域を生成する（ステップ２３０２）。この領域生成方法として、例えば特開平６−２００９２号公報に記載された方法を用いればよい。次に、行切り出し手段２２０３は、文字領域から文字認識のための行の切り出しを行なう。つまり、文字の外接矩形を求め、それらを統合して行を生成する（ステップ２３０３）。日英識別手段２２０４は、生成した文字領域について日英識別を行なう（ステップ２３０４）。
【００５１】
日英の識別は以下のようにして行う。図２７は、日英識別（ステップ２３０４）の詳細のフローチャートである。図２４は、切り出された行と行内の外接矩形の一例を示す。まず、行高さに対する、行内の外接矩形高さの割合の頻度分布を算出する（ステップ２７０１、２７０２）。行高さをｌｉｎｅｈｅｉｇｈｔ、矩形高さをｈｅｉｇｈｔとする。割合をｈｅｉｇｈｔｒａｔｅ＝ｈｅｉｇｈｔ＊１００／ｌｉｎｅｈｅｉｇｈｔとする。また、図２５のような傾きのある文書の場合は、より精度良く日英識別するために、行高さの代わりにその行の矩形の高さの最大値をｌｉｎｅｈｅｉｇｈｔとして用いてもよい。つまり、傾きのある入力文書については、行内矩形の最大高さに対する、行内各外接矩形高さの割合のヒストグラムを基に日英識別する。
【００５２】
上記した割合ｈｅｉｇｈｔｒａｔｅが例えば８０以上の場合の矩形数をｌｃｎｔとし、ｈｅｉｇｈｔｒａｔｅが例えば７０以上８０未満の場合の矩形数をｎｃｎｔとし、ｈｅｉｇｈｔｒａｔｅが例えば４０以上７０未満の場合の矩形数をｓｃｎｔとする。文字領域内のすべての矩形に対し、ｌｃｎｔ，ｎｃｎｔ，ｓｃｎｔを求める。
【００５３】
図２６は、日本語文書と英語文書について調べた矩形数の一例を示す。一般に、日本語はｌｃｎｔが大きく、英語はｓｃｎｔが大きいという傾向がある。そこで、所定の閾値ｔｈＪ，ｔｈＥを設定し、ｌｃｎｔ／ｓｃｎｔ＞ｔｈＪのとき日本語と判定し（ステップ２７０３）、ｌｃｎｔ／ｓｃｎｔ＜ｔｈＥのとき英語と判定する（ステップ２７０４）。それ以外のときは不明領域とする（ステップ２７０５）。
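The height-ratio counting and the first-stage decision of Embodiment 10 described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the maximum rectangle height in each line is used as `lineheight` (the variant the text suggests for skewed documents), and the thresholds `th_j` and `th_e` must be supplied by the caller because the text gives no values for thJ and thE. The ratio test is written in cross-multiplied form so that a region with `scnt == 0` does not cause a division by zero.

```python
def count_height_classes(lines):
    """Count rectangles by height ratio.

    lines: list of lists of rectangle heights, one inner list per text line.
    Returns (lcnt, ncnt, scnt) using the example bands 80+, 70-80, 40-70.
    """
    lcnt = ncnt = scnt = 0
    for heights in lines:
        if not heights:
            continue
        lineheight = max(heights)  # assumption: max rectangle height stands in for line height
        for height in heights:
            heightrate = height * 100 / lineheight
            if heightrate >= 80:
                lcnt += 1
            elif heightrate >= 70:
                ncnt += 1
            elif heightrate >= 40:
                scnt += 1
            # rectangles below 40 are not counted in any class
    return lcnt, ncnt, scnt


def judge_by_height(lcnt, scnt, th_j, th_e):
    """Return 'japanese', 'english', or 'unknown' from lcnt/scnt."""
    if lcnt > th_j * scnt:   # equivalent to lcnt / scnt > thJ when scnt > 0
        return "japanese"
    if lcnt < th_e * scnt:   # equivalent to lcnt / scnt < thE when scnt > 0
        return "english"
    return "unknown"
```

A region that lands in the "unknown" band would then be passed to the statistical second stage described in the following paragraphs.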
【００５４】
上記した不明領域に対して、統計的手法を用いて日英識別することができる。図２８は、不明領域に対する詳細な処理フローチャートである。例えば、あらかじめ日本語領域と英語領域の特徴値ｌｃｎｔ、ｎｃｎｔ、ｓｃｎｔを正規化し、その平均値と共分散行列の逆行列を日本語、英語についてそれぞれ求める。そして、平均値と共分散行列の逆行列を用いて、日本語、英語のそれぞれについてマハラノビス距離を求める（ステップ２８０１、２８０２）。
【００５５】
日本語のマハラノビス距離をＤｊ、英語のマハラノビス距離をＤｅとするとき、所定の閾値をＭｅ、Ｍｊとすると、Ｄｊ／Ｄｅ＞Ｍｅのとき英語と判定し（ステップ２８０３）、Ｄｊ／Ｄｅ＜Ｍｊのとき日本語と判定する（ステップ２８０４）。何れの条件にも満足しない場合は不明領域と判定する（ステップ２８０５）。なお、上記したマハラノビス距離の代わりに、平均値とのユークリッド距離やシティブロック距離を用いてもよい。
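The Mahalanobis-distance test for unknown regions can be sketched as below. It assumes the per-language mean vectors and inverse covariance matrices of the normalized features (lcnt, ncnt, scnt) have already been estimated from training regions; how they are normalized and estimated is not specified in the text, so the function names and the `(mean, inv_cov)` tuple layout are hypothetical. The comparison Dj/De against Me and Mj is cross-multiplied to avoid dividing by a zero distance.

```python
import math


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of x from mean, given the inverse covariance matrix."""
    d = [a - b for a, b in zip(x, mean)]
    n = len(d)
    squared = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding error


def judge_unknown(x, japanese_model, english_model, me, mj):
    """x: normalized feature vector; *_model: (mean, inv_cov) tuples (assumed layout)."""
    dj = mahalanobis(x, *japanese_model)
    de = mahalanobis(x, *english_model)
    if dj > me * de:   # Dj / De > Me: far from the Japanese model, near the English one
        return "english"
    if dj < mj * de:   # Dj / De < Mj: near the Japanese model
        return "japanese"
    return "unknown"
```

Replacing `mahalanobis` with a Euclidean or city-block distance to the mean gives the alternatives mentioned at the end of the paragraph.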
【００５６】
さらに不明と判定された領域に対して、英文認識の確信度を用いて日英識別を行う。図２９は、ステップ２８０５の詳細な処理フローチャートである。英文認識で確信度を算出する（ステップ２９０１）。次いで、算出された確信度について、例えば６０％以上の確信度をもつ単語の個数をＧｏｏｄ、６０％未満で確信度０でない単語の個数をＢａｄ、確信度が０の単語の個数をＺｅｌｏとする（ステップ２９０２）。
【００５７】
日英識別の判定値をＶａｌｕｅとするとき、Ｖａｌｕｅ＝Ｇｏｏｄ／（Ｇｏｏｄ＋Ｂａｄ＋Ｚｅｌｏ）
とし（ステップ２９０３）、Ｖａｌｕｅが所定の閾値ｔｈｅｏｃｒを超えれば（ステップ２９０４）、英語と判定し、それ以下ならば日本語と判定する。
【００５８】
なお、Ｚｅｌｏに重み付けしてもよい。Ｚｅｌｏを例えばＢａｄの３個分とすると、Ｖａｌｕｅは、
Ｂａｄ＝Ｂａｄ＋Ｚｅｌｏ×３であるから
Ｖａｌｕｅ＝Ｇｏｏｄ／（Ｇｏｏｄ＋Ｂａｄ）
となり、Ｖａｌｕｅが閾値ｔｈｅｏｃｒを超えれば英語、それ以下ならば日本語と判定することもできる。このように、日英識別判定のための文字数が少ない領域でも、英文認識による確信度で日英識別しているので、精度よく領域単位の日英識別が行われる。
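The OCR-confidence score used as the last fallback can be sketched as follows. The sketch assumes word confidences are non-negative percentages returned by some English OCR engine; the function name and parameters are hypothetical. Setting `zero_weight` to 3 reproduces the weighted variant in which each zero-confidence word counts as three "Bad" words.

```python
def english_score(confidences, good_min=60, zero_weight=1):
    """Value = Good / (Good + Bad + zero_weight * Zero) for a list of word confidences."""
    good = sum(1 for c in confidences if c >= good_min)
    zero = sum(1 for c in confidences if c == 0)
    bad = len(confidences) - good - zero  # below good_min but not zero
    total = good + bad + zero_weight * zero
    return good / total if total else 0.0
```

The region would be judged English when the returned value exceeds the threshold called theocr in the text, and Japanese otherwise; the text gives no numeric value for that threshold.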
【００５９】
〈実施例１１〉
本実施例は、入力文書画像を縮小した画像から外接矩形を生成し、生成された矩形同士で適当な統合を行い、統合後の矩形長の縦横比のヒストグラムを用いて日英識別をより精度良く行なう実施例である。
【００６０】
図３０は、実施例１１の構成を示す。また、図３１は、実施例１１の全体の処理フローチャートである。上記した実施例と同様にして画像入力手段３００１によって入力された文書画像は、画像縮小手段３００２によって縮小される（ステップ３１０１、３１０２）。この処理は、例えば文書画像を１／４程度にＯＲ圧縮（４×４画素を１画素に縮小し、１６画素中に１つでも黒画素があれば縮小画像は黒とする）する。
【００６１】
次に、領域生成手段３００３は、文字領域を生成する（ステップ３１０３）。この領域生成方法として、例えば特開平６−２００９２号公報に記載された方法を用いればよい。続いて、矩形統合手段３００４は、日英の特性が良く表れるように、矩形の統合を行なう（ステップ３１０４）。例えば、図３２に示すように、矩形１、２のｙ座標（縦方向）の上下座標が近くかつ、隣同士の矩形１、２のｘ座標が非常に近い場合（例えば、矩形間の水平距離が英語のスペースに相当する距離より小さい場合）、矩形を統合する。また、例えば、図３３に示すように、左側の矩形１が右側の矩形２をｙ座標で包含する位置関係にありかつ、隣同士の矩形１、２のｘ座標が非常に近い場合（例えば、矩形間の水平距離が英語のスペースに相当する距離より小さい場合）、矩形を統合する。
【００６２】
そして、矩形縦横比（矩形長縦／矩形長横）を用いて、長矩形、中矩形、小矩形、極小矩形の４つの特徴量に分ける（図３４）。一般に、日本語は長矩形の出現する割合が高く、また、英語は中矩形の出現する割合が高い。この特性の違いを利用して、日英識別手段３００５は、識別判定式を作成し、日英識別を行なう（ステップ３１０５）。図３５は、日英識別処理の詳細のフローチャートである。
【００６３】
例えば、領域内での長矩形の領域数ｌｃｎｔ
領域内での中矩形の領域数ｎｃｎｔ
領域内での小矩形の領域数ｓｃｎｔ
領域内での極小矩形の領域数ｓｓｃｎｔ（ノイズの場合が多い）を算出し（ステップ３５０１）、領域内での長矩形の割合ｒａｔｉｏ１＝ｌｃｎｔ／（ｎｃｎｔ＋ｓｃｎｔ）を算出し（ステップ３５０２）、領域内での中矩形の割合ｒａｔｉｏ２＝ｎｃｎｔ／（ｌｃｎｔ＋ｓｃｎｔ）を算出する（ステップ３５０３）。なお、上記割合を算出するとき、ｓｓｃｎｔはノイズとして無視した。
【００６４】
そして、ｒａｔｉｏ１をｘ座標、ｒａｔｉｏ２をｙ座標とし、誤識別を極力少なく、日英重なっている部分はリジェクトになるように、日本語領域、英語領域、リジェクト領域に分ける。例えば、ｒａｔｉｏ２／ｒａｔｉｏ１＞ｔｈＥならば英語領域と判定（ステップ３５０４）し、ｒａｔｉｏ２／ｒａｔｉｏ１＜ｔｈＪならば日本語領域と判定し（ステップ３５０５）、それ以外の領域は日英不明とする（ステップ３５０６）。ここで、ｔｈＥ、ｔｈＪは所定の閾値である。
【００６５】
日英不明と判定された領域に対して、実施例１０と同様に、統計的手法を用いて日英識別する。例えば、あらかじめ日本語領域と英語領域の特徴値ｌｃｎｔ、ｎｃｎｔ、ｓｃｎｔを正規化し、その平均値と共分散行列の逆行列を日本語、英語でそれぞれ求める。平均値と共分散行列の逆行列を用いて日本語、英語のそれぞれのマハラノビス距離を求める。日本語のマハラノビス距離をＤｊ、英語のマハラノビス距離をＤｅとするとき、所定の閾値をＭｅ、Ｍｊとすると、Ｄｊ／Ｄｅ＞Ｍｅのとき英語、Ｄｊ／Ｄｅ＜Ｍｊのとき日本語と判定する。何れの条件も満たさない場合は不明と判定する。なお、マハラノビス距離の代わりに、平均値とのユークリッド距離やシティブロック距離を用いてもよい。
【００６６】
〈実施例１２〉
本発明は上記した実施例に限定されず、ソフトウェアによっても実現することができる。本発明をソフトウェアによって実現する場合には、図３６に示すように、ＣＰＵ、メモリ、表示装置、ハードディスク、キーボード、ＣＤ−ＲＯＭドライブ、スキャナなどからなるコンピュータシステムを用意し、ＣＤ−ＲＯＭなどのコンピュータ読み取り可能な記録媒体には、本発明の日本語英語判定機能、文書認識機能を実現するプログラムなどが記録されている。また、スキャナなどの画像入力手段から入力された文書画像などは一時的にハードディスクなどに格納される。そして、該プログラムが起動されると、一時保存された文書画像データが読み込まれて、日本語英語判定処理、文書認識処理を実行し、その結果をディスプレイなどに出力する。
【００６７】
【発明の効果】
以上、説明したように、本発明によれば、複数の判定方法を併用しているので、高精度に日本語と英語とを判別することができる。また、文書画像中の文字領域毎に精度よく日本語と英語の判別を行うことができ、文書画像のページ単位に、精度よく日本語と英語の判別を行うことができる。さらに、日本語または英語と判定された文書画像に対して、適切な文書認識処理を実行しているので、高精度な認識結果を得ることができる。
【図面の簡単な説明】
【図１】本発明の実施例１の構成を示す。
【図２】本発明の実施例１の全体の処理フローチャートを示す。
【図３】英文、日本語文の画像例と、その外接矩形を示す。
【図４】実施例１の日英判定の処理フローチャートを示す。
【図５】実施例２の処理フローチャートを示す。
【図６】実施例３に係るステップ２０５の第１の詳細フローチャートを示す。
【図７】実施例３に係るステップ２０５の第２の詳細フローチャートを示す。
【図８】ステップ４０３の詳細フローチャートを示す。
【図９】実施例４の構成を示す。
【図１０】実施例４の処理フローチャートを示す。
【図１１】ステップ１００５の詳細のフローチャートである。
【図１２】実施例５の処理フローチャートを示す。
【図１３】実施例７の処理フローチャートを示す。
【図１４】実施例８の構成を示す。
【図１５】実施例８の処理フローチャートを示す。
【図１６】実施例９の構成を示す。
【図１７】実施例９の処理フローチャートを示す。
【図１８】抽出された文字矩形と、矩形間の距離を示す。
【図１９】矩形間隔のヒストグラムを示す。
【図２０】矩形間の間隔の差が大きい位置で矩形の統合を行わない場合を説明する図である。
【図２１】（ａ）、（ｂ）は、日本語と英字の場合の垂直方向ランの数の具体例を示す。
【図２２】実施例１０の構成を示す。
【図２３】実施例１０の全体の処理フローチャートである。
【図２４】切り出された行と行内の外接矩形の一例を示す。
【図２５】文書が傾いている場合の行と行内の外接矩形の一例を示す。
【図２６】日本語文書と英語文書について調べた矩形数の一例を示す。
【図２７】日英識別（ステップ２３０４）の詳細な処理フローチャートである。
【図２８】不明領域に対する詳細な処理フローチャートである。
【図２９】ステップ２８０５の詳細な処理フローチャートである。
【図３０】実施例１１の構成を示す。
【図３１】実施例１１の全体の処理フローチャートである。
【図３２】矩形を統合する例を示す。
【図３３】矩形を統合する他の例を示す。
【図３４】４種類に分類された矩形を示す。
【図３５】実施例１１の日英識別処理の詳細な処理フローチャートである。
【図３６】実施例１２の構成を示す。
【符号の説明】
１０１画像入力手段
１０２画像縮小手段
１０３連結成分抽出手段
１０４領域生成手段
１０５日英判別手段
１０６制御部
１０７データ記憶部
１０８データ通信路
１０９データ通信手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a document image Japanese / English determination method and a recording medium for determining whether each character region in a document image is a Japanese region or an English region.
[0002]
[Prior art]
When character recognition is performed on a document image, an appropriate language must be selected. An English OCR engine applied to Japanese text cannot recognize anything other than alphabetic characters and numerals. Conversely, a Japanese OCR engine applied to English text yields a lower recognition rate than an English OCR engine, because of how it handles character segmentation and language processing.
[0003]
Therefore, it is necessary to perform language identification before performing the character recognition process. Conventionally, various methods for identifying a character type in a document have been proposed. For example, there is a document recognition apparatus that counts the number of black and white inversions in the vertical or horizontal direction of a binarized character line and identifies the character type based on the distribution (see Japanese Patent Laid-Open No. 5-108876).
[0004]
There is also a document recognition apparatus that recognizes a read word and discriminates the language type of a recognized character based on the matching rate between the recognition result and the dictionary (see Japanese Patent Laid-Open No. 6-150061).
[0005]
[Problems to be solved by the invention]
In the former apparatus, the number of black/white inversions is used as the feature for identifying the character type. This feature varies greatly with the font and with the document content (the ratio of kana, kanji, numerals, and so on), so the identification accuracy is low.
[0006]
The latter apparatus, by contrast, performs character recognition once, so if the OCR performs well the character type is determined with fairly high probability and Japanese and English can be distinguished accurately. However, OCR takes a long time to process.
[0007]
The present invention has been made in view of the above circumstances. An object of the present invention is to provide a Japanese/English determination method for document images, and a recording medium, that distinguish Japanese from English accurately and at high speed, and that can do so both for each character region and for each page.
[0008]
[Means for Solving the Problems]
To achieve the above object, the invention according to claim 1 is a Japanese/English determination method for a document image that determines whether each character region in the document image is a Japanese region or an English region. Lines are cut out from each character region. The method calculates the frequency of rectangles for which the ratio of the rectangle's height to the maximum rectangle height in the line is high (hereinafter, the first frequency) and the frequency of rectangles for which that ratio is low (hereinafter, the second frequency). When the first frequency / second frequency exceeds a predetermined first threshold, the character region is determined to be a Japanese region; when the first frequency / second frequency is less than a predetermined second threshold, the character region is determined to be an English region; otherwise it is determined to be an unknown region. An unknown region is determined to be a Japanese region when it is close to characteristic values calculated in advance for Japanese, and an English region when it is close to characteristic values calculated in advance for English.
[0009]
According to a second aspect of the present invention, there is provided a computer-readable recording medium on which a program for causing a computer to implement the method for determining Japanese / English of a document image according to the first aspect is recorded.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
<Example 1>
FIG. 1 shows the configuration of Embodiment 1 of the present invention. In the figure, 101 is an image input means for inputting a document image; 102 is an image reduction means for reducing the input document image; 103 is a connected component extraction means for extracting connected components from the document image; 104 is a region generation means that generates character regions by classifying and integrating the extracted connected components; 105 is a Japanese/English discriminating means for distinguishing Japanese from English in character-region units or page units; 106 is a control unit that controls the whole; 107 is a data storage unit that stores various data such as the input document image data, connected component data, and region data; 108 is a data communication path; and 109 is a data communication means for connecting to a host or the like via a network, line, or the like.
[0011]
FIG. 2 shows an overall process flowchart of the first embodiment of the present invention. The processing operation of the present invention will be described below with reference to FIG.
First, the image input unit 101 obtains a document image by reading a document (step 201). The image input means is, for example, a scanner or a fax machine, and an image may be obtained from another device via the network via the data communication means 109.
[0012]
Next, the image reduction means 102 reduces the input document image (step 202). This is, for example, an OR reduction of the input document image to about 1/8: each 8 × 8 block of pixels is reduced to one pixel, and the reduced pixel is black if at least one of the 64 pixels is black.
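The OR reduction of step 202 can be sketched as follows. This is a minimal illustration that assumes the binary image is a list of rows of 0/1 values with 1 meaning black; the function name is hypothetical, and the handling of partial blocks at the right and bottom edges (they are reduced like full blocks) is an assumption, since the text does not address it.

```python
def or_reduce(image, factor=8):
    """OR-reduce a binary image: an output pixel is black (1) if any
    pixel in the corresponding factor x factor input block is black."""
    height = len(image)
    width = len(image[0]) if height else 0
    out_h = (height + factor - 1) // factor  # round up so edge pixels are kept
    out_w = (width + factor - 1) // factor
    reduced = [[0] * out_w for _ in range(out_h)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                reduced[y // factor][x // factor] = 1
    return reduced
```

Because a single black pixel is enough to blacken the output pixel, narrow gaps such as those between the letters of an English word close up, which is the effect the later steps rely on.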
[0013]
The connected component extracting means 103 extracts a black pixel connected component from the reduced image (step 203). The area generation unit 104 classifies the extracted connected components and integrates them to generate a character area (step 204). As this region generation method, for example, a known method described in JP-A-6-20092 may be used. At this time, information on the connected components constituting each character area is stored and held in the data storage unit 107.
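Extracting black-pixel connected components and their circumscribed rectangles (step 203) can be sketched with a breadth-first flood fill. The text does not say which connectivity is used, so 8-connectivity is an assumption here, as are the function name and the inclusive `(x0, y0, x1, y1)` box format. The grouping of components into character regions (step 204) follows the cited publication and is not reproduced.

```python
from collections import deque


def bounding_boxes(image):
    """Return one inclusive (x0, y0, x1, y1) box per 8-connected black component,
    in raster-scan order of each component's first pixel."""
    h = len(image)
    w = len(image[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not image[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```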
[0014]
Subsequently, the Japanese-English discriminating means 105 judges whether the generated character area is Japanese or English (step 205).
[0015]
In step 202, OR-reducing the image fuses neighboring black pixels. English text has spaces between words, while the gaps between characters within a word are very narrow. In Japanese text, by contrast, the character spacing does not vary much except before and after punctuation.
[0016]
FIG. 3 shows example images of English and Japanese text and their circumscribed rectangles. Circumscribed rectangles 302 represent the connected components extracted after reducing English image 301 (because of the reduction, rectangles 302 should properly be smaller than image 301, but they are drawn here at the same size). In an English image, the characters of each word fuse into a single connected component.
[0017]
For the examples of the Japanese images 303 and 305, when the connected components are extracted in the same manner and expressed by their circumscribed rectangles, the circumscribed rectangles 304 and 306 are obtained, respectively.
[0018]
In English text the number of characters per word is fairly constant, so many circumscribed rectangles have an aspect ratio of about 2 to 6 or 7. Japanese text, by contrast, produces long rectangles that rarely appear in English, as in circumscribed rectangles 304, or many small rectangles, as in circumscribed rectangles 306.
[0019]
Therefore, the connected-component rectangles are classified into three types, “short”, “medium”, and “long”, and the counts are totaled for each character region. FIG. 4 shows the processing flowchart of the Japanese/English determination in Embodiment 1. The processing of FIG. 4 is performed for each character region. When the line direction is horizontal, a rectangle is classified, for example, as “short” if width/height is 2 or less, “medium” if width/height is between 2 and 6, and “long” if it is greater (step 401). The classification results within the character region are then totaled (step 402), and each character region is determined to be Japanese or English (step 403). With SCNT the number of “short” rectangles, NCNT the number of “medium” rectangles, and LCNT the number of “long” rectangles, the determination proceeds as shown in FIG. 8 (the detailed flowchart of step 403).
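Steps 401 and 402 can be sketched as below for horizontal text. The function names are hypothetical, and treating a ratio of exactly 6 as “medium” is an assumption, because the text only says "2 to 6" and "greater".

```python
def classify_rectangle(width, height, short_max=2.0, medium_max=6.0):
    """Classify one circumscribed rectangle by its width/height ratio."""
    ratio = width / height
    if ratio <= short_max:
        return "short"
    if ratio <= medium_max:
        return "medium"
    return "long"


def count_classes(rects):
    """rects: iterable of (width, height). Returns (SCNT, NCNT, LCNT)."""
    counts = {"short": 0, "medium": 0, "long": 0}
    for width, height in rects:
        counts[classify_rectangle(width, height)] += 1
    return counts["short"], counts["medium"], counts["long"]
```

For vertical text the roles of width and height would presumably be swapped, which the text implies but does not spell out.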
[0020]
First, it is checked whether LCNT / (NCNT + SCNT) > Th1 holds (step 801). Th1 is a predetermined threshold, for example about 0.3. If the condition holds, there are sufficiently many long rectangles, and the character region is determined to be a Japanese region (step 804).
[0021]
Next, when it is determined No in step 801, it is checked whether NCNT / (LCNT + SCNT) <Th2 is satisfied (step 802). Th2 is also a predetermined threshold value, for example, 3. If this conditional expression is satisfied, it means that there are few middle rectangles, and it is determined that the character area is a Japanese area (step 804). If neither condition is satisfied, it is determined as an English region (step 803).
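The two tests of FIG. 8 can be sketched as a single function. The default thresholds are the example values from the text (Th1 = 0.3, Th2 = 3); the function name is hypothetical. Both ratio tests are written in cross-multiplied form, which is equivalent when the denominators are positive and avoids division by zero; treating a region with no rectangles as English simply falls out of that form and is not something the text specifies.

```python
def judge_region(scnt, ncnt, lcnt, th1=0.3, th2=3.0):
    """Return 'japanese' or 'english' from the short/medium/long counts."""
    # Step 801: LCNT / (NCNT + SCNT) > Th1 means many long rectangles.
    if lcnt > th1 * (ncnt + scnt):
        return "japanese"
    # Step 802: NCNT / (LCNT + SCNT) < Th2 means few medium rectangles.
    if ncnt < th2 * (lcnt + scnt):
        return "japanese"
    # Step 803: neither condition holds.
    return "english"
```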
[0022]
<Example 2>
In the first embodiment described above, Japanese / English is determined in units of character areas. In this case, the number of characters may be very small depending on the character area. In such a case, since the number of rectangles cannot be obtained sufficiently, it may be difficult to make a Japanese-English determination based on the ratio of the number of rectangles. The second embodiment is an embodiment that considers a case where the number of rectangles is not sufficient.
[0023]
FIG. 5 shows the processing flowchart of Embodiment 2. The Japanese/English discriminating means 105 checks whether the number of rectangles counted in the region is sufficient (that is, whether it is at or above a predetermined threshold Th) (step 501). If it is not sufficient, Japanese/English discrimination using OCR, as described in the above-mentioned Japanese Patent Laid-Open No. 6-150061, is performed (step 503). Because there are few characters in this case, applying OCR adds little processing time. If the number of rectangles is sufficient, Japanese and English are distinguished by rectangle length as described in Embodiment 1 (step 502).
[0024]
<Example 3>
Next, Embodiment 3, which performs Japanese/English discrimination in page units, is described. FIGS. 6 and 7 show detailed flowcharts of step 205 according to Embodiment 3. In the method of FIG. 6, the numbers of “short”, “medium”, and “long” rectangles are totaled over the entire page rather than for each character region (steps 601 and 602), and the result is used to determine Japanese or English in page units (step 603). This determination follows the processing flowchart of FIG. 8. The thresholds Th1 and Th2 used here may differ from those used for processing in character-region units.
[0025]
In the method shown in FIG. 7, Japanese / English is determined for each character area (step 702), and based on the result, Japanese / English is determined for the page (step 703). Specifically, Jn is the number of areas determined to be Japanese, and En is the number of areas determined to be English. If Jn> En, it is determined to be a Japanese page, and if En> Jn, it is determined to be an English page. If Jn = En, it may be rejected or judged to be either English or Japanese.
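The page-level vote of FIG. 7 can be sketched as follows. The function name and label strings are hypothetical; the `tie` parameter reflects the text's statement that a tie may be rejected or assigned to either language.

```python
def judge_page(region_labels, tie="reject"):
    """Majority vote over per-region labels ('japanese' / 'english')."""
    jn = region_labels.count("japanese")
    en = region_labels.count("english")
    if jn > en:
        return "japanese"
    if en > jn:
        return "english"
    return tie  # Jn == En: reject, or whichever language the caller prefers
```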
[0026]
<Example 4>
A description will be given of a Japanese-English identification method using features different from the above-described embodiments. FIG. 9 shows the configuration of the fourth embodiment. The difference from the first embodiment is that a line segmentation unit 902, a block extraction unit 903, and an in-block character type discrimination unit 904 are provided. The other components are the same as those of the first embodiment. FIG. 10 shows a process flowchart of the fourth embodiment.
[0027]
First, the line segmentation unit 902 cuts out lines from the character regions of the document image (steps 1001 and 1002). When the technique described in Japanese Patent Laid-Open No. 6-20092 is used for region generation, line information is already available once the regions have been extracted and can be used directly. Alternatively, the projection-based method described in the Institute of Electronics and Communication Engineers paper “Region segmentation of document images using peripheral density distribution, line density, and circumscribed rectangle features” (Akiyama et al., August 1986, Vol. J69-D No. 8) may be used.
[0028]
Next, the block extraction unit 903 extracts blocks corresponding to words (step 1003). The method previously proposed by the present applicant in Japanese Patent Application No. 8-34781 may be used. That is, the block extraction unit 111 detects the circumscribed rectangles within the line data and groups them into block data, as follows. A histogram of the intervals between character rectangles is obtained (at this point one rectangle is not yet confirmed to be one character; for kanji, the left-hand and right-hand radicals are often separated into individual rectangles). FIG. 18 shows the extracted character rectangles and the distances between them. FIG. 19 shows the histogram of rectangle intervals.
[0029]
In this histogram, the peak at the shortest distance tends to correspond to the gap between the left-hand and right-hand radicals of a kanji and to the inter-character distance within a word set in a proportional Latin font. Merging across these gaps rarely brings different character types into one block, so block data is formed by merging them. This processing merges the characters of a proportionally spaced word, or the parts of a kanji that splits into separate rectangles (that is, one made up of a left-hand and a right-hand radical), into one block.
[0030]
The peak at the longest distance often corresponds to the distance between words or between a punctuation mark and the following character. These gaps (the inter-word gap in particular) often mark the boundary where the character type changes, so the rectangles on either side should not end up in the same block. Character rectangles separated by a distance at or above the longest-distance peak value are therefore not placed in the same block.
[0031]
Further, the distances (A, B) between the target rectangle and its two neighboring rectangles are measured. When their difference (A - B) is at or above a predetermined threshold, the rectangles separated by the longer distance are not merged, and the rectangles separated by the shorter distance are merged. FIG. 20 illustrates the case in which rectangles are not merged at positions where the difference in spacing is large. In FIG. 20, no merging is performed at positions where the difference is at or above the predetermined threshold, so three blocks are formed. With this processing, even when the inter-word distance in proportional English text is small in absolute terms, it should still differ from the inter-character distance, so a single word can be merged on its own. Also, even in a proportional font the kanji portions of Japanese text are spaced relatively evenly, which is convenient for grouping Japanese text as well.
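One simplified reading of the gap rules above is sketched below. It is not the method of the cited application, whose details are not given here: the function name is hypothetical, rectangles are reduced to horizontal extents `(x0, x1)` sorted left to right, `max_gap` stands in for the longest-distance peak value (taken as given rather than computed from a histogram), and every gap that is neither too wide nor much wider than a neighboring gap is merged.

```python
def group_into_blocks(rects, diff_threshold, max_gap):
    """Group rectangles of one line into word-like blocks.

    A gap starts a new block if it is at least max_gap, or if it exceeds
    an adjacent gap by at least diff_threshold; otherwise the rectangles
    on both sides of the gap are merged into the same block.
    """
    if not rects:
        return []
    gaps = [rects[i + 1][0] - rects[i][1] for i in range(len(rects) - 1)]
    blocks = [[rects[0]]]
    for i, gap in enumerate(gaps):
        neighbours = []
        if i > 0:
            neighbours.append(gaps[i - 1])
        if i + 1 < len(gaps):
            neighbours.append(gaps[i + 1])
        much_wider = any(gap - n >= diff_threshold for n in neighbours)
        if gap >= max_gap or much_wider:
            blocks.append([rects[i + 1]])    # do not merge across this gap
        else:
            blocks[-1].append(rects[i + 1])  # merge into the current block
    return blocks
```

With evenly spaced rectangles no gap is much wider than its neighbors, so the whole run stays in one block, which matches the remark about evenly spaced kanji.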
[0032]
With the block extraction method described above, English words, unlike text in a Japanese document, are separated from one another by a space of about half a character width, so they are kept from being mixed with other character types in the same block data.
[0033]
Subsequently, the in-block character type discrimination unit 904 performs Japanese/English discrimination for each block (step 1004). The method of the above-mentioned application may also be used here. That is, the unit determines whether each block formed by the above processing is Japanese or alphanumeric. All characters within a block are treated as the same character type. The determination is made as follows. A rectangle in a block is identified as Japanese when, relative to the width of the rectangle, the number of vertical black runs or the number of black/white inversions is at or above a predetermined threshold; alphabetic characters are identified from the vertical coordinate values of the rectangles in the extracted block. FIGS. 21(a) and 21(b) show specific examples of the number of vertical runs for Japanese and alphabetic characters. In the ideal, noise-free case, alphanumeric characters produce at most four runs, for the letter “g” (FIG. 21(b)). A rectangle for which five or more runs are counted is therefore taken to be Japanese. For the character 像 (“image”) shown in FIG. 21(a), the number of vertical runs varies as indicated by the numbers below the character.
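The vertical run count can be sketched as follows. This is a simplification: it takes the maximum number of black runs over all pixel columns of a glyph and applies the "five or more" rule, whereas the text relates the run count to the rectangle width in a way it does not fully specify. The function names and the 0/1 row-list glyph format are assumptions.

```python
def max_vertical_runs(glyph):
    """glyph: list of rows of 0/1 (1 = black). Return the largest number
    of separate black runs found in any single column."""
    if not glyph:
        return 0
    best = 0
    for x in range(len(glyph[0])):
        runs = 0
        previous = 0
        for row in glyph:
            if row[x] and not previous:  # a white-to-black transition starts a run
                runs += 1
            previous = row[x]
        best = max(best, runs)
    return best


def looks_japanese(glyph, run_threshold=5):
    """Five or more vertical runs in some column suggests a Japanese character."""
    return max_vertical_runs(glyph) >= run_threshold
```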
[0034]
The Japanese / English discriminating means 905 tabulates the discrimination results for each block and discriminates between Japanese and English for the area (step 1005). Here, the number of blocks determined to be Japanese is JCNT, the number of blocks determined to be English is ECNT, and the number of blocks determined to be indefinite is NCNT. FIG. 11 is a detailed flowchart of step 1005. If JCNT * Th3 > ECNT, the area is determined to be Japanese (steps 1101 and 1105). Otherwise, if ECNT > JCNT, it is determined to be English (steps 1102 and 1104). Otherwise, it is rejected (step 1103). The threshold value Th3 is set to 2, for example.
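The decision of step 1005 can be written as a short function. This is a sketch under the stated rule only; the function name and the string labels are illustrative assumptions.

```python
def judge_by_block_counts(jcnt, ecnt, th3=2):
    """Decide Japanese / English / reject from per-block counts.

    jcnt: number of blocks judged Japanese
    ecnt: number of blocks judged English
    th3:  threshold Th3 (the text gives 2 as an example)
    """
    if jcnt * th3 > ecnt:
        return "japanese"
    if ecnt > jcnt:
        return "english"
    return "reject"
```

With Th3 = 2 the rule is biased toward Japanese: English is chosen only when English blocks are at least twice as numerous as Japanese blocks.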
[0035]
<Example 5>
In Example 4 described above, Japanese / English is determined in units of character areas. Depending on the character area, however, the number of characters may be very small. In such a case, a sufficient number of rectangles cannot be obtained, so it may be difficult to perform Japanese-English determination from the ratio of the per-block discrimination results. The fifth embodiment addresses the case where the number of blocks is not sufficient.
[0036]
FIG. 12 shows a process flowchart of the fifth embodiment. The Japanese / English discriminating means 105 checks whether or not the total number of blocks in the character area is sufficient (that is, whether or not it is greater than or equal to a predetermined threshold Th) (step 1201). If the number of blocks is not sufficient, Japanese-English discrimination using the OCR described in Japanese Patent Laid-Open No. 6-150061 is performed (step 1203). In this case, since the number of characters is small, the increase in processing time remains small even though OCR processing is performed. If the number of blocks is sufficient, Japanese and English are identified based on the discrimination result for each block described in the fourth embodiment (step 1202).
[0037]
<Example 6>
In the sixth embodiment, the Japanese / English discrimination for each character area in the fourth embodiment is changed to Japanese / English discrimination in page units. The processing flowchart of the sixth embodiment uses FIGS.
[0038]
In the processing shown in FIG. 6, JCNT, ECNT, and NCNT are aggregated not for each character area but for the entire page, and the result is used to determine Japanese / English using the processing method shown in FIG. At this time, Th3 may be different from that in the character area unit.
[0039]
In the processing shown in FIG. 7, first, each character area is determined, and Japanese-English determination of the page is performed from the result. Specifically, Jn is the number of areas determined to be Japanese, and En is the number of areas determined to be English. If Jn> En, it is determined to be a Japanese page, and if En> Jn, it is determined to be an English page. When Jn = En, it may be rejected or it may be either Japanese or English.
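The page-level vote over character areas can be sketched as below. The function name, the label strings, and the `tie` parameter are assumptions introduced for illustration; the text leaves the handling of Jn = En open.

```python
def judge_page(region_results, tie="reject"):
    """Majority vote over per-area results.

    region_results: iterable of "japanese", "english", or other labels
    tie: what to return when Jn == En (reject, or either language)
    """
    results = list(region_results)
    jn = sum(1 for r in results if r == "japanese")
    en = sum(1 for r in results if r == "english")
    if jn > en:
        return "japanese"
    if en > jn:
        return "english"
    return tie
```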
[0040]
<Example 7>
In the seventh embodiment, when performing Japanese-English discrimination for each character area or page, as shown in FIG. 13, both a Japanese-English discrimination process using the rectangle length (step 1301) and a Japanese-English discrimination process using the discrimination result for each block (step 1302) are performed. Then, the final Japanese-English discrimination is made from the two discrimination results (step 1303).
[0041]
If both processes determine Japanese, or both determine English, that determination is used as the final result as it is. If one of them is rejected, the determination result that is not rejected is the final result.
[0042]
If one of the determination results is Japanese and the other is English, so that the results do not match, one of the following determinations is made.
(1) Reject.
(2) The certainty factor of each result is calculated, and the result with the larger value is adopted.
As the certainty factor of the discrimination method using the rectangle length, for example:
when LCNT / (NCNT + SCNT) > Th1 with Th1 = 0.3, the value LCNT / (NCNT + SCNT) * 2.5 (upper limit 1);
when NCNT / (LCNT + SCNT) < Th2 with Th2 = 3, the value (LCNT + SCNT) / NCNT * 2.5 (upper limit 1);
when NCNT / (LCNT + SCNT) > Th2 with Th2 = 3, the value NCNT / (LCNT + SCNT) * 0.33 (upper limit 1)
is used.
[0043]
As the certainty factor of the discrimination method using the per-block results, for example:
when JCNT * Th3 > ECNT with Th3 = 2, the value JCNT / (ECNT * 3) (upper limit 1);
when ECNT > JCNT, the value ECNT / JCNT * 0.7 (upper limit 1)
is used.
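The combination of the two discrimination results can be sketched as follows. Only the per-block certainty factor is implemented here; the function names, the (label, certainty) tuple convention, the handling of zero denominators, and the choice to reject on equal certainty are all assumptions of this sketch rather than details given in the text.

```python
def block_certainty(jcnt, ecnt, th3=2):
    """Return (label, certainty) for the per-block discrimination method."""
    if jcnt * th3 > ecnt:
        # ecnt == 0 would divide by zero; treated as full certainty (assumption).
        value = 1.0 if ecnt == 0 else jcnt / (ecnt * 3)
        return "japanese", min(1.0, value)
    if ecnt > jcnt:
        # jcnt == 0 would divide by zero; treated as full certainty (assumption).
        value = 1.0 if jcnt == 0 else ecnt / jcnt * 0.7
        return "english", min(1.0, value)
    return "reject", 0.0


def combine(result_a, result_b):
    """Merge two (label, certainty) results into one final label."""
    label_a, conf_a = result_a
    label_b, conf_b = result_b
    if label_a == label_b:
        return label_a
    if label_a == "reject":
        return label_b
    if label_b == "reject":
        return label_a
    if conf_a == conf_b:
        return "reject"  # assumption: equal certainty is treated as option (1)
    return label_a if conf_a > conf_b else label_b
```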
[0044]
<Example 8>
FIG. 14 shows a configuration of the eighth embodiment. FIG. 15 shows a process flowchart of the eighth embodiment. In this embodiment, the Japanese / English discriminating unit 1412 identifies whether the entire page of the input document is in Japanese or English using the method of Embodiments 3 and 6 described above (step 1502). The selection unit 1403 selects the English document recognition unit 1404 or the Japanese document recognition unit 1405 based on the determination result, the document recognition processing of the selected language is performed (steps 1504 and 1505), and the recognition result is output to an output unit such as a display (step 1506).
[0045]
Since Japanese and English have different attributes, it may be better to switch between region division processing and font identification processing. Therefore, the document recognition unit of the present embodiment includes not only the character recognition processing but also the above-described region division processing and font identification processing.
[0046]
<Example 9>
FIG. 16 shows the configuration of the ninth embodiment, and FIG. 17 shows a process flowchart of the ninth embodiment. The difference from the eighth embodiment is that Japanese-English identification is performed for each character area. For this purpose, the area dividing unit 1602 divides the input document into character areas (steps 1701 and 1702). Here, the area dividing unit uses an area dividing method that can be applied to both Japanese and English. After the division processing, the Japanese-English discriminating unit 1603 performs Japanese-English discriminating processing for each character area using, for example, the method of the first embodiment described above (step 1704). Based on the result, the English document recognition unit 1605 or the Japanese document recognition unit 1606 is selected, document recognition processing of the selected language is performed (steps 1705 and 1706), and the recognition result is output to the output unit 1607 such as a display (step 1707). Note that the document recognition unit of the ninth embodiment performs font identification processing in addition to document recognition processing.
[0047]
<Example 10>
In each of the above-described embodiments, Japanese and English are determined using black pixel connected components and the rectangle length as feature amounts. However, the determination method using black pixel connected components takes a long processing time, and the method using the rectangle length may produce more rejections. There is also a method of discriminating Japanese from English based on the peak positions of the frequency distribution of the relative vertical positions of circumscribed rectangles within a line (see Japanese Patent Publication No. 7-21817), but when a skewed document is input, the frequency distribution changes greatly and the identification accuracy decreases.
[0048]
Therefore, in the present embodiment, Japanese and English are identified accurately and quickly for each region of the document image by using a histogram of the height of the circumscribed rectangles in a line relative to the line height. For areas that cannot be identified by this method, Japanese-English identification is performed using another method.
[0049]
FIG. 22 shows a configuration of the tenth embodiment. FIG. 23 is an overall process flowchart of the tenth embodiment. First, the image input unit 2201 obtains a document image by reading a document (step 2301). The image input means is, for example, a scanner or a fax machine; alternatively, an image may be obtained from another device over a network via the data communication means 2207.
[0050]
Next, the area generation unit 2202 generates a character area (step 2302). As this region generation method, for example, a method described in Japanese Patent Laid-Open No. 6-20092 may be used. Next, the line cutout means 2203 cuts out a line for character recognition from the character area. In other words, the circumscribed rectangles of the characters are obtained and integrated to generate a line (step 2303). The Japanese-English identification means 2204 performs Japanese-English identification on the generated character area (step 2304).
[0051]
Japanese and English are identified as follows. FIG. 27 is a flowchart showing details of Japanese-English identification (step 2304). FIG. 24 shows an example of a cut-out line and the circumscribed rectangles in the line. First, the frequency distribution of the ratio of the circumscribed rectangle height in the line to the line height is calculated (steps 2701 and 2702). The line height is denoted lineheight and the rectangle height is denoted height, and the ratio is heightrate = height * 100 / lineheight. In the case of a skewed document as shown in FIG. 25, the maximum height of the rectangles in the line may be used as lineheight instead of the line height in order to identify Japanese and English more accurately. That is, for a skewed input document, Japanese and English are identified based on a histogram of the ratio of each circumscribed rectangle height in the line to the maximum height of the rectangles in the line.
[0052]
For example, let lcnt be the number of rectangles whose heightrate is 80 or more, ncnt the number of rectangles whose heightrate is 70 or more and less than 80, and scnt the number of rectangles whose heightrate is 40 or more and less than 70. lcnt, ncnt, and scnt are obtained for all rectangles in the character area.
[0053]
FIG. 26 shows an example of the number of rectangles examined for Japanese and English documents. In general, Japanese has a large lcnt, and English tends to have a large scnt. Therefore, predetermined thresholds thJ and thE are set. When lcnt / scnt> thJ, it is determined that the language is Japanese (step 2703), and when lcnt / scnt <thE, it is determined that the language is English (step 2704). Otherwise, it is set as an unknown area (step 2705).
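The height-ratio counting and the threshold decision of steps 2701 to 2705 can be sketched as below. The function names and the input layout are assumptions, the thresholds thJ and thE are passed in because the text gives no values for them, and the handling of scnt = 0 is an assumption of this sketch.

```python
def count_height_classes(lines):
    """Count rectangles by heightrate over all lines of a character area.

    lines: list of (lineheight, [rectangle heights]) pairs, where lineheight
    may be the line height or the maximum rectangle height in the line.
    """
    lcnt = ncnt = scnt = 0
    for lineheight, heights in lines:
        for height in heights:
            heightrate = height * 100 / lineheight
            if heightrate >= 80:
                lcnt += 1
            elif heightrate >= 70:
                ncnt += 1
            elif heightrate >= 40:
                scnt += 1
            # rectangles below 40 are not counted in any class
    return lcnt, ncnt, scnt


def judge_region(lcnt, scnt, th_j, th_e):
    """Apply lcnt / scnt > thJ -> Japanese, < thE -> English, else unknown."""
    if scnt == 0:
        # Division is undefined; this fallback is an assumption of the sketch.
        return "japanese" if lcnt > 0 else "unknown"
    ratio = lcnt / scnt
    if ratio > th_j:
        return "japanese"
    if ratio < th_e:
        return "english"
    return "unknown"
```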
[0054]
The above unknown area can be identified using a statistical method. FIG. 28 is a detailed process flowchart for the unknown area. For example, the feature values lcnt, ncnt, and scnt of the Japanese region and English region are normalized in advance, and the average value and the inverse matrix of the covariance matrix are obtained for Japanese and English, respectively. Then, using the average value and the inverse matrix of the covariance matrix, the Mahalanobis distance is obtained for each of Japanese and English (steps 2801 and 2802).
[0055]
Let Dj be the Mahalanobis distance to Japanese and De the Mahalanobis distance to English, and let Me and Mj be predetermined thresholds. When Dj / De > Me, the area is judged to be English (step 2803), and when Dj / De < Mj, it is judged to be Japanese (step 2804). If neither condition is satisfied, it is determined to be an unknown area (step 2805). In place of the Mahalanobis distance described above, the Euclidean distance from the average value or the city block distance may be used.
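The statistical step for unknown areas can be sketched as follows, using plain Python lists. The function names and the (mean, inverse covariance) model layout are assumptions; the means, matrices, and thresholds Me and Mj would come from training data that the text does not give, and the De = 0 case is handled by an assumption of this sketch.

```python
import math


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of feature vector x from a class model."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    total = 0.0
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * inv_cov[i][j] * dj
    return math.sqrt(total)


def judge_unknown(x, jp_model, en_model, m_e, m_j):
    """jp_model / en_model are (mean, inverse covariance matrix) pairs."""
    d_j = mahalanobis(x, *jp_model)
    d_e = mahalanobis(x, *en_model)
    if d_e == 0:
        return "english"  # assumption: x lies exactly on the English mean
    ratio = d_j / d_e
    if ratio > m_e:
        return "english"
    if ratio < m_j:
        return "japanese"
    return "unknown"
```

Replacing `mahalanobis` with a Euclidean or city block distance, as the text allows, only changes the distance function.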
[0056]
Furthermore, Japanese-English identification is performed on the area determined to be unknown using the certainty factor of English recognition. FIG. 29 is a detailed process flowchart of step 2805. A certainty factor is calculated by English recognition (step 2901). Next, from the calculated certainty factors, for example, the number of words with a certainty factor of 60% or more is counted as Good, the number of words with a certainty factor less than 60% but not zero is counted as Bad, and the number of words with a certainty factor of zero is counted as Zero (step 2902).
[0057]
The judgment value for Japanese-English identification, Value, is calculated as Value = Good / (Good + Bad + Zero) (step 2903). If Value exceeds a predetermined threshold th_eocr (step 2904), the area is determined to be English; otherwise it is determined to be Japanese.
[0058]
Note that Zero may be weighted. For example, if one Zero is counted as three Bad, then Bad = Bad + Zero × 3 and Value = Good / (Good + Bad).
The area is then determined to be English if Value exceeds the threshold th_eocr, and Japanese otherwise. As described above, even in an area with too few characters for the other Japanese-English determinations, identification is performed using the certainty factor of English recognition, so that Japanese-English identification is performed in units of areas with high accuracy.
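The OCR-based judgment value can be sketched as below. The function names and the representation of certainty factors as percentages are assumptions; the threshold is passed in as `th_eocr` because the text gives no value for it, and `zero_weight=3` reproduces the weighting example.

```python
def english_score(confidences, zero_weight=1):
    """Compute Value = Good / (Good + Bad + Zero * zero_weight).

    confidences: per-word certainty factors from English recognition, in percent.
    """
    good = sum(1 for c in confidences if c >= 60)
    bad = sum(1 for c in confidences if 0 < c < 60)
    zero = sum(1 for c in confidences if c == 0)
    denominator = good + bad + zero * zero_weight
    # An empty word list has no evidence for English (assumption of the sketch).
    return good / denominator if denominator else 0.0


def judge_by_ocr(confidences, th_eocr, zero_weight=1):
    value = english_score(confidences, zero_weight)
    return "english" if value > th_eocr else "japanese"
```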
[0059]
<Example 11>
In this embodiment, circumscribed rectangles are generated from an image obtained by reducing the input document image, the generated rectangles are appropriately merged, and Japanese-English identification is performed with higher accuracy by using a histogram of the aspect ratio of the rectangles after merging.
[0060]
FIG. 30 shows the configuration of the eleventh embodiment. FIG. 31 is an overall process flowchart of the eleventh embodiment. The document image input by the image input unit 3001 is reduced by the image reduction unit 3002 in the same manner as in the above-described embodiment (Steps 3101 and 3102). In this process, for example, the document image is OR-compressed to about ¼ (4 × 4 pixels are reduced to one pixel, and if there is at least one black pixel in 16 pixels, the reduced image is black).
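The OR compression described above can be sketched as follows. The function name and the list-of-rows image layout (0 for white, 1 for black) are assumptions; partial blocks at the right and bottom edges are reduced from whatever pixels they contain.

```python
def or_reduce(image, factor=4):
    """Reduce a binary image so each factor x factor block becomes one pixel.

    The output pixel is black (1) if at least one pixel in the block is black.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block_has_black = any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            )
            row.append(1 if block_has_black else 0)
        reduced.append(row)
    return reduced
```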
[0061]
Next, the area generation means 3003 generates a character area (step 3103). As this region generation method, for example, a method described in Japanese Patent Laid-Open No. 6-20092 may be used. Subsequently, the rectangle integration means 3004 merges the rectangles so that the characteristics of Japanese and English are well expressed (step 3104). For example, as shown in FIG. 32, when the y-coordinates (vertical direction) of rectangles 1 and 2 are close and the x-coordinates of the adjacent rectangles 1 and 2 are very close (for example, when the horizontal distance between the rectangles is less than the distance corresponding to an English space), the rectangles are merged. Also, for example, as shown in FIG. 33, when the left rectangle 1 contains the right rectangle 2 in the y-coordinate range and the x-coordinates of the adjacent rectangles 1 and 2 are very close (for example, when the horizontal distance between the rectangles is less than the distance corresponding to an English space), the rectangles are merged.
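The two merge conditions can be sketched as a single predicate. The rectangle layout (x0, y0, x1, y1), the function names, and the parameters `space_width` and `y_tolerance` are assumptions of this sketch; the text only says the coordinates must be "close" and the gap smaller than an English space.

```python
def should_merge(left, right, space_width, y_tolerance):
    """Decide whether two horizontally adjacent rectangles are merged.

    left, right: rectangles as (x0, y0, x1, y1), with left to the left of right.
    """
    gap = right[0] - left[2]
    if gap >= space_width:
        return False  # at least an English space apart: keep separate
    # Case of FIG. 32: top and bottom edges are close to each other.
    y_close = (abs(left[1] - right[1]) <= y_tolerance
               and abs(left[3] - right[3]) <= y_tolerance)
    # Case of FIG. 33: the left rectangle vertically contains the right one.
    contains = left[1] <= right[1] and right[3] <= left[3]
    return y_close or contains


def merge(a, b):
    """Return the bounding rectangle of a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
```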
[0062]
Then, using the aspect ratio of each rectangle (horizontal rectangle length / vertical rectangle length), the rectangles are classified into four feature amounts: long rectangles, medium rectangles, small rectangles, and minimal rectangles (FIG. 34). In general, Japanese has a high ratio of long rectangles, and English has a high ratio of medium rectangles. Using this difference in characteristics, the Japanese-English discriminating means 3005 creates an identification judgment formula and performs Japanese-English discrimination (step 3105). FIG. 35 is a detailed flowchart of the Japanese-English identification process.
[0063]
For example, the following are calculated (step 3501):
the number of long rectangles in the area, lcnt;
the number of medium rectangles in the area, ncnt;
the number of small rectangles in the area, scnt;
the number of minimal rectangles in the area, sscnt (in most cases noise).
Then the ratio of long rectangles in the area, ratio1 = lcnt / (ncnt + scnt), is calculated (step 3502), and the ratio of medium rectangles in the area, ratio2 = ncnt / (lcnt + scnt), is calculated (step 3503). In calculating these ratios, sscnt is ignored as noise.
[0064]
Then, with ratio1 as the x-coordinate and ratio2 as the y-coordinate, the plane is divided into a Japanese area, an English area, and a rejection area so that misidentification is reduced as much as possible, and the portion where Japanese and English overlap is rejected. For example, if ratio2 / ratio1 > thE, the region is determined to be an English region (step 3504); if ratio2 / ratio1 < thJ, it is determined to be a Japanese region (step 3505); otherwise it is determined to be Japanese/English unknown (step 3506). Here, thE and thJ are predetermined threshold values.
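Steps 3502 to 3506 can be sketched as below. The function name and label strings are assumptions, the thresholds thE and thJ are passed in because the text gives no values, and returning "unknown" when a ratio cannot be computed is a choice made for this sketch.

```python
def judge_by_aspect_counts(lcnt, ncnt, scnt, th_e, th_j):
    """Classify a region from counts of long, medium, and small rectangles.

    Minimal rectangles (sscnt) are treated as noise and are not passed in.
    """
    if ncnt + scnt == 0 or lcnt + scnt == 0 or lcnt == 0:
        return "unknown"  # a ratio is undefined (assumption of this sketch)
    ratio1 = lcnt / (ncnt + scnt)  # share of long rectangles
    ratio2 = ncnt / (lcnt + scnt)  # share of medium rectangles
    quotient = ratio2 / ratio1
    if quotient > th_e:
        return "english"
    if quotient < th_j:
        return "japanese"
    return "unknown"
```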
[0065]
As in the tenth embodiment, a region determined to be Japanese/English unknown is identified using a statistical method. For example, the feature values lcnt, ncnt, and scnt of Japanese regions and English regions are normalized in advance, and the average value and the inverse of the covariance matrix are obtained for Japanese and English, respectively. The Mahalanobis distance is then obtained for each of Japanese and English using the average value and the inverse of the covariance matrix. Let Dj be the Mahalanobis distance to Japanese and De the Mahalanobis distance to English, and let Me and Mj be predetermined thresholds. If Dj / De > Me, the region is determined to be English, and if Dj / De < Mj, it is determined to be Japanese. If neither condition is satisfied, it is determined to be unknown. In place of the Mahalanobis distance, the Euclidean distance from the average value or the city block distance may be used.
[0066]
<Example 12>
The present invention is not limited to the above-described embodiments, and can also be realized by software. When the present invention is realized by software, as shown in FIG. 36, a computer system including a CPU, a memory, a display device, a hard disk, a keyboard, a CD-ROM drive, a scanner, and the like is prepared, and a program for realizing the Japanese / English determination function and the document recognition function of the present invention is recorded on a computer-readable recording medium such as a CD-ROM. A document image or the like input from image input means such as a scanner is temporarily stored on a hard disk or the like. When the program is started, the temporarily stored document image data is read, Japanese / English determination processing and document recognition processing are executed, and the results are output to a display or the like.
[0067]
[Effect of the invention]
As described above, according to the present invention, since a plurality of determination methods are used in combination, Japanese and English can be determined with high accuracy. Further, Japanese and English can be accurately distinguished for each character area in the document image, and Japanese and English can be accurately distinguished for each page of the document image. Furthermore, since an appropriate document recognition process is performed on the document image determined to be Japanese or English, a highly accurate recognition result can be obtained.
[Brief description of the drawings]
FIG. 1 shows a configuration of Embodiment 1 of the present invention.
FIG. 2 shows an overall process flowchart of Embodiment 1 of the present invention.
FIG. 3 shows examples of English and Japanese images and their circumscribed rectangles.
FIG. 4 shows a processing flowchart of Japanese-English determination in Example 1.
FIG. 5 shows a processing flowchart of Embodiment 2.
FIG. 6 shows a first detailed flowchart of step 205 according to Embodiment 3.
FIG. 7 shows a second detailed flowchart of step 205 according to Embodiment 3.
FIG. 8 shows a detailed flowchart of step 403.
FIG. 9 shows a configuration of Example 4.
FIG. 10 is a flowchart illustrating a process according to the fourth embodiment.
FIG. 11 is a detailed flowchart of step 1005.
FIG. 12 shows a process flowchart of Embodiment 5.
FIG. 13 is a flowchart illustrating a process according to the seventh embodiment.
FIG. 14 shows a configuration of Example 8.
FIG. 15 is a flowchart illustrating a process according to the eighth embodiment.
FIG. 16 shows the configuration of Example 9.
FIG. 17 shows a process flowchart of Embodiment 9.
FIG. 18 shows extracted character rectangles and the distance between the rectangles.
FIG. 19 shows a histogram of rectangular intervals.
FIG. 20 is a diagram illustrating a case where rectangles are not merged at a position where the difference in spacing between rectangles is large.
FIGS. 21A and 21B show specific examples of the number of vertical runs in the case of Japanese and English characters.
FIG. 22 shows the configuration of Example 10.
FIG. 23 is an overall process flowchart of Embodiment 10.
FIG. 24 shows an example of a cut-out line and the circumscribed rectangles in the line.
FIG. 25 shows an example of a line when a document is tilted and the circumscribed rectangles in the line.
FIG. 26 shows an example of the number of rectangles examined for a Japanese document and an English document.
FIG. 27 is a detailed processing flowchart of Japanese-English identification (step 2304).
FIG. 28 is a detailed process flowchart for an unknown area.
FIG. 29 is a detailed process flowchart of step 2805.
FIG. 30 shows the structure of Example 11.
FIG. 31 is an overall process flowchart of Embodiment 11.
FIG. 32 shows an example of integrating rectangles.
FIG. 33 shows another example of integrating rectangles.
FIG. 34 shows rectangles classified into four types.
FIG. 35 is a detailed processing flowchart of Japanese-English identification processing according to the eleventh embodiment.
FIG. 36 shows the configuration of Example 12.
[Explanation of symbols]
101 Image input means
102 Image reduction means
103 Connected component extraction means
104 Area generation means
105 Japanese-English discrimination means
106 Control part
107 Data storage part
108 Data communication path
109 Data communication means

Claims

A Japanese / English determination method for a document image, for determining whether each character region in the document image is a Japanese region or an English region, the method comprising: cutting out lines from each character region; calculating the frequency of rectangles for which the ratio of the height of each rectangle in a line to the maximum height of the rectangles in the line is high (hereinafter referred to as a first frequency number) and the frequency of rectangles for which the ratio of the height of each rectangle in the line to the maximum height of the rectangles in the line is low (hereinafter referred to as a second frequency number); determining that the character region is a Japanese region when the first frequency number / the second frequency number exceeds a predetermined first threshold; determining that the character region is an English region when the first frequency number / the second frequency number is less than a predetermined second threshold; otherwise determining that the character region is an unknown region; and, for the unknown region, determining that it is a Japanese region when it is close to a pre-calculated Japanese characteristic value and determining that it is an English region when it is close to a pre-calculated English characteristic value.

A computer-readable recording medium recording a program for causing a computer to implement the method for determining Japanese / English of a document image according to claim 1 .