JPH11338977A - Method and device for character processing and storage medium

Info

Publication number: JPH11338977A
Application number: JP10147619A
Authority: JP
Inventors: Tomotoshi Kanatsu; 知俊金津
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1998-05-28
Filing date: 1998-05-28
Publication date: 1999-12-10

Abstract

PROBLEM TO BE SOLVED: To achieve a high recognition rate without requiring the operator to identify and specify the language type, by discriminating the language type of each character area and applying the character recognition processing suited to that area. SOLUTION: After preprocessing such as noise elimination and skew correction, area division processing extracts rectangular areas such as text blocks, figure blocks, and picture blocks from one page of image data, and area information indicating the position of each area and attribute information indicating its classification are stored in a memory 102. Each text block is subjected to language type discrimination processing, and the result is stored in the memory 102 as the language attribute of the block, in association with the block's area information and attribute information. When the language attribute is English, the image in the block is recognized by a character recognition device for English; otherwise, it is recognized by a character recognition device for Japanese.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a character processing method and apparatus, and a storage medium, for character recognition of document images in which different languages are mixed.

[0002]

[Description of the Related Art] An ordinary character recognition device basically targets a single language. For example, character recognition has been performed with a Japanese character recognition device for Japanese documents and with an English character recognition device for English documents.

[0003] An English document here means a document containing only the characters used in English, that is, alphabetic letters, numerals, and symbols; an English character recognition device can recognize only those characters.

[0004] Japanese documents, on the other hand, rarely consist solely of characters unique to Japanese such as kana and kanji; they generally also contain alphabetic letters, numerals, and the like. In other words, even a character recognition device for Japanese is required to recognize English characters such as the alphabet.

[0005] For English documents, however, a Japanese character recognition device, which covers a much wider range of characters, undeniably has lower recognition performance than an English character recognition device that assumes in advance that the input is English. When a character recognition device must recognize Japanese and English documents in random order, or must recognize a document that is mainly Japanese but contains some English text blocks, using only a Japanese character recognition device for all of them makes the low recognition performance on the English portions a problem.

[0006] To eliminate this drawback, some character recognition devices contain both an English character recognition unit and a Japanese character recognition unit. The operator decides by inspection, for each document or each part of a document, whether it is Japanese or English, and manually specifies the areas and their language types. The Japanese text portions are then recognized with the Japanese character recognition unit and the English text portions with the English character recognition unit. Operation becomes cumbersome, but a higher recognition rate is obtained.

[0007]

[Problems to Be Solved by the Invention] In the above example, however, high recognition performance cannot be obtained unless the operator, before recognition, correctly specifies the areas and selects the recognition unit to be used for each area. The effort is particularly large when recognizing a document in which text areas in different languages are mixed, and it is a major obstacle to full automation, which greatly impairs the practicality of the character recognition device.

[0008]

[Means for Solving the Problems] To solve the above problems, the present invention provides a character processing method and apparatus, and a storage medium, that extract a plurality of character areas from an image, determine the language type of each extracted character area, and perform, separately for each character area, the character recognition processing suited to that area according to the determined language type.

[0009] To solve the above problems, the present invention preferably extracts characters from the image in the character area and determines the language type of the character area based on the width of each character.

[0010] To solve the above problems, the present invention preferably obtains the average character width for each line and determines the language type based on a comparison with a predetermined threshold.

[0011] To solve the above problems, the present invention preferably compares the per-line average character width with the threshold and determines the language type based on the ratio of the number of lines whose average exceeds the threshold to the total number of lines.

[0012] To solve the above problems, the present invention preferably obtains, for the pitch and the spacing of the characters in a line, the mode from the peak of each distribution, and uses the difference between the two modes as the per-line average value.

[0013] To solve the above problems, the present invention preferably extracts characters from the image in the character area and determines the language type of the character area based on the number of horizontal lines contained in each character.

[0014] To solve the above problems, the present invention preferably divides the image of each extracted character into strips two dots wide, scans each strip vertically, and determines the number of horizontal lines from the count of a specific arrangement pattern.

[0015] To solve the above problems, the present invention preferably extracts lines from the image in the character area and determines the language type of the character area based on the reference lines of each line.

[0016] To solve the above problems, the present invention preferably uses, as the reference lines of a line, lines representing the upper and lower ends of the characters contained in the line.

[0017] To solve the above problems, the present invention preferably extracts characters from the image in the character area, divides the image of each character into strips of fixed width, and determines the reference lines of each line from the distribution of the upper and lower ends of the black pixels detected in each strip.

[0018] To solve the above problems, the present invention preferably provides a plurality of methods for determining the language type.

[0019] To solve the above problems, the present invention preferably determines the line direction of the extracted character area and, if the determined line direction is a specific direction, assigns a predetermined language type without performing the language type determination.

[0020] To solve the above problems, the present invention preferably stores information representing the position of the extracted character area in association with information representing the determined language type.

[0021] To solve the above problems, the present invention preferably outputs the image information and the character recognition result by the same output device.

[0022] To solve the above problems, in the present invention one of the language types to be determined is preferably Japanese.

[0023] To solve the above problems, in the present invention one of the language types to be determined is preferably English.

[0024]

[Embodiments of the Invention] FIG. 1 is a block diagram of an apparatus according to the present invention.

[0025] In FIG. 1, reference numeral 101 denotes a CPU (central processing unit), which controls the processing according to the present invention in accordance with a control program stored in a memory 102. Each step of the flowcharts described later is also executed by the CPU 101. The memory 102 is a storage device comprising RAM, ROM, a hard disk, and the like, and stores the control program for the CPU 101, various parameters, input image data, character recognition dictionaries, and so on. Reference numeral 103 denotes an external storage medium, such as an optical disk, magnetic disk, magneto-optical disk, or magnetic tape, that is detachable from the apparatus; the programs and data stored in the memory 102 are read from this external storage medium, and processing results are output to it. Reference numeral 104 denotes a communication I/F for exchanging data with a remote party via a network or public line; the programs and data stored in the memory 102, and the processing results, may also be input and output via this communication I/F. Reference numeral 105 denotes input means such as a keyboard and a pointing device, which conveys the operator's instructions. Instructions for inputting an image, starting character recognition, specifying where to store the recognized text, and so on are entered through the input means 105. Reference numeral 106 denotes a scanner that optically reads an original and inputs it to the apparatus as an electric signal. Reference numeral 107 denotes a display device such as a CRT or liquid crystal display, which displays the text of the processing result and also serves as one of the interfaces for operator operation. Reference numeral 108 denotes a printer such as an LBP or an inkjet printer, which prints the text of the processing result on paper using fonts.

[0026] FIG. 2 is a functional block diagram of the processing of the character recognition device according to the first embodiment of the present invention. Reference numeral 201 denotes an image input unit that reads a document original as image data and stores the image data in the memory 102 via the scanner 106, the external storage medium 103, or the communication I/F 104. Reference numeral 202 denotes an area division unit that separates the image data of a one-page original stored in the memory 102, in which text, figures, and tables are mixed, into text blocks, figure blocks, table blocks, image blocks, and the like. (FIG. 10(A) shows the image of an original, and FIG. 10(B) illustrates the area data obtained by the area division unit 202 together with the type of each area.) Reference numeral 203 denotes a language determination unit that determines, for each text block to be recognized, whether the text in the block is Japanese or English. Reference numeral 204 denotes an English character recognition unit, which recognizes only the characters that appear in English text, such as alphabetic letters, numerals, and symbols. Reference numeral 205 denotes a Japanese character recognition unit, which recognizes all characters that appear in Japanese text, including English characters, kanji, and kana. The text data obtained by character recognition is rendered in fonts on the display and then output to a file, a computer on the network, or a printer.

[0027] The overall processing according to the embodiment of the present invention is described with reference to the flowchart of FIG. 3.

[0028] In S301, an image is read via the scanner 106, the external storage medium 103, or the communication I/F 104 and stored in the memory 102. In S302, preprocessing such as noise removal and skew correction is performed. In S303, area division processing extracts rectangular areas such as text blocks, figure blocks, photograph blocks, table blocks, and image blocks from the one-page image data, and area information representing the position of each area and attribute information representing its type (text, figure, photograph, table, image, etc.) are stored in the memory 102. In S304, the language type of each text block among these blocks is determined, and the result is stored in the memory 102 as the language attribute of the block, in association with the block's area information and attribute information. In S305, character recognition is performed on each text block. In S306, the character recognition results are output together with layout data consisting of block data representing each divided rectangle (including the area information, attribute information, and language attribute information).

[0029] The character recognition processing in step S305 of FIG. 3 is described in detail with reference to the flowchart of FIG. 4. First, a block counter B, which tracks the block being processed in the flowchart of FIG. 4, is initialized to 1.

[0030] In S401, the block data of the text block identified by the counter B is read, and in S402 the language attribute of the block is examined. If the language attribute is English, the process proceeds to S403, where the image in the block is recognized using the English character recognition device. If the language attribute is anything other than English, that is, Japanese or unknown, recognition is performed using the Japanese character recognition device. In S404, if the counter B equals the number of text blocks, all text blocks have been processed and the character recognition processing ends. If text blocks remain, the counter B is incremented by 1 in S406, the process returns to S401, and S401 to S404 are repeated for the next text block.
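The per-block dispatch of FIG. 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary keys `language` and `image`, the attribute strings, and the two recognizer callables are hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def recognize_blocks(
    blocks: List[Dict[str, str]],
    recognize_english: Callable[[str], str],
    recognize_japanese: Callable[[str], str],
) -> List[str]:
    """Run the recognizer that matches each text block's language attribute."""
    results = []
    for block in blocks:  # corresponds to stepping the block counter B
        if block["language"] == "english":  # S402: examine the language attribute
            results.append(recognize_english(block["image"]))  # S403
        else:
            # Japanese and unknown blocks both go to the Japanese recognizer.
            results.append(recognize_japanese(block["image"]))
    return results
```

Routing unknown blocks to the Japanese recognizer follows the text above: the Japanese recognizer also covers English characters, so it is the safer fallback.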

[0031] The language type determination processing in step S304 of FIG. 3 is described in detail with reference to the flowchart of FIG. 5. This processing is performed on every block judged to be text in S303. The flowchart of FIG. 5 describes the processing performed on a single block.

[0032] In S501, the block data of the text block to be processed is read, and in S502 it is determined whether the text is written vertically or horizontally. The determination compares the variances of the vertical and horizontal projections of the black pixels in the text block; if the horizontal variance is larger, the text is judged to be vertical. Text judged to be vertical cannot be English, so the process proceeds to S503, where the language attribute of the text block is set to Japanese, the language attribute information is stored in the memory 102, and the processing ends.
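The S502 test can be sketched as below. The text does not define precisely which profile it calls "horizontal", so this sketch assumes it is the black-pixel profile measured along the horizontal axis (one sum per column). Under that assumption, vertically written text, whose columns are separated by white gaps, gives the larger variance. The function name and the 0/1 list-of-rows image format are also assumptions.

```python
from statistics import pvariance


def is_vertical_writing(block):
    """block: list of rows, each a list of 0/1 pixels (1 = black)."""
    # Profile along the vertical axis: one black-pixel count per row.
    row_profile = [sum(row) for row in block]
    # Profile along the horizontal axis: one black-pixel count per column.
    col_profile = [sum(col) for col in zip(*block)]
    # Assumed reading of S502: larger variance along the horizontal axis
    # means the text runs in columns, i.e. vertical writing.
    return pvariance(col_profile) > pvariance(row_profile)
```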

[0033] If the text is judged to be horizontal in S502, the process moves on to determining whether it is Japanese or English. First, in S504, the black pixels in the text block are projected horizontally to cut out the line areas. Next, in S505, the black pixels in each line area are projected vertically to cut out the character areas. Both steps split at the gaps in the black-pixel projection. The results are obtained as the coordinates of the circumscribed rectangles enclosing the pixels of each line and each character, as shown below. These data are stored in the memory 102 as line area information and character area information, in association with the block data.

[0034]

[Outside 1]
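The projection-based segmentation of S504 and S505 can be sketched as follows. The notation of the original rectangle definition is not reproduced in this text, so the tuple layout `(x0, y0, x1, y1)`, the dictionary keys, and the function names are assumptions made for illustration only.

```python
def runs(profile):
    """Return inclusive (start, end) index pairs of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


def segment(block):
    """block: list of rows of 0/1 pixels. Returns line areas with character rectangles."""
    lines = []
    # S504: split into lines at the gaps of the row-wise black-pixel profile.
    for ly0, ly1 in runs([sum(row) for row in block]):
        band = block[ly0:ly1 + 1]
        chars = []
        # S505: split each line into characters at the gaps of the column-wise profile.
        for x0, x1 in runs([sum(col) for col in zip(*band)]):
            # Tighten the rectangle vertically around the character's black pixels.
            ys = [ly0 + dy for dy, row in enumerate(band) if any(row[x0:x1 + 1])]
            chars.append((x0, ys[0], x1, ys[-1]))
        lines.append({"y0": ly0, "y1": ly1, "chars": chars})
    return lines
```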

[0035] In S506, the extracted lines are classified one by one. A line is identified as usable for the language type determination in S509 to S515 if the number of characters extracted from it is at or above a threshold, or if its line height is within a specified range. The character-count threshold and the line-height range used for this identification are stored in the memory 102 in advance as parameter data.

[0036] In S507, if not a single usable line has been identified, the process proceeds to S508, where the language attribute of the text block is set to unknown, the language attribute information is stored in the memory 102, and the processing ends. If a usable line exists, the process proceeds to S509.

[0037] In S509, the skew is estimated statistically from the set of character rectangles, and the rectangle coordinates are corrected accordingly. Specifically, a least-squares line is fitted to the distribution of $(x_0^{i,j} - X_0^{i},\ y_0^{i,j} - Y_0^{i})$ over all character rectangles $c_{i,j}$, and its slope $d$ is used to correct the coordinates of the character rectangles.

[0038] From S510 onward, the three determination methods of S510, S512, and S513 are applied in stages, and the language type of the block is decided as soon as a linguistic feature can be extracted from the text block. Each method is described in detail later.

[0039] In S510, the character-width feature is used to check whether Japanese-like characteristics can be extracted from the text block. If the determination yields the Japanese attribute, the process proceeds to S511, where the language attribute of the text block is stored in the memory 102 as Japanese, and the processing ends. If not, the process proceeds to S512. The details of this character-width determination are shown in the flowchart of FIG. 6.

[0040] In S512, the horizontal line degree of the characters is used to check whether Japanese-like characteristics can be extracted from the text block. If the determination yields the Japanese attribute, the process proceeds to S511, where the language attribute of the text block is stored in the memory 102 as Japanese, and the processing ends. If not, the process proceeds to S513. The details of this horizontal-line-degree determination are shown in the flowchart of FIG. 7.

[0041] In S513, the reference-line determination extracts English-like or Japanese-like characteristics from the text block. If the determination yields the English attribute, the process proceeds to S514, where the language attribute of the text block is set to English, and the processing ends. If it yields the Japanese attribute, the process proceeds to S511, where the language attribute of the text block is stored in the memory 102 as Japanese, and the processing ends. If neither language attribute is obtained, the process proceeds to S515, where the language attribute of the text block is stored in the memory 102 as unknown, and the processing ends. The details of this reference-line determination are shown in the flowchart of FIG. 8.
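The staged decision of S510 to S515 can be sketched as a short cascade. The three judge callables and the attribute strings are hypothetical; each judge is assumed to return `"japanese"`, `"english"`, or `None` when it cannot extract a feature.

```python
def determine_language(block, by_width, by_horizontal_lines, by_reference_lines):
    """Apply the three determination methods in order and stop at the first result."""
    # S510: the character-width test can only yield the Japanese attribute.
    if by_width(block) == "japanese":
        return "japanese"  # S511
    # S512: the horizontal-line-degree test can only yield the Japanese attribute.
    if by_horizontal_lines(block) == "japanese":
        return "japanese"  # S511
    # S513: the reference-line test can yield either attribute.
    result = by_reference_lines(block)
    if result == "english":
        return "english"  # S514
    if result == "japanese":
        return "japanese"  # S511
    return "unknown"  # S515
```

Only the last stage can label a block as English, so a block is treated as English only after both Japanese-specific tests have failed to find evidence.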

[0042] The language type determination using the character-width feature in step S510 of FIG. 5 is now described in detail. It exploits the fact that the widths of Japanese and English characters can be separated statistically, and gives the Japanese attribute to text blocks from which Japanese-like features could be extracted. The specific processing is described with reference to the flowchart of FIG. 6.

[0043] In S601, the various counters are cleared. The processing from S602 onward is performed for each line in the text block. In S602, the arithmetic mean $w_i$ of the widths $\{x_1^{i,j} - x_0^{i,j} + 1\}$ of the character rectangles contained in character line $L_i$ is obtained.

[0044] In S603, the number of characters in the line is compared with a threshold Nmin to decide whether the line is suitable for the determination. If the count is Nmin or less, the counter u, which counts lines whose language could not be determined, is incremented by 1 in S604. Otherwise, the process proceeds to S605.

[0045] In S605, the mode of the character pitch between adjacent rectangles, $\{x_0^{i,j+1} - x_0^{i,j} + 1\}$, and the mode of the character spacing, $\{x_0^{i,j+1} - x_1^{i,j} + 1\}$, are obtained, and their difference is taken as $p_i$. The value $p_i$ corresponds to the width of the characters contained in line $L_i$.
[0046] The mode of a numerical sequence $\{s_1, s_2, \ldots, s_n\}$ is obtained from the coordinate at which the following distribution function $D(s)$ takes its maximum (FIG. 11).

[0047]

[Outside 2]
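The per-line quantities of S602 and S605 can be sketched as follows. The distribution function $D(s)$ is not reproduced in this text, so the sketch substitutes the most frequent exact value (`collections.Counter`) for the mode; that substitution, the function names, and the `(x0, x1)` tuple format are assumptions, not the patent's method.

```python
from collections import Counter


def mode(values):
    """Most frequent value; a simple stand-in for the peak of D(s)."""
    return Counter(values).most_common(1)[0][0]


def mean_width(rects):
    """S602: arithmetic mean w_i of the widths x1 - x0 + 1."""
    return sum(x1 - x0 + 1 for x0, x1 in rects) / len(rects)


def width_estimate(rects):
    """S605: p_i = mode(pitch) - mode(spacing).

    rects: (x0, x1) pairs of one line, sorted left to right, at least two entries.
    """
    pairs = list(zip(rects, rects[1:]))
    pitches = [b[0] - a[0] + 1 for a, b in pairs]  # x0 of next minus x0 of current, plus 1
    gaps = [b[0] - a[1] + 1 for a, b in pairs]     # x0 of next minus x1 of current, plus 1
    return mode(pitches) - mode(gaps)
```

Because it is built from modes, `width_estimate` is less affected than `mean_width` by a few unusually narrow or wide rectangles in the line.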

[0048] In S606, $p_i$ divided by the line height is compared with a threshold Pth. If the value is smaller than Pth, the process proceeds to S607, where the line is regarded as English-like and the counter e, which counts lines judged to be English-like, is incremented by 1. If it is larger, the process proceeds to S608, where the line is presumed to be Japanese-like and the counter j, which counts lines judged to be Japanese-like, is incremented by 1.

[0049] In S609, the average character width $w_i$ is compared with

[0050]

[Outside 3] the threshold Wth. If $w_i$ is larger, the process proceeds to S610, where c is incremented by 1. This indicates that the current line contains many character rectangles formed by touching characters; c counts such lines.

[0051] In S611, the line counter L is compared with the number of lines. If they do not match, not all lines have been processed, so L is incremented by 1 in S620 and the process returns to S602. If all lines have been processed, the process proceeds to S612. In S612, it is checked whether the ratio of c to the total number of lines is larger than a threshold Vc. If it is, the process proceeds to S613, where the text block is given an attribute indicating that it contains many touching characters, and the processing ends without determining the language of the block. If the ratio is Vc or less, the process proceeds to S614.

[0052] In S614, the ratio of the number of lines j regarded as Japanese-like to the total number of lines is compared with a threshold Vj. If the ratio is larger than Vj, the process proceeds to S615, where the language attribute information of the text block is stored in the memory 102 as Japanese, and the processing ends. If the ratio is smaller, the processing ends without assigning any attribute.
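The line-by-line voting of S603 to S615 can be sketched as below. Several details are not recoverable from the text and are assumed here: short lines skip the remaining per-line checks, `w` is compared with `w_th` as supplied by the caller (the expression at [Outside 3] is not reproduced), and the dictionary keys and return values are hypothetical.

```python
def judge_by_width(lines, n_min, p_th, w_th, v_c, v_j):
    """lines: dicts with 'n_chars', 'p', 'w', 'height'. Returns an attribute or None."""
    e = j = c = u = 0
    for line in lines:
        if line["n_chars"] <= n_min:  # S603: too few characters to judge
            u += 1                    # S604
            continue                  # assumption: skip S605 to S610 for this line
        if line["p"] / line["height"] < p_th:  # S606
            e += 1  # S607: English-like line
        else:
            j += 1  # S608: Japanese-like line
        if line["w"] > w_th:  # S609
            c += 1            # S610: line with many touching characters
    total = len(lines)
    if total == 0:
        return None
    if c / total > v_c:  # S612
        return "touching"  # S613: no language decision for this block
    if j / total > v_j:  # S614
        return "japanese"  # S615
    return None  # no attribute assigned by this test
```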

[0053] Next, the determination processing using the horizontal-line-degree feature of characters in step S512 of FIG. 5 is described.

[0054] Cut a character vertically at an arbitrary position and count the number of character strokes that the cut crosses. The stroke count at the position where this number is largest is called the horizontal line degree of the character.

[0055] For the characters of English text, such as alphabetic letters and numerals, the horizontal line degree is 4 or less in most typefaces, and none reaches 5. Among Japanese characters, by contrast, many have a horizontal line degree of 5 or more, beginning with kanji such as 「国」. A character whose horizontal line degree is 5 or more can therefore be presumed to be Japanese.

[0056] Step S512 of FIG. 5 obtains this horizontal line degree for every character rectangle in the text block and, if a character rectangle that appears to be a Japanese character is found, gives the Japanese attribute to the text block.

[0057] The specific processing is described with reference to the flowchart of FIG. 7.

[0058] All character rectangles in the text block are tracked by a counter ch, and in S701 the horizontal line degree is obtained for each character in turn.

【００５９】文字の画像を幅２ｄｏｔの帯に分割する。
その各々の帯内の横２ｄｏｔ組の画素値の列を、上から
順に｛ｂ₁，ｂ₂，．．．，ｂ_m｝（｛ｂ_k｝＝｛００，０
１，１０，１１｝）とする。また、列の最初と最後に０
０を補う（ｂ₀＝ｂ_m+1＝００）。The character image is divided into vertical bands 2 dots wide. Within each band, the sequence of 2-dot horizontal pixel pairs, read from top to bottom, is written {b_1, b_2, ..., b_m}, where each b_k is one of 00, 01, 10, 11. The value 00 is added at the beginning and end of the sequence (b_0 = b_{m+1} = 00).

【００６０】このとき、ｂ₀〜ｂ_m+1間で、００（００／０１／１０）＊１１（１１／１０／０１）
＊００というパターンの並びが現れる回数を数える、ここで
（ａ｜ｂ）はａまたはｂのいずれかであること、＊は直
前パターンの０回以上の繰り返しを意味する。即ち、０
１及び１０を除いて００→１１→００となる回数を求め
る。Between b_0 and b_{m+1}, the number of times the pattern 00 (00|01|10)* 11 (11|10|01)* 00 appears is counted. Here (a|b) means either a or b, and * means zero or more repetitions of the immediately preceding pattern. In other words, ignoring 01 and 10, the number of 00 → 11 → 00 transitions is obtained.

【００６１】文字内の全帯についてのおのおのの上記の
パターン出現回数を調べ（図１２）、その最大値をもっ
て、その文字横線度とする。この手法では、前記処理２
ｄｏｔの幅を持たせ、かつ、主に二値化時のノイズによ
る０１，１０のパターンの繰り返しを無視しているの
で、ノイズに対して安定して横線度を計数することがで
きる。The number of occurrences of the above pattern is examined for every band in the character (FIG. 12), and the maximum is taken as the horizontal line degree of that character. Because this method uses a band width of 2 dots and ignores repetitions of the 01 and 10 patterns, which are caused mainly by noise at binarization, the horizontal line degree can be counted stably in the presence of noise.
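The band-wise counting described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the patent: the function names, the representation of the character image as a list of rows of 0/1 pixels (1 = black), and the zero-padding of a final one-column band when the width is odd are all assumptions made for the example.

```python
def band_stroke_count(pairs):
    """Count 00 -> 11 -> 00 transitions in one 2-dot band, ignoring 01 and 10.

    `pairs` is the top-to-bottom sequence of (left, right) pixel values.
    A 00 value is added at both ends, as in the description.
    """
    count = 0
    inside = False  # True once a 11 has been seen and no 00 has followed yet
    for left, right in [(0, 0)] + list(pairs) + [(0, 0)]:
        if left and right:            # 11: we are on a stroke
            inside = True
        elif not left and not right:  # 00: a stroke, if any, has just ended
            if inside:
                count += 1
            inside = False
        # 01 and 10 are treated as noise and change nothing
    return count


def horizontal_line_degree(image):
    """Maximum band_stroke_count over all 2-dot-wide vertical bands."""
    width = len(image[0]) if image else 0
    best = 0
    for x in range(0, width, 2):
        # Assumption: an odd last column is paired with a white (0) column.
        pairs = [(row[x], row[x + 1] if x + 1 < width else 0) for row in image]
        best = max(best, band_stroke_count(pairs))
    return best
```

A glyph made of three full-width horizontal bars separated by blank rows gives a degree of 3 under this sketch, and an isolated 01 or 10 row between two 11 rows does not split a stroke.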

【００６２】Ｓ７０２にて、横線度ｃを閾値Ｃｔｈと比
較する。ｃがＣｔｈより大きければＳ７０４に進み、日
本語属性をテキストブロックに対応させてメモリ１０２に
格納して終了する。さもなくば、Ｓ７０３に進む。Ｓ７
０３にて、ｃｈを全文字数と比較し、一致しない場合は
全文字を処理していないと判定し、Ｓ７０５でｃｈを１
インクリメントして次の文字に対しＳ７０１より繰り返
す。全文字を処理し終えたと判定された場合は、言語属
性は与えずに終了する。In S702, the horizontal line degree c is compared with a threshold Cth. If c is larger than Cth, the process proceeds to S704, where the Japanese attribute is stored in the memory 102 in association with the text block, and the process ends. Otherwise, the process proceeds to S703. In S703, ch is compared with the total number of characters. If they do not match, it is determined that not all characters have been processed, ch is incremented by one in S705, and the process repeats from S701 for the next character. If it is determined that all characters have been processed, the process ends without assigning a language attribute.
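The loop of S701 to S705 amounts to an early-exit scan over the characters of a text block. A minimal sketch, assuming the horizontal line degrees have already been computed and are passed in as a list; the function name and the default Cth = 4 are assumptions (4 is chosen only because the description treats a degree of 5 or more as Japanese).

```python
def has_japanese_character(degrees, c_th=4):
    """Return True as soon as one character's degree exceeds c_th (S702 -> S704).

    Returns False when every character has been examined without a hit,
    which corresponds to ending without assigning a language attribute.
    """
    for c in degrees:   # the counter ch in the description steps through these
        if c > c_th:
            return True
    return False
```

A single high-degree character is enough to mark the block, so the scan usually stops early on Japanese text.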

【００６３】３つ目の判定処理として、図５中のステッ
プＳ５１３における基準線を用いた判定処理について説
明する。As a third determination process, the determination process using the reference line in step S513 in FIG. 5 will be described.

【００６４】図１３に示すように、英語行においては文
字行の上下端とは別に、文字矩形が揃う基準線が上下２
つ存在する。一方、日本語の行ではそのような基準線は
見つからない。この差を利用し、テキストブロック中の
各行からこのような英語行らしい基準線が抽出できれば
英語属性を与え、英語らしくない基準線が抽出できれば
日本語属性を与えるのが、図５中のステップＳ５１３の
判定処理である。As shown in FIG. 13, an English line has, apart from the top and bottom edges of the text line, two reference lines, an upper and a lower one, along which the character rectangles align. No such reference lines are found in a Japanese line. The determination process of step S513 in FIG. 5 exploits this difference: if English-like reference lines can be extracted from the lines in the text block, the English attribute is given, and if the extracted reference lines are not English-like, the Japanese attribute is given.

【００６５】具体的な処理内容について、図８のフロー
チャートを用いて説明する。The specific processing is described with reference to the flowchart of FIG. 8.

【００６６】Ｓ８０１にて各種カウンタをリセットして
から、テキストブロック内のすべての行に対して以下の
Ｓ８０２〜８１１の処理を行う。After the various counters are reset in S801, the following steps S802 to S811 are performed on every line in the text block.

【００６７】Ｓ８０２で行内の文字数を調べ、閾値より
小さかったら、Ｓ８０３にてカウンタｕを＋１する。閾
値以上で有ればＳ８０４に進む。In step S802, the number of characters in the line is checked. If the number is smaller than the threshold value, the counter u is incremented by one in step S803. If it is equal to or greater than the threshold value, the process proceeds to S804.

【００６８】Ｓ８０４では、行内の文字の上端位置及び
下端位置の分布をそれぞれ求める。このときに２種類の
手法が選択的に用いられる。In S804, the distributions of the top-edge positions and of the bottom-edge positions of the characters in the line are obtained. One of two methods is selected for this.

【００６９】ひとつは、行内の文字矩形の上端および下
端座標を用いて分布を求める方法である。具体的には、
行Ｌ_i中の文字に対し、それぞれの行の上端から文字上
端位置までの距離を行高で正規化した数値列｛Ｅ₀ ⁱ｝＝｛（ｙ₀ ^i,j−Ｙ₀ ⁱ）／（Ｙ₁ ⁱ−Ｙ₀ ⁱ＋１）｜
ｊ＝１，．．．，ｎ_i｝，および、同様に行の下端から文字上端位置までの距離を
行高で正規化した数値列｛Ｅ₁ ⁱ｝＝｛（ｙ₁ ⁱ−Ｙ₁ ^i,j）／（Ｙ₁ ⁱ−Ｙ₀ ⁱ＋１）｜
ｊ＝１，．．．，ｎ_i｝，に対して、図１１の分布関数を適用する。これらをそれ
ぞれ分布Ｄ₀ ⁱ，Ｄ₁ ⁱとする。One method obtains the distributions from the top and bottom coordinates of the character rectangles in the line. Specifically, let Y_0^i and Y_1^i be the top and bottom coordinates of line L_i, and y_0^{i,j} and y_1^{i,j} the top and bottom coordinates of its j-th character. The distance from the top of the line to the top of each character, normalized by the line height, gives the sequence {E_0^i} = { (y_0^{i,j} − Y_0^i) / (Y_1^i − Y_0^i + 1) | j = 1, ..., n_i }, and the distance from the bottom of the line to the bottom of each character, normalized in the same way, gives {E_1^i} = { (Y_1^i − y_1^{i,j}) / (Y_1^i − Y_0^i + 1) | j = 1, ..., n_i }. The distribution function of FIG. 11 is applied to each sequence, and the results are called distributions D_0^i and D_1^i, respectively.
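The two normalized sequences can be computed as in the following sketch. It assumes image coordinates that increase downward and character rectangles given as (top, bottom) pairs; the function and parameter names are hypothetical. The distribution function of FIG. 11 is not reproduced here, since its shape is not given in the text.

```python
def normalized_edge_positions(line_top, line_bottom, char_boxes):
    """Return (E0, E1) for one line.

    E0[j] = (char_top - line_top) / line_height
    E1[j] = (line_bottom - char_bottom) / line_height
    with line_height = line_bottom - line_top + 1, as in the description.
    """
    height = line_bottom - line_top + 1
    e0 = [(top - line_top) / height for top, _bottom in char_boxes]
    e1 = [(line_bottom - bottom) / height for _top, bottom in char_boxes]
    return e0, e1
```

For a line spanning rows 10 to 29 (height 20), a character spanning rows 15 to 24 has both values equal to 0.25, while a full-height character has both equal to 0.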

【００７０】もうひとつの手法は、文字矩形を横Ｎｄｏ
ｔ毎に分割し、各々分割された中で黒画素の最上端およ
び最下端の座標を用め、それらの集合から前記と同様に
求めた数値列により分布を求める方法である。The other method divides each character rectangle into vertical strips N dots wide, takes the coordinates of the topmost and bottommost black pixels in each strip, and obtains the distributions from the numerical sequences computed from that set of coordinates in the same way as above.
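The strip-wise extraction of the second method might look like the sketch below. The image format (a list of rows of 0/1 pixels, 1 = black), the function name, and the choice to skip strips that contain no black pixel are assumptions made for illustration.

```python
def strip_extents(image, n):
    """For each n-column strip, return (top_row, bottom_row) of its black pixels.

    Strips that contain no black pixel are skipped (an assumption; the
    description does not say how empty strips are handled).
    """
    height = len(image)
    width = len(image[0]) if image else 0
    extents = []
    for x in range(0, width, n):
        rows = [y for y in range(height) if any(image[y][x:x + n])]
        if rows:
            extents.append((rows[0], rows[-1]))
    return extents
```

The resulting (top, bottom) pairs can be normalized by the line height in the same way as character rectangle coordinates, so a rectangle that actually holds several touching characters still contributes one sample per strip.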

【００７１】本処理中で用いられる文字矩形は、行を黒
画素の垂直射影の切れ目で分割した矩形なので、１４に
示すようなイタリック体の英字や、つぶれによる接触な
どの場合に、ひとつの文字矩形内に実際には複数の文字
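が含まれている可能性がある。The character rectangles used in this process are obtained by splitting a line at the gaps in the vertical projection of its black pixels. Consequently, with italic letters such as those shown in FIG. 14, or with characters that touch because of smearing, a single character rectangle may actually contain several characters.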

【００７２】このときは本手法を用いることで、より正
しく個々の文字の上下端を抽出することが可能になる。In such cases, using this second method makes it possible to extract the top and bottom edges of the individual characters more accurately.

【００７３】なお、分割するｄｏｔ数は、高速化の為に
１バイトのビット数、例えば８の倍数にし、注目行の高
さに応じて文字の平均的幅に近くなるような値を選ぶ。For speed, the strip width in dots is set to a multiple of the number of bits in one byte, for example a multiple of 8, and a value close to the average character width is chosen according to the height of the line being processed.

【００７４】以上、２種類の手法の選択には、各テキス
トブロックに対し図６のステップＳ６１３で与えられた
接触文字属性の有無を用いる。図９は図８のステップＳ
８０４の分布を求める部分をより詳細に示したフローチ
ャートである。Ｓ９０１にて、テキストブロックに接触
文字属性がなければＳ９０２に進んで、文字矩形の上下
端座標から分布を求める。接触文字属性があれば、Ｓ９
０３に進んで、文字矩形を横Ｎｄｏｔ毎に分割した帯そ
れぞれの最上端、最下端の座標から分布を求める。The choice between the two methods is based on whether the text block was given the contact character attribute in step S613 of FIG. 6. FIG. 9 is a flowchart showing in more detail the part of step S804 in FIG. 8 that obtains the distributions. In S901, if the text block does not have the contact character attribute, the process proceeds to S902 and obtains the distributions from the top and bottom coordinates of the character rectangles. If it has the attribute, the process proceeds to S903 and obtains the distributions from the topmost and bottommost coordinates of each strip produced by dividing the character rectangles every N dots horizontally.

【００７５】図８の処理フローに戻って、Ｓ８０５では
分布Ｄ₀ ⁱ，Ｄ₁ ⁱのそれぞれを、その形状から図１６に示
される分布のいずれかに分類してコードを与える。Returning to the processing flow of FIG. 8, in S805 each of the distributions D_0^i and D_1^i is classified by its shape into one of the distribution types shown in FIG. 16 and assigned a code.

【００７６】図１６の分類について説明する。分布関数
Ｄの横軸に対しそれぞれ０．２，０．１２５のところに
基準位置Ｇ₁，Ｇ₂を設け（図１５）、分布中で一定値を
超える高さのピークの個数、およびそれらがＧ₁，Ｇ₂の
左右どちらにあるかによって図１６の１〜４，６，７，
８およびＸのいずれかに分類する。The classification of FIG. 16 is as follows. Reference positions G_1 and G_2 are placed at 0.2 and 0.125, respectively, on the horizontal axis of the distribution function D (FIG. 15). Each distribution is then classified as one of types 1 to 4, 6, 7, 8, and X in FIG. 16 according to the number of peaks whose height exceeds a fixed value and whether those peaks lie to the left or right of G_1 and G_2.

【００７７】例えば、ピークがひとつだけの場合は、そ
の位置によって１，２，６のいずれかになる。ピークが
２つある場合で、右のピークがＧ₁の左にある場合は、
左右のピークの高さのどちらが高いかにより、３または
４に分類される。右のピークがＧ₁の右にある場合は、
左右のピークの高さによって６または７に分類される。
ただし、Ｇ₁より右にピークがある状態で、さらにその
右にピークが存在する場合はＸに分類される。For example, a distribution with a single peak is classified as 1, 2, or 6 depending on the position of the peak. When there are two peaks and the right peak lies to the left of G_1, the distribution is classified as 3 or 4 depending on which of the two peaks is higher. When the right peak lies to the right of G_1, it is classified as 6 or 7 depending on the heights of the two peaks. However, if there is a peak to the right of G_1 and a further peak to its right, the distribution is classified as X.

【００７８】ここで、図１６中のそれぞれの分布列にお
いてひとつのピークを差している太矢印は、注目行にお
ける基準線の位置に相当する。ただし、分類コードが１
の場合は、基準線はピークではなく０の位置にあるとす
る。これらの分布の判断基準をメモリ１０２に予め記憶
しておき、Ｓ８０５ではこの基準に基づいてコード化す
る。In each distribution shown in FIG. 16, the thick arrow pointing at one of the peaks corresponds to the position of the reference line in the line being processed. For classification code 1, however, the reference line is taken to be at position 0 rather than at the peak. The criteria for classifying these distributions are stored in the memory 102 in advance, and the coding in S805 is performed according to them.

【００７９】Ｓ８０６にて、各分布から定める上下の基
準線の間の距離Ｗｂを求める。In S806, the distance Wb between the upper and lower reference lines derived from the two distributions is obtained.

【００８０】Ｓ８０７にて、分布Ｄ_０ ^ｉ，Ｄ_１ ^ｉの分類
コードの組合せから、行Ｌ_iに対し、Ｊ（日本語行），
Ｅ（英語行），Ｍ（わからない）のいずれかの属性を与
える。この関係を図１７に示す。図１７のテーブルはメ
モリ１０２に予め記憶しておくものである。In S807, line L_i is given one of the attributes J (Japanese line), E (English line), or M (undetermined) according to the combination of the classification codes of the distributions D_0^i and D_1^i. This relationship is shown in FIG. 17. The table of FIG. 17 is stored in the memory 102 in advance.

【００８１】行にあたえられた属性がＪの場合はＳ８０
８に進み、日本語らしいと判定された行をカウントする
カウンタｊを＋１する。行にあたえられた属性がＥの場
合はＳ８０９に進み、英語らしいと判定された行をカウ
ントするカウンタｅを＋１する。行にあたえられた属性
がＭの場合はＳ８１０に進み、言語種が不明であると判
定された行をカウントするカウンタｍを＋１する。If the attribute given to the line is J, the process proceeds to S808 and increments by one the counter j, which counts lines judged to be Japanese-like. If the attribute is E, the process proceeds to S809 and increments by one the counter e, which counts lines judged to be English-like. If the attribute is M, the process proceeds to S810 and increments by one the counter m, which counts lines whose language type is undetermined.

【００８２】Ｓ８１１で、カウンタＬと全行数とを比較
し、一致していなければすべての行の処理が済んでいな
いと判定してＳ８３０でＬを１インクリメントしＳ８０
１に戻って次の行に対してＳ８０２〜８１１の処理を繰
り返す。Ｓ８１１でＬが全行数と一致した場合はＳ８１
２に進む。In S811, the counter L is compared with the total number of lines. If they do not match, it is determined that not all lines have been processed; L is incremented by one in S830, the process returns to S801, and S802 to S811 are repeated for the next line. If L matches the total number of lines in S811, the process proceeds to S812.

【００８３】Ｓ８１２にて、カウンタｅ，ｊがともに０
であるとき、テキストブロックの言語は特定できないと
して、言語属性をあたえずそのまま終了する。そうでな
ければＳ８１３に進む。In S812, if counters e and j are both 0, the language of the text block is deemed unidentifiable and the process ends without assigning a language attribute. Otherwise, the process proceeds to S813.

【００８４】Ｓ８１３で、カウンタｅが０で、ｊが０よ
り大のときは、テキストブロックは日本語と考えられる
ので、Ｓ８１４に進んでテキストブロックに対応づけて
日本語属性をメモリ１０２に格納して終了する。そうで
なければＳ８１５に進む。In S813, if counter e is 0 and j is greater than 0, the text block is considered to be Japanese, so the process proceeds to S814, stores the Japanese attribute in the memory 102 in association with the text block, and ends. Otherwise, the process proceeds to S815.

【００８５】Ｓ８１５では、すべてのＥ属性を与えられ
た行の基準線の間の距離Ｗｂ、に対し、その平均値を計
算してＷとする。In S815, the average of the reference-line distances Wb of all lines given the E attribute is calculated and called W.

【００８６】Ｓ８１６では、カウンタｊ’をリセット
し、カウンタｅ’には英語属性の行数ｅを与える。すべ
てのＪまたはＭ属性の行に対してＳ８１７〜８２０の処
理を行う。In S816, the counter j' is reset and the counter e' is set to e, the number of lines with the English attribute. Steps S817 to S820 are then performed on every line with the J or M attribute.

【００８７】Ｓ８１７にてＷとＷｂの差を求め、その差
が閾値より小さい場合は、Ｓ８１８に進んでカウンタ
ｅ’を＋１とする。これは、現在の注目行が、既に特定
されているＥ属性の行群と同幅の基準線間隔を持つ場
合、英語らしい行の集合に加える、という処理である。
差が閾値以上ならば、Ｓ８１９に進んでカウンタｊ’を
＋１し、英語らしくない行としてカウントする。In S817, the difference between W and Wb is obtained. If the difference is smaller than a threshold, the process proceeds to S818 and increments counter e' by one. That is, when the line being processed has the same reference-line spacing as the group of lines already identified as having the E attribute, it is added to the set of English-like lines. If the difference is equal to or larger than the threshold, the process proceeds to S819, increments counter j' by one, and counts the line as not English-like.

【００８８】処理していないＪまたはＭ属性の行があっ
たら、次のＪまたはＭ属性行に対し、Ｓ８１７〜８２０
をくり返す。すべて処理ずみならＳ８２１に進む。If there is a line with the J or M attribute that has not been processed, the process proceeds to S817-820 for the next J or M attribute line.
Repeat. If all have been processed, the process proceeds to S821.

【００８９】Ｓ８２１では、ｅ’の数、すなわち英語属
性の行が全行数に占める割合を求める。この割合が閾値
より大きいときはＳ８２３に進み、テキストブロックに
英語属性を与えて終了する。そうでなければＳ８２２に
進む。In S821, the ratio of e', that is, the number of lines with the English attribute, to the total number of lines is obtained. If this ratio is larger than a threshold, the process proceeds to S823, gives the English attribute to the text block, and ends. Otherwise, the process proceeds to S822.

【００９０】Ｓ８２２では、ｊ’の数が全行数に占める
割合を求める。割合が閾値より大きいときはＳ８１４に
進み、テキストブロックに日本語属性を与えて終了す
る。そうでなければ言語判定は不能として、言語属性を
あたえず終了する。In S822, the ratio of j' to the total number of lines is obtained. If the ratio is larger than a threshold, the process proceeds to S814, gives the Japanese attribute to the text block, and ends. Otherwise, language determination is deemed impossible and the process ends without assigning a language attribute.
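The block-level decision of S812 to S822 can be summarized in a short sketch. It assumes each line has already been given an attribute ("J", "E", or "M") and a reference-line distance Wb, and that lines skipped as too short still count toward the total. The function name, the parameter names, and all three threshold values are hypothetical, since the description does not state them.

```python
def decide_block_language(lines, total_lines=None,
                          width_tol=2.0, ratio_e=0.5, ratio_j=0.5):
    """lines: list of (attribute, wb) pairs. Returns "English", "Japanese", or None."""
    total = total_lines if total_lines is not None else len(lines)
    e_widths = [wb for attr, wb in lines if attr == "E"]
    e = len(e_widths)
    j = sum(1 for attr, _wb in lines if attr == "J")

    if e == 0 and j == 0:            # S812: nothing to go on
        return None
    if e == 0:                       # S813: only Japanese-like lines were found
        return "Japanese"

    w = sum(e_widths) / e            # S815: mean reference-line distance of E lines
    e2, j2 = e, 0                    # S816: e' starts at e, j' at 0
    for attr, wb in lines:           # S817 to S820: re-examine J and M lines
        if attr in ("J", "M"):
            if abs(w - wb) < width_tol:
                e2 += 1              # same spacing as the E lines: English-like
            else:
                j2 += 1              # different spacing: not English-like

    if e2 / total > ratio_e:         # S821
        return "English"
    if j2 / total > ratio_j:         # S822
        return "Japanese"
    return None
```

The second pass lets undetermined lines be counted as English when their reference-line spacing agrees with lines already judged English, which is the purpose of the average W in the description.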

【００９１】以上述べたように、本発明によれば、文書
中のテキストブロック内に書かれた文章が英語文字認識
部によってより高い認識率でコード化できる英語文字の
みの文章か、日本語文字認識部でしか認識できない日本
語文字を含む文章かが自動的に判定される。その結果、
日英２つの言語がページ単位あるいはページ内で混在す
る文書を対象にした文字認識装置において、英語と日本
語それぞれに対して高精度な２つの認識部を、ユーザが
いちいち言語指定することなく、認識対象に応じて自動
的に使い分けることができるので、省力化と高い認識精
度を同時に得ることができ、文字認識装置の実用性を大
きく向上することができる。As described above, according to the present invention, it is automatically determined whether the text written in a text block of a document consists only of English characters, which the English character recognition unit can encode at a higher recognition rate, or contains Japanese characters, which only the Japanese character recognition unit can recognize. As a result, in a character recognition apparatus for documents in which Japanese and English are mixed page by page or within a page, the two high-accuracy recognition units for English and Japanese can be selected automatically according to the recognition target without the user having to specify the language each time. Labor saving and high recognition accuracy are thus obtained together, greatly improving the practicality of the character recognition apparatus.

【００９２】先の説明では、日本語と英語の言語判別を
例に挙げて説明したが、本発明はこれに限ることなく、
日本語と他のアルファベット言語、例えばフランス語や
ドイツ語の判別、あるいは、中国語とアルファベット言
語の判別に用いてもよい。その場合も、認識部をユーザ
がいちいち言語指定することなく、認識対象に応じて自
動的に使い分けることができるようになるので、省力化
と高い認識精度を同時に得ることが出来、文字認識装置
の実用性を向上することができる。The above description used discrimination between Japanese and English as an example, but the present invention is not limited to this. It may also be used to discriminate between Japanese and other alphabetic languages, such as French or German, or between Chinese and alphabetic languages. In those cases as well, the recognition units can be selected automatically according to the recognition target without the user having to specify the language each time, so labor saving and high recognition accuracy are obtained together and the practicality of the character recognition apparatus is improved.

【００９３】また、先の説明では、言語判定及び文字認
識の処理をテキストブロックに対して選択的に行う例に
ついて説明したが、これらの処理は、文字を含む表ブロ
ック及び図ブロック、或は画像ブロックや写真ブロック
に対して行っても良い。The above description covered an example in which language determination and character recognition are performed selectively on text blocks, but these processes may also be performed on table blocks and figure blocks that contain characters, or on image blocks and photograph blocks.

【００９４】[0094]

【発明の効果】以上説明したように、本発明によれば、
画像から複数の文字領域を抽出し、前記抽出された複数
の文字領域各々について言語種を判別し、前記判別され
た言語種に従って、各文字領域に適した文字認識処理を
各々別個に処理することにより、言語種をオペレータが
視認し、指示することなく、高認識率を得られる認識を
可能とするので、操作性及び認識率の両方で大きな効果
を得られる。As described above, according to the present invention, a plurality of character areas are extracted from an image, a language type is determined for each of the extracted character areas, and character recognition processing suited to each character area is performed separately in accordance with the determined language type. Recognition with a high recognition rate is thereby possible without the operator having to identify and specify the language type, which yields a large benefit in both operability and recognition rate.

【００９５】以上説明したように、本発明によれば、前
記文字領域内の画像から文字を抽出し、各文字の横幅に
基づいて当該文字領域の言語種を判別することにより、
文字の横幅という、言語種により特徴の異なる情報を用
いて言語種を自動判別するので、言語種の判別の精度が
向上する。As described above, according to the present invention, characters are extracted from the image in the character area and the language type of the character area is determined based on the width of each character. Because the language type is determined automatically from character width, a feature that differs between language types, the accuracy of language type determination is improved.

【００９６】以上説明したように、本発明によれば、前
記文字の横幅の行毎の平均値を求め、予め定められた閾
値との比較に基づいて前記言語種の判別を行なうことに
より、文字領域全体の特性を考慮して言語種を判別でき
るので、言語種の判別の精度が向上する。As described above, according to the present invention, the average value of the horizontal width of the character for each line is obtained, and the language type is determined based on comparison with a predetermined threshold value. Since the language type can be determined in consideration of the characteristics of the entire region, the accuracy of determining the language type is improved.

【００９７】以上説明したように、本発明によれば、前
記文字の横幅の行毎の平均値と閾値との比較の結果、平
均値が閾値を上回る行数と全行数との割合に基づいて前
記言語種の判別を行なうことにより、文字領域全体の特
性を考慮して言語種を判別できるので、言語種の判別の
精度が向上する。As described above, according to the present invention, as a result of comparing the average value of the width of the character for each line with the threshold, the average value is calculated based on the ratio of the number of lines exceeding the threshold to the total number of lines. By performing the above-described language type determination, the language type can be determined in consideration of the characteristics of the entire character area, so that the accuracy of the language type determination is improved.

【００９８】以上説明したように、本発明によれば、行
内の文字のピッチ及び間隔について、各々の分布のピー
クから最頻値を求め、その差を前記行毎の平均値とする
ことにより、文字領域全体の特性を考慮して言語種を判
別できるので、言語種の判別の精度が向上する。As described above, according to the present invention, the mode of the pitch and spacing of characters in a line is obtained from the peak of each distribution, and the difference is used as the average value for each line. Since the language type can be determined in consideration of the characteristics of the entire character area, the accuracy of determining the language type is improved.

【００９９】以上説明したように、本発明によれば、前
記文字領域内の画像から文字を抽出し、各文字に含まれ
る横線の数に基づいて当該文字領域の言語種を判別する
ことにより、文字のピッチ及び間隔という、言語種によ
り特徴の異なる情報を用いて言語種を自動判別するの
で、言語種の判別の精度が向上する。As described above, according to the present invention, characters are extracted from an image in the character area, and the language type of the character area is determined based on the number of horizontal lines included in each character. Since the language type is automatically determined using information having different characteristics depending on the language type, such as the character pitch and spacing, the accuracy of the language type determination is improved.

【０１００】以上説明したように、本発明によれば、前
記抽出した文字の画像を横幅２ドットずつに分割して縦
に走査し、特定の並びパターンを計数した結果に基づい
て前記横線の数を定めることにより、ノイズや線のかす
れに影響されず、正確な横線の数を判定することができ
る。As described above, according to the present invention, the image of the extracted character is divided into two dots in width and scanned vertically, and the number of the horizontal lines is determined based on the result of counting the specific arrangement pattern. , The number of horizontal lines can be accurately determined without being affected by noise or blurring of lines.

【０１０１】以上説明したように、本発明によれば、前
記文字領域内の画像から行を抽出し、各行の基準線に基
づいて当該文字領域の言語種を判別することにより、文
字領域全体の特性を考慮して言語種を判別できるので、
言語種の精度が向上する。As described above, according to the present invention, lines are extracted from the image in the character area and the language type of the character area is determined based on the reference line of each line. The language type can thus be determined in consideration of the characteristics of the entire character area, so the accuracy of language type determination is improved.

【０１０２】以上説明したように、本発明によれば、前
記行の基準線は、行に含まれる文字の上下端を表わす線
とすることにより、文字の上下端という、言語種により
異なる特性に基づいて言語種を判別するので、言語種の
自動判別の精度が向上する。As described above, according to the present invention, the reference line of the line is a line representing the upper and lower ends of the characters included in the line, so that the upper and lower ends of the characters are different depending on the language type. Since the language type is determined based on the language type, the accuracy of the automatic determination of the language type is improved.

【０１０３】以上説明したように、本発明によれば、前
記文字領域内の画像から文字を抽出し、各文字の画像を
一定幅に分割し、各々の中で検出される黒画素の上下端
の分布から各行の基準線を定めることにより、接触文字
にも対応することができる。As described above, according to the present invention, characters are extracted from the image in the character area, the image of each character is divided into a fixed width, and the upper and lower ends of the black pixels detected in each are divided. By determining the reference line of each line from the distribution of, it is possible to deal with contact characters.

【０１０４】以上説明したように、本発明によれば、前
記言語種の判別方法を複数備えることにより、文字の形
やレイアウト、字体が異なる場合にも適切に対応でき
る。As described above, according to the present invention, by providing a plurality of the language type discriminating methods, it is possible to appropriately cope with a case where the shape, layout, and font of a character are different.

【０１０５】以上説明したように、本発明によれば、前
記抽出された文字領域の行方向を判別し、判別された行
方向が特定の方向である場合には、前記言語種判別を行
なわず、予め定めた言語種と定めることを特徴とする請
求項１に記載の文字処理方法。As described above, according to the present invention, the character processing method according to claim 1 determines the line direction of the extracted character area and, when the determined line direction is a specific direction, assigns a predetermined language type without performing the language type determination.

【０１０６】以上説明したように、本発明によれば、前
記抽出された文字領域の位置を表わす情報と、前記判別
された言語種を表わす情報とを対応付けて記憶すること
により、判別された言語種に応じて適した文字認識処理
を行なうことが容易になる。As described above, according to the present invention, the information indicating the position of the extracted character area is stored in association with the information indicating the determined language type. Character recognition processing suitable for the language type can be easily performed.

【０１０７】以上説明したように、本発明によれば、前
記画像情報及び前記文字認識の結果を同じ出力装置によ
り出力することにより、文字認識対象の画像の確認及び
認識結果の確認がし易くなる。As described above, according to the present invention, by outputting the image information and the result of the character recognition by the same output device, it becomes easy to confirm the image of the character recognition target and to confirm the recognition result. .

[Brief description of the drawings]

【図１】本発明に係る装置のハード構成図FIG. 1 is a hardware configuration diagram of an apparatus according to the present invention.

【図２】本発明に係る機能ブロック図FIG. 2 is a functional block diagram according to the present invention.

【図３】本発明に係る全体的な処理のフローチャートFIG. 3 is a flowchart of an overall process according to the present invention.

【図４】図３中ステップＳ３０５の処理詳細のフローチ
ャートFIG. 4 is a flowchart showing details of a process in step S305 in FIG. 3;

【図５】図３中ステップＳ３０４の処理詳細のフローチ
ャートFIG. 5 is a flowchart showing details of a process in step S304 in FIG. 3;

【図６】図５中ステップＳ５１０の処理詳細のフローチ
ャートFIG. 6 is a flowchart showing details of a process in step S510 in FIG. 5;

【図７】図５中ステップＳ５１１の処理詳細のフローチ
ャートFIG. 7 is a flowchart showing details of a process in step S511 in FIG. 5;

【図８】図５中ステップＳ５１３の処理詳細のフローチ
ャートFIG. 8 is a flowchart showing details of processing in step S513 in FIG. 5;

【図９】図８中ステップＳ８０４の処理詳細のフローチ
ャートFIG. 9 is a flowchart showing details of the process in step S804 in FIG. 8.

【図１０】領域分割について説明する図FIG. 10 is a diagram illustrating region division.

【図１１】分布関数について説明する図FIG. 11 illustrates a distribution function.

【図１２】文字横線度について説明する図FIG. 12 is a diagram illustrating character horizontal line degree.

【図１３】文字行の基準線について説明する図FIG. 13 is a diagram illustrating a reference line of a character line.

【図１４】イタリック体の文字端位置抽出について説明
する図FIG. 14 is a view for explaining extraction of character end positions in italics

【図１５】文字端位置の分布について説明する図FIG. 15 is a diagram illustrating the distribution of character end positions.

【図１６】文字端位置の分布の分類について説明する図FIG. 16 is a view for explaining the classification of the distribution of character end positions.

【図１７】上下の文字端位置と行属性の対応について説
明する図FIG. 17 is a diagram for explaining correspondence between upper and lower character end positions and line attributes;

Claims

[Claims]

1. A character processing method characterized by extracting a plurality of character regions from an image, determining a language type for each of the extracted character regions, and separately performing, for each character region, character recognition processing suited to that region in accordance with the determined language type.

2. The character processing method according to claim 1, wherein characters are extracted from the image in the character region, and a language type of the character region is determined based on a width of each character.

3. The character processing method according to claim 2, wherein an average value of the character widths is obtained for each line, and the language type is determined based on a comparison of the average value with a predetermined threshold.

4. The image processing method according to claim 3, wherein the language type is determined based on the ratio of the number of lines whose average value exceeds the threshold to the total number of lines, as a result of comparing the per-line average character width with the threshold.

5. The character processing method according to claim 3, wherein, for the pitch and the spacing of the characters in a line, a mode is obtained from the peak of each distribution, and the difference between the modes is used as the average value for each line.

6. The character processing method according to claim 1, wherein characters are extracted from the image in the character area, and the language type of the character area is determined based on the number of horizontal lines included in each character.

7. The character processing method according to claim 6, wherein the image of each extracted character is divided into strips two dots wide and scanned vertically, and the number of horizontal lines is determined based on the result of counting a specific arrangement pattern.

8. The character processing method according to claim 1, wherein lines are extracted from the image in the character area, and the language type of the character area is determined based on a reference line of each line.

9. The character processing method according to claim 8, wherein the reference line of the line is a line representing upper and lower ends of characters included in the line.

10. The character processing method according to claim 9, wherein characters are extracted from the image in the character area, the image of each character is divided into strips of fixed width, and the reference line of each line is determined from the distribution of the top and bottom ends of the black pixels detected in each strip.

11. The character processing method according to claim 1, further comprising a plurality of language type determination methods.

12. The character processing method according to claim 1, wherein the line direction of the extracted character area is determined, and when the determined line direction is a specific direction, the language type determination is not performed and a predetermined language type is assigned.

13. The character processing method according to claim 1, wherein information representing the position of the extracted character area and information representing the determined language type are stored in association with each other.

14. The character processing method according to claim 1, wherein the image information and the result of the character recognition are output by the same output device.

15. The character processing method according to claim 1, wherein one of the determined language types is Japanese.

16. The character processing method according to claim 1, wherein one of the determined language types is English.

17. A character processing apparatus comprising: character area extracting means for extracting a plurality of character areas from an image; language type determining means for determining a language type for each of the plurality of character areas extracted by the character area extracting means; and character recognition control means for separately performing, for each character area, character recognition processing suited to that area in accordance with the language type determined by the language type determining means.

18. The character processing apparatus according to claim 17, wherein the language type determining means extracts characters from the image in the character area and determines the language type of the character area based on the width of each character.

19. The character processing apparatus according to claim 18, wherein the language type determining means obtains an average value of the character widths for each line and determines the language type based on a comparison with a predetermined threshold.

20. The image processing apparatus according to claim 19, wherein the language type determining means determines the language type based on the ratio of the number of lines whose average value exceeds the threshold to the total number of lines, as a result of comparing the per-line average character width with the threshold.

21. The character processing apparatus according to claim 19, wherein, for the pitch and the spacing of the characters in a line, a mode is obtained from the peak of each distribution, and the difference between the modes is used as the average value for each line.

22. The character processing apparatus according to claim 17, wherein the language type determining means extracts characters from the image in the character area and determines the language type of the character area based on the number of horizontal lines included in each character.

23. The character processing apparatus according to claim 22, wherein the image of each extracted character is divided into strips two dots wide and scanned vertically, and the number of horizontal lines is determined based on the result of counting a specific arrangement pattern.

24. The character processing apparatus according to claim 17, wherein the language type determining means extracts lines from an image in the character area, and determines the language type of the character area based on a reference line of each line.

25. The character processing apparatus according to claim 24, wherein the reference line of a line is a line representing the upper and lower ends of the characters included in the line.

26. The character processing apparatus according to claim 25, wherein characters are extracted from an image in the character area, the image of each character is divided into sections of a fixed width, and the reference line of each line is determined from the distribution of the upper and lower ends of black pixels detected in each character.

27. The character processing apparatus according to claim 17, wherein said language type determining means includes a plurality of language type determining methods.

28. The character processing apparatus according to claim 17, further comprising line direction determining means for determining the line direction of a character area extracted by the character area extracting means, wherein, if the line direction determined by the line direction determining means is a specific direction, the language type is decided without performing the language type determination.

29. The character processing apparatus according to claim 17, further comprising storage means for storing information indicating the position of the extracted character area and information indicating the determined language type in association with each other.

30. The character processing apparatus according to claim 17, further comprising output means for outputting the image information and the result of the character recognition.

31. The character processing apparatus according to claim 17, wherein one of the determined language types is Japanese.

32. The character processing apparatus according to claim 17, wherein one of the determined language types is English.

33. A character processing apparatus according to claim 30, wherein said output means is an ink jet printer.

34. The character processing apparatus according to claim 17, further comprising a scanner for reading the image information.

35. A computer-readable storage medium storing: a control program for extracting a plurality of character areas from an image; a control program for determining a language type for each of the extracted plurality of character areas; and a control program for separately performing character recognition processing suitable for each character area.
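
For illustration only, the width-based determination recited in claims 18 to 20, followed by the per-area selection of a recognizer, might be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the function names, both threshold values, and the mapping of wide characters to Japanese are hypothetical, and the character widths are assumed to have been measured already for each line of the character area.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical values: the claims only recite "a predetermined threshold"
# and "a ratio"; no concrete numbers are given.
WIDTH_THRESHOLD = 20.0       # assumed average character width (pixels)
LINE_RATIO_THRESHOLD = 0.5   # assumed fraction of "wide" lines


def classify_area(line_char_widths: Sequence[Sequence[float]],
                  width_threshold: float = WIDTH_THRESHOLD,
                  line_ratio_threshold: float = LINE_RATIO_THRESHOLD) -> str:
    """Decide a language type for one character area.

    line_char_widths holds, for each line, the widths of its characters.
    """
    # Average horizontal character width for each non-empty line (claim 19).
    line_means = [mean(widths) for widths in line_char_widths if widths]
    if not line_means:
        raise ValueError("character area contains no characters")

    # Ratio of lines whose average exceeds the threshold (claim 20).
    wide_lines = sum(1 for m in line_means if m > width_threshold)
    ratio = wide_lines / len(line_means)

    # Assumption: wide characters are taken to indicate Japanese text.
    return "japanese" if ratio > line_ratio_threshold else "english"


def recognize_area(area_image: str,
                   line_char_widths: Sequence[Sequence[float]],
                   recognizers: Dict[str, Callable[[str], str]]
                   ) -> Tuple[str, str]:
    """Run the recognizer that matches the determined language type."""
    language = classify_area(line_char_widths)
    return language, recognizers[language](area_image)
```

Here `recognizers` is a hypothetical mapping from a language type to a recognition routine, so that each character area is handled by the routine suited to it; a real system would pass image data rather than the placeholder string type used in this sketch.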