JP3243389B2

JP3243389B2 - Document identification method

Info

Publication number: JP3243389B2
Application number: JP03659995A
Authority: JP
Inventors: 和典高津
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-02-24
Filing date: 1995-02-24
Publication date: 2002-01-07
Anticipated expiration: 2017-01-07
Also published as: JPH08235313A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a document identification method that identifies the type of a document with high accuracy, for example distinguishing a Japanese document from an English document.

[0002]

[Prior Art] As information is increasingly digitized, storing document data and then searching, browsing, and reusing it has become common in offices and many other fields. As one way of converting document content into data, character recognition, which converts document content printed on paper into text data, is growing in importance. In character recognition, handling English text and Japanese text separately is advantageous for performance, because language-specific processing can be applied to each.

[0003] Conventional techniques for identifying the document or character type in this way include, for example, the apparatuses described in Japanese Patent Application Laid-Open Nos. 4-346188 and 4-346189. The former extracts vertical and horizontal lines of at least a certain length from the image data of a character line cut out by region segmentation, extracts the closed regions (rectangles within characters) enclosed by those lines, and decides from their number whether the character line is Japanese or English. The latter scans the image data of a character line cut out by region segmentation, measures the distances between characters, and distinguishes English from Japanese based on the distribution of those distances.

[0004]

[Problems to Be Solved by the Invention] The conventional apparatuses described above take a local approach, determining the character type line by line within character regions that have been cut out by region segmentation. This is convenient as preprocessing immediately before character recognition, but it cannot be applied at the stage before region segmentation, when the character regions have not yet been determined (that is, it cannot serve as preprocessing for region segmentation). In addition, using these methods to decide globally whether a whole document, or a reasonably large region, is English or Japanese requires running the entire pipeline from region segmentation onward, which takes a long processing time and a large amount of memory. Furthermore, because they identify the document or character type from the character size or the distribution of inter-character distances, they are strongly affected by differences in font and size.

[0005] An object of the present invention is to provide a document identification method that identifies documents with high accuracy without being affected by differences in font or character size.

[0006]

[Means for Solving the Problems] To achieve the above object, the invention of claim 1 is a document identification method for identifying the type of an input document image or of a partial region of that image, in which: a pattern cut out from the document image is matched against registered templates; if the pattern matches no template and is therefore new, it is registered as a new template and given a new symbol; if it is similar to an already registered template, the pattern is replaced with the symbol assigned to that template and the count for that symbol is updated; the appearance frequency distribution of the symbols in the document image or partial region is calculated; and the type of the document or partial region is identified based on that distribution.

[0007] The invention of claim 2 is characterized in that patterns of a predetermined size or smaller are excluded when the appearance frequency distribution of the symbols is calculated.

[0008] The invention of claim 3 is characterized in that the type of the document or partial region is identified based on the average number of patterns represented by one template.

[0009] The invention of claim 4 is characterized in that the type of the document or partial region is identified based on the number of template types.

[0010] The invention of claim 5 is characterized in that the type of the document or partial region is identified based on the number of template types that have a high appearance frequency.

[0011] The invention of claim 6 is characterized in that the type of the document or partial region is identified based on the number of template types that have a high appearance frequency and on the total number of patterns.

[0012]

[Operation] A document read in from a scanner or similar device is stored in memory as document image data. Connected components are cut out of the stored document image data and stored in the pattern memory. Each cut-out pattern is matched against the previously registered templates. If the pattern is new, it is registered as a template, and the new template is given a new symbol, which is stored in the symbol string storage memory. If the pattern is highly similar to a registered template, the count for the symbol assigned to that known template is updated in the symbol string storage memory. The frequency distribution of the symbols is then calculated by referring to the symbol string storage memory, and Japanese text, English text, and so on are distinguished based on that distribution. In another embodiment, small patterns are excluded before the symbol frequency distribution is calculated, so that documents containing picture regions are also identified with high accuracy.

[0013]

[Embodiments] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 shows the configuration common to all embodiments of the present invention. For simplicity, the following description covers processing in units of whole documents, but the present invention can also be applied to regions separated in advance by region identification.

[0014] The image input unit 101 inputs a document as image data, which is stored in the image memory 106. The image input unit is, for example, a scanner, but the data may also come from an image data file or a fax. The connected component extraction unit 102 cuts the image data stored in the image memory 106 into connected components and stores them in the pattern memory 107. As the cutting-out method, characters may be cut out by computing the vertical and horizontal distributions (projections) of black pixels, or connected black-pixel components may be extracted and only those that are vertically adjacent merged.
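As a rough illustration of the second option, extracting connected black-pixel components, the sketch below labels components with a breadth-first flood fill. It is only a sketch under stated assumptions: the function name `connected_components`, the representation of the image as strings with `#` for black pixels, and the use of 8-connectivity are all hypothetical choices, since the text does not specify them. The merging of vertically adjacent components is not shown.

```python
from collections import deque


def connected_components(image):
    """Return one set of (row, col) pixels per 8-connected black component.

    `image` is a list of equal-length strings; '#' marks a black pixel.
    (Both the representation and 8-connectivity are assumptions.)
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != '#' or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(r, c)])
            seen.add((r, c))
            pixels = set()
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == '#'
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(pixels)
    return components
```

Each returned pixel set would correspond to one pattern stored in the pattern memory 107.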

[0015] The pattern matching unit 103 matches each extracted pattern against the templates registered in the template memory 108. If the pattern matches none of them and is therefore new, it is registered in the template memory 108 as a new template, and a symbol (for example, a number) is assigned to the new template and stored in the symbol string storage memory 109.

[0016] If the pattern is highly similar to an already registered template, the count of the symbol assigned to that template is updated and stored in the symbol string storage memory 109. In other words, the symbol string storage memory 109 stores each distinct pattern that has appeared as a symbol (symbolization), together with the count for each symbol (that is, the number of similar patterns). The frequency distribution calculation unit 104 calculates the frequency distribution of the symbols stored in the symbol string storage memory 109. The document identification unit 105 uses the resulting frequency distribution to decide, for example, whether the text is Japanese or English.

[0017] Before the embodiments are described, the conventional image compression method used by the present invention is explained. This method cuts connected components out of a document image, treats each connected component as one pattern and registers it as a template, and reduces the image information by replacing similar patterns among the cut-out patterns with the template (see U.S. Pat. No. 5,303,313). Similarity is judged with the test method described in paragraph 9 of that publication. That is, the patterns are overlaid while their positions are corrected, similarity is decided from the positions at which differing pixels appear and from the pattern those pixels form, and similar patterns are given the same symbol.
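The general idea of overlaying two patterns with position correction can be sketched as follows. This is not the test from the cited U.S. patent, whose details are not reproduced here; it is a simplified stand-in that tries small shifts and counts differing pixels. The function names, the set-of-pixels representation, and both thresholds (`max_shift`, `max_mismatch_ratio`) are hypothetical.

```python
def mismatch(a, b, dy, dx):
    """Count pixels that differ when pattern b is shifted by (dy, dx).

    Patterns are sets of (row, col) black-pixel coordinates.
    """
    shifted = {(y + dy, x + dx) for (y, x) in b}
    return len(a ^ shifted)  # symmetric difference = differing pixels


def is_similar(a, b, max_shift=1, max_mismatch_ratio=0.1):
    """Overlay b on a at every small offset and keep the best alignment.

    The threshold values are illustrative assumptions, not from the source.
    """
    best = min(mismatch(a, b, dy, dx)
               for dy in range(-max_shift, max_shift + 1)
               for dx in range(-max_shift, max_shift + 1))
    return best <= max_mismatch_ratio * max(len(a), len(b), 1)
```

A fuller version would also look at where the differing pixels fall and what shape they form, as the paragraph above describes, instead of relying on their number alone.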

[0018] FIG. 10 shows an example in which similar patterns are detected among the cut-out patterns and symbols are assigned. Reference numeral 801 denotes a document image consisting of English characters, 802 the templates, and 803 the symbols (template numbers). Connected components (character patterns) are cut out of the document image 801. In the example shown, the connected component "H" is cut out first and pattern-matched against the patterns registered in the templates 802. Since nothing is registered yet, there is no match, and the connected component "H" is newly registered as a template. The connected component "H" is represented by the symbol "1" (its template number), which is stored in memory.

[0019] Next, the connected component "e" is cut out. It also matches none of the registered patterns, so it is newly registered as a template and represented by the symbol "2". The connected components "t", "o", "l", "d", and "m" are processed in the same way: each is registered as a template and assigned one of the symbols "3" through "7", which are stored in memory.

[0020] Then, when the "e" that follows "m" is cut out, this connected component matches the registered pattern "e" and is therefore not registered as a new template. However, the template is updated with a pattern formed by averaging the registered pattern and the matched pattern (or by taking a representative of the two). In this way, the template for a group of similar patterns is kept up to date as their representative pattern.

[0021] The already determined symbol "2" is assigned to this pattern. Processing continues in the same way; once the digits up to "9" have been used as symbols, the characters "h" and "a" are assigned the symbols "a" and "b", respectively.

[0022] In this way, the document image is decomposed into a symbol string (a sequence of template numbers) and the pattern information of each template. Although not directly related to the present invention, a predictive coding unit then predictively encodes this information, compressing the image at a high compression ratio.
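The symbol assignment walked through for FIG. 10 can be mimicked with a toy sketch in which plain characters stand in for the cut-out image patterns and exact equality stands in for the similarity test. The names `SYMBOLS` and `symbolize` are hypothetical, and the symbol alphabet (digits 1 to 9 followed by lowercase letters) is inferred from the description above.

```python
import string

# Hypothetical symbol alphabet: "1".."9", then "a", "b", ... as in FIG. 10.
# This toy version supports at most 35 distinct templates.
SYMBOLS = "123456789" + string.ascii_lowercase


def symbolize(patterns):
    """Replace each pattern by its template symbol and count occurrences."""
    templates = {}  # pattern -> assigned symbol
    counts = {}     # symbol -> number of patterns it represents
    sequence = []
    for p in patterns:
        if p not in templates:
            # New pattern: register it and give it the next unused symbol.
            templates[p] = SYMBOLS[len(templates)]
        s = templates[p]
        counts[s] = counts.get(s, 0) + 1
        sequence.append(s)
    return "".join(sequence), templates, counts


sequence, templates, counts = symbolize("Hetoldme")
print(sequence)  # 12345672
```

The second "e" reuses symbol "2", so the count for that symbol becomes 2 while every other symbol has a count of 1.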

[0023] <Embodiment 1> FIG. 2 is a processing flowchart for Embodiment 1 of the present invention. An image is input from the image input unit 101 (step 201). The connected component extraction unit 102 extracts connected components from the input image data (step 202) and outputs them to the pattern matching unit 103.

[0024] The pattern matching unit 103 pattern-matches each extracted connected component against the templates registered in the template memory 108 (step 203) and decides whether the component is a new pattern (step 204). If it is new, the unit assigns it a new symbol (for example, a number), registers it in the template memory 108, and registers the symbol in the symbol string memory 109 (steps 206 and 205).

[0025] If the component matches an existing template, the unit adds 1 to the count of that template's symbol in the symbol string memory 109 and stores the result there (step 205). The pattern matching unit 103 performs this matching for every connected component in the input image (steps 203 to 207). While unprocessed connected components remain (NO at step 207), processing returns to step 203; once all connected components have been processed (YES at step 207), processing proceeds to step 208.

[0026] The frequency distribution calculation unit 104 calculates the frequency distribution of the symbols stored in the symbol string storage memory 109 (step 208), and the document identification unit 105 decides between Japanese and English text based on the calculated distribution (step 209).
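The loop of steps 203 to 208 might be sketched as below. This is a minimal sketch, not the patented implementation: `count_symbols` and `frequency_distribution` are hypothetical names, the similarity test is passed in as a caller-supplied function `similar`, and the template update by averaging is omitted.

```python
def count_symbols(patterns, similar):
    """Steps 203-207: match each pattern against templates and count symbols.

    `similar(pattern, template)` is a caller-supplied similarity test.
    """
    templates = []  # templates[i] is the representative pattern for symbol i
    counts = []     # counts[i] is how many patterns matched symbol i
    for pattern in patterns:
        for i, template in enumerate(templates):
            if similar(pattern, template):
                counts[i] += 1         # known pattern: update the count
                break
        else:
            templates.append(pattern)  # new pattern: register a template
            counts.append(1)
    return templates, counts


def frequency_distribution(counts):
    """Step 208: relative appearance frequency of each symbol."""
    total = sum(counts)
    return [c / total for c in counts] if total else []
```

With characters standing in for patterns, `count_symbols(list("abca"), lambda p, t: p == t)` yields three templates with counts `[2, 1, 1]`, and `frequency_distribution` turns those counts into shares of the total.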

[0027] FIG. 3 shows a concrete example of the pattern frequency distributions of a Japanese document and an English document. In the Japanese document, many kinds of templates occur at low frequency, whereas in the English document a small number of templates occur at high frequency. English documents are generally said to consist of about 80 kinds of characters, including letters and digits. Japanese, by contrast, has 100 katakana and hiragana alone, and well over 1000 kinds of characters once kanji are included. The symbol appearance frequencies therefore differ greatly between Japanese and English and are an effective feature for identifying documents; for example, Japanese and English can be distinguished by calculating the variance of the frequency distribution.
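One way to read "the variance of the frequency distribution" is the population variance of the per-template counts; that reading, the function name, and the toy counts below are all assumptions made for illustration, and the text gives no threshold for the decision.

```python
from statistics import pvariance


def count_variance(counts):
    """Population variance of the per-template pattern counts.

    Raises statistics.StatisticsError if `counts` is empty.
    """
    return pvariance(counts)


# Made-up toy counts, not data from the patent.
english_like = [40, 30, 20, 10]   # few templates, each used often
japanese_like = [2, 1, 1, 1, 1]   # many templates, each used rarely

print(count_variance(english_like) > count_variance(japanese_like))  # True
```

In this toy data the English-like counts have the larger variance; a real classifier would compare the value against a threshold chosen from sample documents.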

[0028] When a document contains pictures, the picture regions contain many small patterns because they have been dithered. These numerous small patterns also make it possible to identify whether or not a document contains pictures.

[0029] <Embodiment 2> FIG. 4 is a processing flowchart for Embodiment 2 of the present invention. The processing up to step 307 is the same as in Embodiment 1. In Embodiment 2, when the frequency distribution calculation unit 104 calculates the frequency distribution, it checks the size of each template and excludes small templates whose size is at or below a predetermined threshold (step 308). Document images often contain picture regions, and those regions contain many small patterns. Because such patterns are so numerous, they strongly affect statistics such as the variance of the frequency distribution.

[0030] Because Embodiment 2 excludes small patterns when calculating the frequency distribution, this effect is eliminated, and Japanese and English can be distinguished more accurately even in documents that partly contain pictures.
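The size filter of step 308 could look like the sketch below. The record layout (a dict with `width`, `height`, and `count`), the function name, and the choice of the larger bounding-box side as the size measure are hypothetical; the text only says that templates at or below a predetermined size are excluded.

```python
def drop_small_templates(templates, min_size):
    """Keep templates whose larger bounding-box side exceeds `min_size`.

    Templates with size at or below the threshold are excluded, which is
    meant to remove dither dots from picture regions.
    """
    return [t for t in templates
            if max(t["width"], t["height"]) > min_size]


templates = [
    {"symbol": "1", "width": 12, "height": 20, "count": 5},   # a character
    {"symbol": "2", "width": 2, "height": 2, "count": 300},   # a dither dot
    {"symbol": "3", "width": 3, "height": 9, "count": 4},     # a thin character
]
kept = drop_small_templates(templates, min_size=3)
print([t["symbol"] for t in kept])  # ['1', '3']
```

Using the larger side keeps thin characters such as "l" or "i" while dropping dots that are small in both directions.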

[0031] <Embodiment 3> FIG. 5 is a processing flowchart for Embodiment 3 of the present invention. The processing up to step 408 is the same as in Embodiment 1. In Embodiment 3, the document identification unit 105 identifies the document using, as the feature, the average number of patterns represented by one template (step 409). The number of patterns per template is obtained by referring to the symbol string storage memory 109, in which the count (number of patterns) is recorded for each symbol (template). In English, the more frequently a character appears, the more patterns its template represents. In Japanese, the number of patterns per template is smaller than in English, so the two are easy to tell apart.
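The feature of step 409 is a single division, sketched here with a hypothetical function name and made-up toy counts.

```python
def mean_patterns_per_template(counts):
    """Average number of patterns represented by one template."""
    return sum(counts) / len(counts) if counts else 0.0


# Made-up toy counts, not data from the patent.
print(mean_patterns_per_template([40, 30, 20, 10]))  # 25.0
print(mean_patterns_per_template([2, 1, 1, 1, 1]))   # 1.2
```

A document would then be labeled by comparing this average against a threshold, which the text leaves unspecified.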

[0032] <Embodiment 4> FIG. 6 is a processing flowchart for Embodiment 4 of the present invention. The processing up to step 508 is the same as in Embodiment 1. In Embodiment 4, the document identification unit 105 identifies the document using the number of template types (step 509). This number is obtained from the symbol string storage memory 109 as the number of distinct symbols. Because English text is composed of about 80 kinds of characters, it needs far fewer template types than Japanese, so Japanese and English can easily be distinguished by examining that number.

[0033] <Embodiment 5> FIG. 7 is a processing flowchart for Embodiment 5 of the present invention. The processing up to step 608 is the same as in Embodiment 1. In Embodiment 5, the document is identified using the number of template types that have a high appearance frequency. The appearance frequency is obtained by referring to the counts in the symbol string storage memory 109. In English documents the symbol appearance frequencies are skewed; for example, several patterns have an appearance frequency above 2%, whereas this is rare in Japanese. Counting the number of templates with a high appearance frequency (step 609) therefore makes it easy to distinguish Japanese from English.
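Step 609 can be sketched as a count of templates whose share of all patterns exceeds a cutoff. The 2% default follows the example in the paragraph above; the function name and the toy counts are hypothetical.

```python
def count_frequent_templates(counts, min_share=0.02):
    """Number of templates whose share of all patterns exceeds `min_share`."""
    total = sum(counts)
    if total == 0:
        return 0
    return sum(1 for c in counts if c / total > min_share)


# Made-up toy counts: two dominant templates and two rare ones.
print(count_frequent_templates([60, 38, 1, 1]))  # 2
```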

[0034] <Embodiment 6> FIG. 8 is a processing flowchart for Embodiment 6 of the present invention. The processing up to step 708 is the same as in Embodiment 1. In Embodiment 6, the document is identified using the total number of patterns in addition to the number of template types that have a high appearance frequency (step 709). The total number of patterns is obtained by summing the counts of all symbols in the symbol string storage memory 109.

[0035] When a partial region of a document is identified, the total number of patterns is small, so the number of templates with a high appearance frequency increases, and identification with the method of Embodiment 5 becomes difficult. The number of high-frequency templates is therefore normalized, for example by multiplying it by the total number of patterns, to remove the effect of a small total number of patterns, so that Japanese and English can be distinguished more accurately.
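Following the multiplication that the paragraph above gives as an example, the Embodiment 6 feature might be computed as below. The function name and the 2% cutoff default are assumptions; other normalizations would fit the wording "for example" equally well.

```python
def normalized_frequent_templates(counts, min_share=0.02):
    """High-frequency template count multiplied by the total pattern count."""
    total = sum(counts)  # total number of patterns
    if total == 0:
        return 0
    frequent = sum(1 for c in counts if c / total > min_share)
    return frequent * total


# Made-up toy counts: 2 frequent templates, 100 patterns in total.
print(normalized_frequent_templates([60, 38, 1, 1]))  # 200
```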

[0036] FIG. 9 plots the features of Embodiment 3 and Embodiment 6 in combination. The vertical axis is the number of templates with an appearance frequency of 2% or more multiplied by the total number of patterns, and the horizontal axis is the average number of patterns represented by one template. The figure shows that Japanese and English are readily linearly separable.

[0037] In the embodiments described above, a dedicated processing unit is provided for each function, but the present invention is not limited to this configuration. For example, the functions may be built into a ROM or the like and executed on a general-purpose processor.

[0038]

[Effects of the Invention] As described above, according to the invention of claim 1, patterns are cut out of a document image and turned into templates, the patterns in the image are replaced with those templates and thereby symbolized, the appearance frequency distribution of the symbols in the document or partial region is calculated, and that distribution is used to identify the type of the document or partial region. Documents can therefore be identified with high accuracy without being affected by differences in font or size.

[0039] According to the invention of claim 2, small patterns are excluded when the appearance frequency distribution is calculated, so documents can be identified with high accuracy even when they contain picture regions.

[0040] According to the invention of claim 3, identification uses the average number of patterns per template, so documents can be identified with high accuracy, unaffected by differences in font or size, with a simpler calculation than in the invention of claim 1.

[0041] According to the invention of claim 4, using a simple measure, the number of template types, allows documents to be identified with higher accuracy without being affected by differences in font or size.

[0042] According to the invention of claim 5, identification uses the number of template types that have a high appearance frequency, so documents can be identified with higher accuracy without being affected by differences in font or size.

[0043] According to the invention of claim 6, the total number of patterns is also taken into account, so documents can be identified with high accuracy, unaffected by differences in font or size, even when the amount of text is relatively small.

[Brief description of the drawings]

FIG. 1 is a configuration diagram common to all embodiments of the present invention.

FIG. 2 is a processing flowchart for Embodiment 1 of the present invention.

FIG. 3 shows a concrete example of the pattern frequency distributions of a Japanese document and an English document.

FIG. 4 is a processing flowchart for Embodiment 2 of the present invention.

FIG. 5 is a processing flowchart for Embodiment 3 of the present invention.

FIG. 6 is a processing flowchart for Embodiment 4 of the present invention.

FIG. 7 is a processing flowchart for Embodiment 5 of the present invention.

FIG. 8 is a processing flowchart for Embodiment 6 of the present invention.

FIG. 9 plots the features of Embodiment 3 and Embodiment 6 in combination.

FIG. 10 shows a concrete example in which similar patterns are detected among the cut-out patterns and symbols are assigned.

[Explanation of symbols]

101 Image input unit
102 Connected component extraction unit
103 Pattern matching unit
104 Frequency distribution calculation unit
105 Document identification unit
106 Image memory
107 Pattern memory
108 Template memory
109 Symbol string storage memory

Continuation of front page

(58) Fields searched (Int. Cl.7, DB name): G06K 9/20, G06K 9/62, G06T 7/00 - 7/60

Claims

(57) [Claims]

1. A document identification method for identifying the type of an input document image or of a partial region of the document image, wherein a pattern cut out from the document image is matched against registered templates; if the pattern matches no template and is a new pattern, it is registered as a new template and given a new symbol; if the pattern is similar to an already registered template, the pattern is replaced with the symbol assigned to that template and the count of that symbol is updated; the appearance frequency distribution of the symbols in the document image or the partial region of the document image is calculated; and the type of the document or the partial region is identified based on the appearance frequency distribution.

2. The document identification method according to claim 1, wherein when calculating the appearance frequency distribution of the symbols, the calculation is performed excluding a pattern of a predetermined size or less.

3. The document identification method according to claim 1, wherein the type of the document or the partial area is identified based on an average of the number of patterns represented by one template.

4. The document identification method according to claim 1, wherein the type of the document or the partial area is identified based on the number of types of the template.

5. The document identification method according to claim 1, wherein the type of the document or the partial area is identified based on the number of types of templates having a high appearance frequency.

6. The document identification method according to claim 1, wherein the type of the document or the partial area is identified based on the number of types of templates having a high appearance frequency and the total number of patterns.