JP2019195117A

JP2019195117A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2019195117A
Application number: JP2018088124A
Authority: JP
Inventors: Kinya Honda
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-05-01
Filing date: 2018-05-01
Publication date: 2019-11-07

Abstract

To determine with high accuracy whether a PDF file is a scan PDF. SOLUTION: An information processing apparatus includes: first determination means for determining whether a PDF file is a scan PDF, that is, a PDF file created from a scanned image obtained by scanning a document, on the basis of whether the PDF file includes an image of a size equivalent to the page size specified in the PDF file; deriving means for deriving the number of character areas in the image included in the PDF file; and second determination means for determining whether the PDF file is a scan PDF on the basis of the number of character areas derived by the deriving means. SELECTED DRAWING: Figure 5

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

There are various ways to create a file in PDF format (hereinafter, a PDF file). For example, some applications (such as Microsoft Word and PowerPoint (trademark)) can save a file directly as a PDF file when PDF is specified as the save format. Even with an application that cannot specify PDF when saving, the print data to be output can be converted into a PDF file by selecting a PDF conversion driver instead of a printer driver when printing. Furthermore, some devices such as copiers can save an image obtained by reading a paper document with a scanner (hereinafter, a scanned image) in PDF format. Hereinafter, an application that can save a file directly as a PDF file is called PDF creation software, and a PDF file created from a scanned image is called a scan PDF.

A scan PDF contains an image corresponding to the paper size (an image of a size equivalent to the page size specified in the PDF file). FIG. 3 shows an example of a scan PDF. The PDF file 301 contains an image 302 corresponding to the paper size. However, the size of the image in a PDF file is not necessarily identical to the paper size. As in the case of FIG. 3, the image may be smaller than the paper size by the width of the margins. Character strings such as "Invoice" in the PDF file 301 are all part of the picture in the image, and the PDF file 301 contains no character code information.

Software that performs editing, image conversion, printing, and the like on an input PDF file (so-called PDF editing software) is known. Some PDF editing software determines whether an input PDF file is a scan PDF and, depending on the result, changes the subsequent processing or the modules it uses.

For example, when a scan PDF is created, the scanned image obtained by reading the paper document with the scanner may be skewed or contain noise. Some PDF editing software therefore determines whether the input PDF file is a scan PDF and, if it is, automatically performs image processing such as skew correction and noise removal. A technique for accurately determining whether a PDF file is a scan PDF is thus required, because if a PDF file created with PDF creation software is erroneously determined to be a scan PDF, image processing such as skew correction and noise removal is performed needlessly.

Patent Document 1 describes a technique for determining, based on creator information contained in a PDF file (specifically, the Producer value), whether the PDF file is genuine or non-genuine (that is, whether it was created with Adobe software).

JP 2007-156903 A

However, the technique described in Patent Document 1 cannot accurately determine whether a PDF file is a scan PDF. This is because the creator information contained in a PDF file indicates the name of the software or the like used for conversion into PDF format, not whether a scanned image is included.

In view of the above problem, an object of the present invention is to accurately determine whether a PDF file is a scan PDF.

The present invention is an information processing apparatus comprising: first determination means for determining, based on whether a PDF file contains an image of a size equivalent to the page size specified in the PDF file, whether the PDF file is not a scan PDF created from a scanned image obtained by scanning a document; deriving means for deriving the number of character areas in the image contained in the PDF file; and second determination means for determining, based on the number of character areas derived by the deriving means, whether the PDF file is a scan PDF.

According to the present invention, it is possible to accurately determine whether a PDF file is a scan PDF.

FIG. 1 is a block diagram showing the configuration of the copier 100 in the first embodiment.
FIG. 2 is a block diagram showing the configuration of the information processing apparatus 200 in the first embodiment.
FIG. 3 shows an example of a scan PDF.
FIG. 4 shows an example of a PDF file created using PDF creation software.
FIG. 5 is a flowchart of the PDF determination process in the first embodiment.
FIG. 6 shows the result of executing the area division process on an image.
FIG. 7 shows an example of a highly compressed PDF.
FIG. 8 is a flowchart of the PDF determination process in the second embodiment.
FIG. 9 illustrates the method for specifying character areas.

[First Embodiment]
Embodiments of the present invention are described below with reference to the drawings. The following embodiments do not limit the present invention, and not all combinations of the features described in them are essential to the solution provided by the present invention.

<Configuration of copier and information processing apparatus>
The system in this embodiment consists of a copier 100 and an information processing apparatus 200. FIG. 1 is a block diagram showing the configuration of the copier 100 in this embodiment. The copier 100 consists of a scanner unit 101, a transmission/reception unit 102, a printer unit 103, and a control unit 104.

FIG. 2 is a block diagram showing the configuration of the information processing apparatus 200 in this embodiment. The information processing apparatus 200 connected to the copier 100 is, for example, a personal computer (PC) and internally has a CPU, a ROM, and a RAM. The program of the information processing apparatus 200 stored in the ROM is loaded into the RAM, and the CPU executes the loaded program, whereby the processing of each component shown in FIG. 2 is carried out. The reception unit 205, which accepts user instructions, typically includes a keyboard and a mouse but is not limited to this form. Although FIG. 2 shows the display unit and the reception unit as separate components, they may be a single integrated component such as a touch panel.

<Flow for creating a scan PDF from a scanned image>
When the scanner unit 101 of the copier 100 scans a document, a scanned image (also called scanned image data) is generated. The control unit 104 converts the generated scanned image into a scan PDF, and the transmission/reception unit 102 transmits the scan PDF to the information processing apparatus 200. The transmission/reception unit 201 of the information processing apparatus 200 then receives the scan PDF, and the control unit 203 stores the received scan PDF in the storage unit 202. In addition to scan PDFs, the storage unit 202 also stores PDF files that the user has created using PDF creation software.

Via the reception unit 205, the user selects one PDF file from the one or more PDF files stored in the storage unit 202. The control unit 203 then displays that PDF file on the display unit 204.

Before displaying the PDF file on the display unit 204, the control unit 203 determines whether the PDF file selected by the user is a scan PDF. If the result is true, that is, if the selected PDF file is a scan PDF, the control unit 203 performs image processing such as black dot removal and skew correction on the image contained in the PDF file. Alternatively, in this case, a message asking whether to perform such image processing may be displayed on the display unit 204 so that the user can choose whether to perform it.

<Area division process>
In this embodiment, an area division technique is used to specify the character areas in an image. Each step of area division is described in detail below.

(1) Binarization
The control unit 203 binarizes the scanned image to obtain a binary image consisting of black pixels and white pixels. Through binarization, a pixel in the scanned image whose density value is equal to or greater than a predetermined threshold becomes a black pixel, and a pixel whose density value is less than the threshold becomes a white pixel. The following description assumes that the scanned image is 100 DPI, but the resolution of the scanned image is of course not limited to this.
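As a minimal illustration of the binarization step, the following Python sketch thresholds a grid of density values. The function name `binarize` and the default threshold of 128 are assumptions made for this example; the description only states that a predetermined threshold is used.

```python
def binarize(density, threshold=128):
    """Return a binary image: 1 for black pixels, 0 for white pixels.

    `density` is a list of rows of density values (larger = darker).
    The default threshold of 128 is an assumed example value.
    """
    # A pixel at or above the threshold becomes black, otherwise white.
    return [[1 if value >= threshold else 0 for value in row]
            for row in density]


if __name__ == "__main__":
    sample = [[0, 128], [200, 127]]
    print(binarize(sample))
```

Representing black as 1 and white as 0 is only a convention of this sketch; any two distinct values would serve.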

(2) Detection of black pixel blocks
The control unit 203 traces the contours of 8-connected black pixels in the binary image to detect blocks of black pixels that are contiguous in any of eight directions (hereinafter, black pixel blocks). 8-connected means that pixels of the same color (black in this case) are contiguous in any of the eight directions: upper left, left, lower left, down, lower right, right, upper right, and up. By contrast, 4-connected means that pixels of the same color are contiguous in any of the four directions: left, down, right, and up. In the black pixel block detection of (2) in this embodiment, a single black pixel none of whose eight neighboring pixels is black is not detected. A black pixel that has a black pixel among any of its eight neighbors, on the other hand, is detected as a black pixel block together with those neighboring black pixels. Reference numeral 901 in FIG. 9 shows an example of a black pixel block detected by the control unit 203.

The control unit 203 also derives position information for the circumscribed rectangle of each detected black pixel block, specifically the X and Y coordinates of each of the four vertices of the circumscribed rectangle. The X axis extends to the right and the Y axis extends downward. Reference numeral 902 in FIG. 9 shows the circumscribed rectangle of the black pixel block 901. In this specification, "width" refers to the length in the X-axis direction and "height" to the length in the Y-axis direction. Unless otherwise noted, "rectangle" in this specification excludes tilted rectangles and means a rectangle whose four sides are each parallel to either the X axis or the Y axis.
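The detection in (2) and the derivation of the circumscribed rectangle can be sketched as follows. Note that this sketch uses a breadth-first flood fill instead of the contour tracing described in the text; both find the same 8-connected blocks, and the flood fill is simply shorter to write. The function name `find_black_blocks` and the `(left, top, right, bottom)` tuple layout are assumptions for illustration.

```python
from collections import deque

# Offsets of the eight neighbors (upper left, up, upper right, ...).
NEIGHBORS_8 = [(-1, -1), (0, -1), (1, -1), (-1, 0),
               (1, 0), (-1, 1), (0, 1), (1, 1)]


def find_black_blocks(binary):
    """Return circumscribed rectangles of 8-connected black pixel blocks.

    `binary` is a list of rows with 1 = black and 0 = white. Each result
    is (left, top, right, bottom) in pixel coordinates, X to the right and
    Y downward. Isolated single black pixels are skipped.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    rects = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dx, dy in NEIGHBORS_8:
                    nx, ny = cx + dx, cy + dy
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) < 2:
                continue  # a lone black pixel is not treated as a block
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects
```

Because diagonal neighbors count under 8-connectivity, two black pixels that touch only at a corner end up in the same block.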

(3) Detection of table areas
For each detected black pixel block, the control unit 203 determines whether all of the following three conditions are satisfied, and determines a black pixel block that satisfies all three to be a black pixel block constituting the frame lines of a table.

The first condition is that the width of the circumscribed rectangle of the black pixel block is equal to or greater than a predetermined threshold and the height of the circumscribed rectangle is equal to or greater than a predetermined threshold. Here, as an example, it is determined whether both the width and the height are 0.25 cm or more, which corresponds to 100 pixels.

The second condition is that the fill ratio of the black pixel block within its circumscribed rectangle is equal to or less than a predetermined threshold. Here, as an example, it is determined whether the black pixel block occupies 20% or less of the circumscribed rectangle.

The third condition is that the difference between the maximum width of the black pixel block and the width of the circumscribed rectangle, and the difference between the maximum height of the black pixel block and the height of the circumscribed rectangle, are both small. In other words, the difference between the maximum width of the black pixel block and the width of the circumscribed rectangle is equal to or less than a predetermined threshold, and the difference between the maximum height of the black pixel block and the height of the circumscribed rectangle is equal to or less than a predetermined threshold. Here, as an example, it is determined whether both differences are 10 pixels or less.

By determining for each black pixel block whether the first to third conditions above are all satisfied, the control unit 203 determines whether the block constitutes the frame lines of a table, and stores the position information of the circumscribed rectangle of each such block in the storage unit 202. An area of a circumscribed rectangle whose position information is stored in this way is called a table area. In the case shown in FIG. 9, the black pixel block 901 is determined to be a black pixel block constituting the frame lines of a table, and the area of the circumscribed rectangle 902 is detected as a table area. In this embodiment, a black pixel block that satisfies all of the first to third conditions is determined to constitute the frame lines of a table, but the determination condition is not limited to this. For example, a black pixel block that satisfies at least one of the first to third conditions may be determined to constitute the frame lines of a table.
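The three conditions for a table frame can be expressed as a single predicate, sketched below. The function and parameter names are hypothetical, the default thresholds are the example values given in the text (100 pixels, 20%, 10 pixels), and how the maximum width and maximum height of a block are measured is left to the caller because the text does not spell it out.

```python
def is_table_frame(rect_w, rect_h, black_count, max_block_w, max_block_h,
                   min_size=100, max_fill=0.20, max_diff=10):
    """Check the three example conditions for a table-frame block.

    rect_w, rect_h : width and height of the circumscribed rectangle (pixels)
    black_count    : number of black pixels in the block
    max_block_w/h  : maximum width and height of the block itself (pixels)
    """
    # Condition 1: the circumscribed rectangle is large enough.
    if rect_w < min_size or rect_h < min_size:
        return False
    # Condition 2: the block fills only a small part of its rectangle,
    # as frame lines do.
    if black_count / (rect_w * rect_h) > max_fill:
        return False
    # Condition 3: the block spans nearly the full rectangle in both axes.
    return (rect_w - max_block_w) <= max_diff and \
           (rect_h - max_block_h) <= max_diff
```

For instance, a one-pixel-wide frame of 200 x 100 pixels has 596 black pixels, a fill ratio of about 3%, and spans the whole rectangle, so it passes all three checks, whereas a solid black block of the same size fails the second.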

(4) Identification of recognition cells
The control unit 203 identifies the recognition cells inside a table area. A "recognition cell" is the circumscribed rectangle of a white pixel block inside the table area. To identify recognition cells, white pixel blocks must first be detected by tracing the contours of the white pixels inside the table area. The control unit 203 then determines, for each detected white pixel block, whether the following three conditions are satisfied, and identifies the circumscribed rectangle of a white pixel block that satisfies all three as a recognition cell.

The first condition is that the width of the circumscribed rectangle of the white pixel block is equal to or greater than a predetermined threshold and the height of the circumscribed rectangle is equal to or greater than a predetermined threshold. Here, as an example, it is determined whether both the width and the height are 20 pixels or more.

The third condition is that the difference between the maximum width of the white pixel block and the width of the circumscribed rectangle, and the difference between the maximum height of the white pixel block and the height of the circumscribed rectangle, are both small. In other words, the difference between the maximum width of the white pixel block and the width of the circumscribed rectangle is equal to or less than a predetermined threshold, and the difference between the maximum height of the white pixel block and the height of the circumscribed rectangle is equal to or less than a predetermined threshold. Here, as an example, it is determined whether both differences are 5 pixels or less.

Reference numerals 903 and 904 in FIG. 9 show recognition cells identified by the control unit 203. The control unit 203 stores the position information of the identified recognition cells in the storage unit 202.

In this embodiment, the circumscribed rectangle of a white pixel block that satisfies all of the first to third conditions is identified as a recognition cell, but the determination condition is not limited to this. For example, the circumscribed rectangle of a white pixel block that satisfies at least one of the first to third conditions may be identified as a recognition cell.

(5) Identification of character areas in recognition cells
For each recognition cell, the control unit 203 determines whether the cell contains a black pixel block surrounded by the white pixel block inscribed in that cell. If it does, the control unit 203 sets a circumscribed rectangle for every such black pixel block.

Furthermore, when multiple circumscribed rectangles have been set in one recognition cell, the control unit 203 detects circumscribed rectangles that are close to one another based on the distances between them. Specifically, the control unit 203 selects the circumscribed rectangles one at a time and detects any circumscribed rectangle whose distance from the selected one is equal to or less than a predetermined threshold. Here, as an example, it is determined whether the distance between circumscribed rectangles is 0.05 cm or less, which corresponds to 20 pixels.

When such a circumscribed rectangle is detected, the control unit 203 merges the selected circumscribed rectangle with the one detected for it. That is, the control unit 203 sets a new circumscribed rectangle that circumscribes both rectangles and deletes the selected circumscribed rectangle and the detected circumscribed rectangle.

After the new circumscribed rectangle has been set and the two circumscribed rectangles have been deleted, the control unit 203 again selects the circumscribed rectangles in that recognition cell one at a time from the beginning and merges any whose mutual distance is equal to or less than the predetermined threshold. This processing is repeated; that is, circumscribed rectangles are merged until no pair remains whose mutual distance is equal to or less than the predetermined threshold.
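The repeated merging can be sketched as a loop that restarts from the beginning after every merge. The text does not define how the distance between two circumscribed rectangles is measured, so this sketch assumes the Euclidean gap between their nearest edges (zero if they touch or overlap). The function names and the `(left, top, right, bottom)` layout are likewise assumptions; the default of 20 pixels is the example threshold from the text.

```python
def rect_gap(a, b):
    """Assumed distance metric: Euclidean gap between two rectangles."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)  # horizontal gap, 0 if overlapping
    dy = max(b[1] - a[3], a[1] - b[3], 0)  # vertical gap, 0 if overlapping
    return (dx * dx + dy * dy) ** 0.5


def merge_close_rects(rects, max_gap=20):
    """Merge rectangles within one recognition cell until none are close."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                a, b = rects[i], rects[j]
                if rect_gap(a, b) <= max_gap:
                    # New rectangle circumscribing both; the originals are
                    # deleted and the scan restarts from the beginning.
                    union = (min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3]))
                    rects = [r for k, r in enumerate(rects)
                             if k not in (i, j)]
                    rects.append(union)
                    merged = True
                    break
            if merged:
                break
    return rects
```

Restarting after each merge matters because the merged rectangle may now be close to a rectangle that neither of its parts was close to.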

As described above, in this embodiment, circumscribed rectangles within the same recognition cell are merged, but circumscribed rectangles in different recognition cells are not merged with one another.

The circumscribed rectangles that remain set after the above processing are called character areas, and the above processing is called identification of character areas in recognition cells. The control unit 203 stores the position information of each character area inside a recognition cell in the storage unit 202 in association with that recognition cell.

In the case of FIG. 9, reference numerals 905 and 906 show character areas. The position information of the character area 905 is stored in the storage unit 202 in association with the recognition cell 903, and the position information of the character area 906 is stored in the storage unit 202 in association with the recognition cell 904.

<PDF determination process>
In this embodiment, to determine whether a PDF file is a scan PDF, it is first determined whether the PDF file contains an image of a size equivalent to the paper size specified in the header or the like of the PDF file, that is, an image corresponding to the paper size. If the PDF file contains such an image, the image is divided into areas to specify its character areas. Whether the PDF file is a scan PDF is then determined based on the information about those character areas. The "paper size" referred to here is the value of CropBox, MediaBox, or the like set for each page in the PDF format. The process of determining whether a PDF file is a scan PDF in this embodiment (hereinafter, the PDF file determination process) is described below with reference to FIG. 5.

In step S501, the control unit 203 analyzes the PDF file to be examined. This analysis yields information about the objects contained in the PDF file, such as images, character strings, and graphics. Hereinafter, "step S..." is abbreviated simply as "S...".

In S502, the control unit 203 determines whether the PDF file contains an image corresponding to the paper size. Specifically, it obtains the page width x [mm] and height y [mm] from the value of CropBox, MediaBox, or the like, and then determines whether the PDF file contains an image whose width is x - 5 [mm] or more and whose height is y - 5 [mm] or more. Here, 5 is a buffer (that is, the allowable error). If the determination result of this step is false, the process proceeds to S503; if it is true, the process proceeds to S504.
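The size comparison in S502 can be sketched as below. The function names are hypothetical, image sizes are assumed to be already available as (width, height) pairs in millimetres, and `points_to_mm` assumes the page box is expressed in the default PDF user-space unit of 1/72 inch.

```python
def points_to_mm(points):
    """Convert default PDF user-space units (1/72 inch) to millimetres."""
    return points * 25.4 / 72.0


def has_page_sized_image(page_w_mm, page_h_mm, image_sizes_mm,
                         tolerance_mm=5.0):
    """Return True if any image is at least (x - 5) by (y - 5) mm.

    image_sizes_mm is an iterable of (width_mm, height_mm) pairs, one per
    image object found on the page (an assumed input format).
    """
    min_w = page_w_mm - tolerance_mm
    min_h = page_h_mm - tolerance_mm
    return any(w >= min_w and h >= min_h for w, h in image_sizes_mm)


if __name__ == "__main__":
    # An A4 page box of 595 x 842 points is roughly 210 x 297 mm.
    page_w = points_to_mm(595)
    page_h = points_to_mm(842)
    print(has_page_sized_image(page_w, page_h, [(206.0, 293.0)]))
```

A result of False corresponds to proceeding to S503, and True to proceeding to S504.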

The reason for determining in S502 whether the PDF file contains an image corresponding to the paper size is as follows. Consider the case where a device such as a multifunction peripheral creates a scan PDF from a scanned image simply by specifying PDF format. In this case, the resulting PDF file normally contains exactly one image object corresponding to the paper size and no other objects. Therefore, if the determination of whether a file is a scan PDF is to be based on the objects contained in the PDF file, a natural condition is whether the PDF file contains an image object corresponding to the paper size. For this reason, this embodiment determines in S502 whether the PDF file contains an image corresponding to the paper size.

Note, however, that some PDF files created with PDF creation software such as Microsoft PowerPoint (trademark) also contain an image corresponding to the paper size. FIG. 4 shows an example of a PDF file created with PowerPoint (trademark). As shown in FIG. 4(b), the PDF file 401 contains an image 405 corresponding to the paper size as a background image. Reference numerals 402 to 404 in FIG. 4(a) denote character strings that carry character code information, not images.

Suppose that whether a PDF file is a scan PDF were determined solely on the condition of whether the PDF file contains an image (object) corresponding to the paper size. In that case, because the PDF file 401 contains the image 405 corresponding to the paper size, the PDF file 401 would be erroneously determined to be a scan PDF. This embodiment therefore improves the determination accuracy by providing the subsequent determination of S507 in addition to the determination of S502.

This embodiment takes into account the possibility that margins are introduced when a PDF file is created from a scanned image. That is, even if the page size and the image size do not match exactly, the PDF file is regarded as containing a scanned image as long as it contains an image reasonably close to the page size. The allowable error is set to 5 [mm] here, but this value may be changed as appropriate for the user's scanning environment.

In S503, the control unit 203 determines that the PDF file under examination is not a scan PDF, and the series of processes ends.

In S504, the control unit 203 extracts from the PDF file the image determined in S502 to be equivalent to the paper size.

In S505, the control unit 203 executes the above-described region division processing on the image acquired in S504. FIG. 6 shows an example of the result. In FIG. 6, each rectangle drawn with a broken line indicates a region determined to be a character area by the region division processing; one example is the character area 601 containing "Invoice". Reference numerals 602 to 605 likewise denote character areas, as do the broken-line rectangles that have no reference numeral.

In S506, the control unit 203 counts the character areas in the image. In the case of FIG. 6, for example, 28 character areas are counted.

In S507, the control unit 203 determines whether the number of character areas obtained in S506 is equal to or greater than a predetermined number. Here, as an example, the predetermined number (threshold) is 5; it may be changed as appropriate for the user's environment. If the determination result is true (the number of character areas is equal to or greater than the predetermined number), the process proceeds to S508. If the result is false (the number of character areas is less than the predetermined number), the process proceeds to S509. For the PDF file shown in FIG. 6, for example, the number of character areas (28) is not less than the threshold (5), so the process proceeds to S508.

In S508, the control unit 203 determines that the PDF file under examination is a scan PDF, and the series of processes ends.

In S509, the control unit 203 determines that the PDF file under examination is not a scan PDF, and the series of processes ends.
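The flow from S502 to S509 can be summarized in a short sketch. It is an illustration under stated assumptions only: the function name `is_scan_pdf`, the convention of passing `None` when no page-sized image was found, and the `count_character_areas` callback (standing in for image extraction, region division, and counting) are all hypothetical. The default threshold of 5 and the count of 28 for FIG. 6 come from the text.

```python
from typing import Callable, Optional


def is_scan_pdf(page_sized_image: Optional[object],
                count_character_areas: Callable[[object], int],
                threshold: int = 5) -> bool:
    """Two-stage decision modelled on S502 to S509.

    page_sized_image is None when S502 found no image equivalent to
    the paper size (hypothetical convention for this sketch).
    """
    if page_sized_image is None:
        # S502 false -> S503: not a scan PDF
        return False
    # S504 to S506: region division and counting, delegated to the callback
    count = count_character_areas(page_sized_image)
    # S507: true -> S508 (scan PDF), false -> S509 (not a scan PDF)
    return count >= threshold


# FIG. 6 case: 28 character areas are counted in the page-sized image.
print(is_scan_pdf("fig6-image", lambda image: 28))  # True
```

Keeping the region division behind a callback makes the threshold logic testable without committing to any particular layout analysis method.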

S507 to S509 are now described in detail. In a scan PDF, the image often contains a plurality of character areas, as in the case shown in FIG. 6. In particular, almost every scan PDF created by scanning a form or similar document contains a plurality of character areas in its image.

In contrast, in a PDF file created with PDF creation software, the image equivalent to the paper size is, as in the case shown in FIG. 4, a background image intended to improve appearance, and it often contains no character areas. For example, the character strings 402 to 404 in FIG. 4(a) are stored in the PDF file not within the image but as objects containing character code information. In other words, when the image equivalent to the paper size contained in a PDF file includes a certain number of character areas, the PDF file is likely to be a scan PDF. This is why S507 to S509 are performed in this embodiment.

In the form described above, the PDF file is determined to be a scan PDF when the image equivalent to the paper size contains five or more character areas (YES in S507, then S508), which leaves some margin. The reason is that the image in a PDF file created with PDF creation software may also contain a few character areas.

The above is the PDF file determination processing of this embodiment.

<Effects of this embodiment>
This embodiment makes it possible to determine with high accuracy whether a PDF file is a scan PDF.

[Second Embodiment]
The first embodiment described the case where the PDF file contains a single image (image 302 in FIG. 3 and image 405 in FIG. 4(b)). This embodiment, by contrast, assumes that the PDF file contains a plurality of images.

When a scan PDF is created with a copying machine, the scanned image may be divided into a plurality of images that are stored in the file. FIG. 7 shows an example of such a case: it shows the plurality of images contained in the scan PDF obtained by scanning the same document that was scanned to obtain the PDF file 301 shown in FIG. 3.

In the case shown in FIG. 3, the only image contained in the PDF file 301, which is a scan PDF, is the scanned image 302. In the case shown in FIG. 7, by contrast, the images contained in the scan PDF are divided into a background image 701, which is equivalent to the paper size as shown in FIG. 7(a), and character images 702 to 705 as shown in FIG. 7(b).

The images stored in the file are divided in this way in order to reduce the file size of the scan PDF without degrading the image quality of character strings when the scan PDF is rendered. As shown in FIG. 7, the scanned image is separated into a background image and character images, and only the background image is compressed at a high compression ratio. This reduces the file size while maintaining the readability of the character strings. A scan PDF compressed in this manner is hereinafter called a "high-compression PDF". A copying machine that supports high-compression PDF may create one according to the user's selection.

This embodiment describes a method for determining whether a PDF file is a scan PDF when the PDF file is a high-compression PDF.

<PDF determination processing>
The PDF file determination processing of this embodiment is described below with reference to FIG. 8.

S801 to S803 are the same as S501 to S503 (FIG. 5) of the first embodiment. In this embodiment, however, the process proceeds to S804 when the determination result of S802 is true.

In S804, the control unit 203 extracts all images from the PDF file. In the case shown in FIG. 7, for example, the background image 701 and the character images 702 to 705 are extracted.

In S805, the control unit 203 creates one image by merging all the images extracted in S804.

In S806, the above-described region division processing is executed on the image created in S805. S806 to S810 are the same as S505 to S509 (FIG. 5) of the first embodiment. The above is the PDF file determination processing of this embodiment.
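The merge of S805 is not specified in detail here, so the following is only one possible sketch. It assumes that each extracted image is a grid of 0 to 255 grey values with a known pixel offset on the page, and that keeping the darker value at each pixel is an acceptable way to overlay character images on the background. The function name `merge_images` and the layer format are hypothetical.

```python
def merge_images(canvas_size, layers, white=255):
    """Overlay layers onto one white canvas and return the merged image.

    canvas_size is (width, height) in pixels. Each layer is a tuple
    (x, y, pixels), where pixels is a list of rows of grey values.
    Keeping the darker value per pixel is an assumption of this sketch.
    """
    width, height = canvas_size
    canvas = [[white] * width for _ in range(height)]
    for x0, y0, pixels in layers:
        for dy, row in enumerate(pixels):
            for dx, value in enumerate(row):
                x, y = x0 + dx, y0 + dy
                # Ignore any part of a layer that falls outside the canvas.
                if 0 <= x < width and 0 <= y < height:
                    canvas[y][x] = min(canvas[y][x], value)
    return canvas


# A light-grey background (like image 701) and one small character image.
background = (0, 0, [[200, 200, 200, 200],
                     [200, 200, 200, 200]])
characters = (1, 0, [[0, 0],
                     [0, 255]])
merged = merge_images((4, 2), [background, characters])
print(merged)  # [[200, 0, 0, 200], [200, 0, 200, 200]]
```

The merged grid would then be passed to region division in the same way as the single image of the first embodiment.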

<Reason for merging the plurality of images>
The reason for merging the plurality of images in S805 is described below.

In a high-compression PDF, the scanned image is separated into a background image and character images. The separation, however, is not 100% accurate. For example, noise introduced into the image during scanning may split a character string, so that part of the string ends up in the background image. In that case, the accuracy of identifying character areas may deteriorate.

In addition, when the image contains a table, the ruled lines of the table are included in the background image. Region division may nevertheless identify character areas more accurately when the ruled lines and the character strings inside the table are merged.

For these reasons, the images are merged in S805 in order to improve the accuracy with which region division identifies character areas.

<Effects and modifications of this embodiment>
This embodiment makes it possible to determine accurately whether a PDF file is a scan PDF even when the PDF file is a high-compression PDF.

In the form described above, character areas are identified by executing region division processing on the single image obtained by merging the plurality of images extracted from the PDF file, and whether the PDF file is a scan PDF is determined based on the number of character areas identified. This embodiment is not limited to that form, however. For example, character areas may be identified by executing region division processing on each of the images extracted from the PDF file, and whether the PDF file is a scan PDF may then be determined based on the sum of the numbers of character areas found in the individual images.
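The per-image variation can be sketched in a few lines. The function name `is_scan_pdf_by_sum`, the `count_character_areas` callback, and the example counts are hypothetical; the text states only that the per-image counts are summed. Reusing the threshold of 5 from the first embodiment is likewise an assumption.

```python
def is_scan_pdf_by_sum(images, count_character_areas, threshold=5):
    """Run region division on each image separately and sum the counts."""
    total = sum(count_character_areas(image) for image in images)
    return total >= threshold


# Hypothetical per-image counts for a background and two character images.
counts = {"background": 1, "characters-a": 2, "characters-b": 2}
print(is_scan_pdf_by_sum(counts, lambda name: counts[name]))  # True
```

Compared with merging, this form skips the compositing step but cannot benefit from cases where ruled lines and characters are easier to analyze together.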

Besides high-compression PDFs that represent the character portions with a plurality of images as shown in FIG. 7, there are also high-compression PDFs in which the scanned image is separated into a character portion and a background and the separated character portion is represented by a single image. This embodiment is also applicable when the PDF file is such a high-compression PDF.

[Other Embodiments]
The present invention can also be realized by processing in which a program that implements one or more functions of the above embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

Claims

An information processing apparatus comprising:
first determination means for determining, based on whether a PDF file includes an image having a size equivalent to the page size defined in the PDF file, whether the PDF file is a scan PDF created from a scanned image acquired by scanning a document;
deriving means for deriving the number of character areas in the image included in the PDF file; and
second determination means for determining whether the PDF file is a scan PDF based on the number of character areas derived by the deriving means.

The information processing apparatus according to claim 1, wherein the determination result by the first determination means is true when the PDF file does not include an image having a size equivalent to the page size defined in the PDF file, and is false when the PDF file includes an image having a size equivalent to the page size defined in the PDF file.

The information processing apparatus according to claim 1, further comprising specifying means for specifying a character area in the image by performing region division processing on the image when the determination result by the first determination means is false and the number of images included in the PDF file is one.

The information processing apparatus according to claim 1 or 2, further comprising creation means for creating one image by merging a plurality of images when the determination result by the first determination means is false and the PDF file includes the plurality of images.

The information processing apparatus according to claim 4, further comprising specifying means for specifying a character area in the one image by performing region division processing on the one image created by the creation means.

The information processing apparatus according to claim 3, wherein the deriving means derives the number of character areas by counting the character areas specified by the specifying means.

The information processing apparatus according to any one of claims 1 to 6, wherein the second determination means determines that the PDF file is a scan PDF when the number of character areas is equal to or greater than a predetermined number, and determines that the PDF file is not a scan PDF when the number of character areas is less than the predetermined number.

The image having a size equivalent to the page size defined by the PDF file includes an image having the same size as the page size and an image smaller than the page size. The information processing apparatus according to item.

An information processing method comprising:
a step of determining, based on whether a PDF file includes an image having a size equivalent to the page size defined in the PDF file, whether the PDF file is a scan PDF created from a scanned image acquired by scanning a document;
a step of deriving the number of character areas in the image included in the PDF file; and
a step of determining whether the PDF file is a scan PDF based on the derived number of character areas.

A program for causing a computer to execute the method according to claim 9.