JP2023005163A - Document processing method and document processor

Info

Publication number: JP2023005163A
Application number: JP2021106909A
Authority: JP
Inventors: 浩一馬養; Koichi Umakai
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2023-01-18

Abstract

To provide a mechanism for accurately determining document boundaries, for documents of various configurations, from image data obtained by reading the documents. SOLUTION: A document creation method comprises the steps of acquiring a document as image data page by page and detecting, for each page of the acquired image data, a sign of a boundary page that satisfies a predetermined condition. The method further comprises the steps of identifying one or more boundary pages that form boundaries of the document, on the basis of the page containing the detected sign and information on the pages adjacent to it, and creating document data divided at the positions of the identified boundary pages. SELECTED DRAWING: Figure 6

Description

The present invention relates to a document processing method and a document processing apparatus.

In a conventional document management system, an operator visually checks the contents of a group of document images read by a scanner and uses a tool to specify the document break positions, so that the group of document images is divided into individual documents and registered. Determining the break positions therefore requires considerable operator effort and hinders the document registration work.

As one countermeasure, Patent Documents 1 and 2 propose a method of dividing documents semi-automatically: before the documents are read by a scanner, separator sheets that a machine can easily recognize are inserted between them. As another countermeasure, Patent Document 3 proposes a method that regards a page with a blank area at its bottom as the last page of a document and judges whether the reading order of the originals is correct according to whether the page following that last page has a heading area. Patent Document 4 detects document discontinuity using information within the page of interest (the presence or absence of reserved words, the height of the blank area, and the character recognition result in the header area) together with information obtained by comparison with the previous page (word similarity and differences in paper size, average character size, and character string direction). Detecting document discontinuity in this way makes it possible to identify the first page of a document.

Patent Document 1: Japanese Patent No. 2962961
Patent Document 2: JP 2005-141425 A
Patent Document 3: JP 2020-120308 A
Patent Document 4: JP 2002-312385 A

However, the conventional techniques described above have the following problems. The method of dividing documents with separator sheets requires the sheets to be inserted between the documents by hand in advance. The method that regards a page with a blank area at its bottom as the last page of a document, and the method that identifies the first page by detecting document discontinuity, misjudge the document break when some of the tables or figures are placed on the pages following the last page of the body text. Likewise, in a bound document, blank pages for page alignment may be inserted after the body text, and the document break is misjudged in this case as well.

The present invention has been made in view of at least one of the above problems and provides a mechanism for accurately determining document boundaries, for documents of various configurations, from image data obtained by reading the documents.

The present invention is, for example, a document creation method comprising: an acquisition step of acquiring a document as image data page by page; a detection step of detecting, for each page of the acquired image data, a sign of a boundary page that satisfies a predetermined condition; an identification step of identifying one or more boundary pages that form boundaries of the document, based on the page containing the sign of a boundary page detected in the detection step and information on the pages adjacent to that page; and a creation step of creating document data divided at the positions of the one or more boundary pages identified in the identification step.

According to the present invention, document boundaries can be accurately determined, for documents of various configurations, from image data obtained by reading the documents.

FIG. 1 is a diagram showing an example hardware configuration of a document processing apparatus according to an embodiment.
FIG. 2 is a diagram showing an example functional configuration of the document processing apparatus according to an embodiment.
FIG. 3 is a diagram explaining a method of detecting a blank area according to an embodiment.
FIG. 4 is a diagram explaining a method of identifying document boundaries according to an embodiment.
FIG. 5 is a diagram explaining a method of identifying document boundaries according to an embodiment.
FIG. 6 is a flowchart of processing performed by the document processing apparatus according to an embodiment.

Embodiments are described in detail below with reference to the accompanying drawings. The following embodiments do not limit the invention according to the claims. Although the embodiments describe a plurality of features, not all of these features are essential to the invention, and the features may be combined in any way. In the accompanying drawings, identical or similar components are given the same reference numerals, and redundant description is omitted.

<First Embodiment>
<Hardware Configuration of Document Processing Apparatus>
An embodiment of the present invention is described below. First, an example of the hardware configuration of the document processing apparatus according to this embodiment is described with reference to FIG. 1. Here, an image processing apparatus is used as an example of the document processing apparatus. In this embodiment, the document data handled by the document processing apparatus is described using, as an example, document image page data (hereinafter also simply called image data) obtained by reading a document with a scanner. This is not intended to limit the present invention; the apparatus may, for example, process document data input from outside. The document processing apparatus therefore need not include a reading unit such as a scanner.

The document processing apparatus 100 includes a CPU 101, a RAM 102, a ROM 103, a storage device 104, a scanner 105, and a communication I/F 106, and these devices can exchange data with one another via a system bus 111. The CPU 101 executes programs stored in the ROM 103 and programs such as an OS (operating system) and applications loaded from the storage device 104 into the RAM 102. That is, by executing programs stored in a readable storage medium, the CPU 101 functions as the processing units that execute the processing of the flowcharts described later. The RAM 102 is the main memory of the CPU 101 and functions as a work area and the like. The scanner 105 reads a document page by page as document image page data and generates the image data used to identify boundary pages between documents. In the present invention, the document data used to identify boundary pages is not limited to image data that the scanner 105 reads from an original; it may be image data received from an external device. The communication I/F 106 is an interface for communicating with external devices via a network and, for example, receives image data to be processed from an external device. The system bus 111 is a communication path between the hardware elements.

<Functional Blocks>
Next, the functional configuration of the document processing apparatus 100 according to this embodiment is described with reference to FIG. 2. Each functional unit in the figure is implemented by the CPU 101 executing programs stored in the ROM 103 or programs such as applications loaded from the storage device 104 into the RAM 102. The execution result of each process is held in the RAM 102.

As its functional configuration, the document processing apparatus 100 includes a page acquisition unit 201, a blank area detection unit 202, a boundary page identification unit 203, an electronic document creation unit 204, an auxiliary information area detection unit 205, and an additional information confirmation unit 206. The page acquisition unit 201 acquires page data, which is the image data to be processed, from the scanner 105. Alternatively, the page acquisition unit 201 may acquire the image data to be processed from an external device via the communication I/F 106.

The blank area detection unit 202 detects a blank area on each page of the image data acquired by the page acquisition unit 201. The auxiliary information area detection unit 205 detects auxiliary information areas on the page corresponding to the image data. In this embodiment, the auxiliary information areas are described as including a footer area, a footnote area, and a page number area, but any other area that provides auxiliary information for the body text, rather than the body text itself, may also be treated as an auxiliary information area. The blank area detection unit 202 also uses the detection result of the auxiliary information area detection unit 205 when detecting blank areas.

The boundary page identification unit 203 identifies the boundary pages of documents in the image data to be processed, based on the detection result of the blank area detection unit 202. The additional information confirmation unit 206 checks whether the image data to be processed (document image page data) is additional information within the document content. Additional information is information indicating content that continues from the previous page, such as a "column", "reference information", or an "index". The boundary page identification unit 203 also uses the result of the additional information confirmation unit 206 when identifying boundary pages. The identification method is described in detail later. The electronic document creation unit 204 creates document data including boundary information, based on the identification result of the boundary page identification unit 203.

<Blank Area Detection>
Next, the method by which the document processing apparatus 100 according to this embodiment detects a blank area is described with reference to FIG. 3. Reference numeral 300 denotes an image corresponding to one page of document image page data read from a document.

Reference numeral 301 denotes a document area, and 302 denotes the detected blank area. The blank area 302 contains a footnote area 303 as an auxiliary information area. The auxiliary information area detection unit 205 according to this embodiment analyzes the document image page data and detects the footnote area 303 as an auxiliary information area. In doing so, the auxiliary information area detection unit 205 uses the position information of the area of interest being processed (for example, a predetermined area at the bottom of the page) and detects ruled lines, character strings, and the like in that area as an auxiliary information area. The blank area detection unit 202 treats the footnote area 303 detected by the auxiliary information area detection unit 205 as a blank area. With the footnote area 303 regarded as blank, the blank area detection unit 202 then detects the blank area extending from the lower edge of the document. In the example of FIG. 3, the blank area 302 is detected, and its height is indicated by 304.

More specifically, the blank area detection unit 202 creates a histogram of the number of black pixels on each horizontal line of the document image page data, counts, from the lower edge of the document toward the upper edge, the number of horizontal lines whose histogram value is 0, and takes the count as the height of the blank area. The present invention is not limited to this; the height of the blank area may be detected by any other method, and the area of the blank area may be obtained instead of its height. As described later, the height of the detected blank area 302 is used when identifying document boundaries.
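The histogram-based height measurement can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the page is assumed to be a binarized image given as a list of rows of 0 (white) and 1 (black) pixels, the function name `bottom_blank_height` is hypothetical, and counting is assumed to stop at the first line that contains a black pixel.

```python
def bottom_blank_height(page):
    """Return the number of consecutive all-white rows, counted upward from the bottom edge.

    `page` is a list of rows; each row is a list of 0 (white) / 1 (black) pixels.
    This binary representation is an assumption made for the sketch.
    """
    # Horizontal projection histogram: black-pixel count of every row.
    histogram = [sum(row) for row in page]

    height = 0
    # Walk from the bottom row toward the top and stop at the first non-empty row.
    for black_count in reversed(histogram):
        if black_count != 0:
            break
        height += 1
    return height


page = [
    [0, 1, 1, 0],  # text
    [0, 1, 0, 0],  # text
    [0, 0, 0, 0],  # blank
    [0, 0, 0, 0],  # blank
]
print(bottom_blank_height(page))  # 2
```

A real scan would normally need binarization and noise removal first, since a single stray black pixel ends the count.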

<Identification of Document Boundaries>
Next, the method by which the document processing apparatus 100 according to this embodiment identifies document boundaries is described with reference to FIGS. 4 and 5. Reference numerals 400, 410, 420, and 430 in FIG. 4 denote images corresponding to the document image page data of the pages read from the documents. Each page has a page number area at the lower right showing its page number. On these pages, areas 401, 412, 421, and 431 are document areas, area 411 is a drawing area, and areas 402 and 422 are blank areas.

On the first page 400, the blank area detection unit 202 detects, at the bottom, a blank area 402 whose height is indicated by 403. The boundary page identification unit 203 determines whether the blank area 402 detected by the blank area detection unit 202 has a predetermined size (for example, a predetermined height) and, if so, identifies the page as a boundary candidate. Having identified a boundary candidate, the boundary page identification unit 203 acquires the document image page data of the next page and checks it for additional information indicating that the page continues the content of the previous page.

In this embodiment, the acquired document image page data is determined to contain additional information when at least one of the following two conditions is satisfied. The first condition is that the acquired document image page data is text beginning with a predetermined character string. The predetermined character string can be set to, for example, "見本" (sample), "コラム" (column), "参考情報" (reference information), "付録" (appendix), or "appendix". The character strings can be set freely and changed according to user input. The second condition is that a drawing area or a table area is detected at the top of the acquired document image page data. Alternatively, the character strings given for the first condition may be treated as marking a boundary; that is, a boundary may be judged to lie between the page containing such a character string and the previous page. The user can configure these settings as desired.
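The two conditions can be expressed as a small predicate. The sketch below assumes that text recognition and layout analysis have already been performed, so the page is represented only by its leading text and by the kind of region found at its top; the names `has_additional_info`, `first_text`, and `top_region_kind` are hypothetical, and the case-insensitive match is an assumption.

```python
# Example keywords taken from the description; the list is user-configurable.
CONTINUATION_KEYWORDS = ("見本", "コラム", "参考情報", "付録", "appendix")


def has_additional_info(first_text, top_region_kind, keywords=CONTINUATION_KEYWORDS):
    """Return True if the page looks like a continuation of the previous page.

    first_text:      text at the start of the page (assumed to come from OCR).
    top_region_kind: kind of the topmost region, e.g. "text", "figure", "table"
                     (assumed to come from layout analysis).
    """
    normalized = first_text.strip().lower()
    # Condition 1: the page text begins with one of the predetermined strings.
    starts_with_keyword = normalized.startswith(tuple(k.lower() for k in keywords))
    # Condition 2: a drawing area or table area sits at the top of the page.
    figure_or_table_on_top = top_region_kind in ("figure", "table")
    return starts_with_keyword or figure_or_table_on_top


print(has_additional_info("Appendix A: Tables", "text"))   # True
print(has_additional_info("1. Introduction", "figure"))    # True
print(has_additional_info("1. Introduction", "text"))      # False
```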

In the example of FIG. 4, a drawing area 411 is present at the top of the second page 410. In other words, it can be inferred that the figure drawn in the drawing area 411 did not fit in the blank area 402 at the bottom of the first page 400 and was therefore moved to the top of the following second page 410. The boundary page identification unit 203 accordingly determines that the first page 400, registered as a document boundary candidate, is not actually a boundary and that the document continues on the second page 410 and later.

Because the second page 410 has no blank area, processing moves on to the next page, the third page 420. On the third page 420, a blank area 422 is identified, and the page is set as a new boundary candidate. The boundary page identification unit 203 then acquires and analyzes the document image page data of the next page. On the fourth page 430, the document area starts at the top of the page, and no predetermined character string, figure, or table is detected. The boundary page identification unit 203 therefore determines that a document boundary 440 lies between the third page 420, set as the boundary candidate, and the fourth page 430, and registers it.

As described above, the document processing apparatus 100 according to this embodiment identifies document boundaries by analyzing a blank area of a predetermined size at the bottom of a page together with the page in which that blank area was detected and the pages that follow it (hereinafter called adjacent pages). In other words, document boundaries are identified based on information on adjacent pages in addition to the blank area.

FIG. 5 shows examples of adjacent pages that contain additional information. Page 500 contains the character string 501 "見本" (sample) and a document area 502. Page 510 is a blank page, that is, an adjustment page inserted for page alignment. The boundary page identification unit 203 therefore determines that both 500 and 510 contain additional information. On detecting a page containing such additional information, the boundary page identification unit 203 determines that the page identified as a boundary candidate is not a boundary and that a continuation page exists.

<Processing Procedure>
Next, the processing procedure for identifying boundary pages in document image page data in this embodiment is described with reference to FIG. 6. The processing described below is implemented, for example, by the CPU 101 reading a program stored in the ROM 103 or the storage device 104 into the RAM 102 and executing it.

In S100, the page acquisition unit 201 reads the bundle of documents to be divided automatically, page by page, as document image page data, using the scanner 105 or the like. Here, a bundle of documents is a set of documents in miscellaneous formats bound together, for example a stack of paper taken as-is from a paper file. Document image data read by an external device may instead be acquired through the communication I/F 106. Next, in S111, the CPU 101 determines whether all pages (document image page data) of the document image data read in S100 have been processed. If they have, processing moves to S112; if not, processing moves to S101.

In S101, the CPU 101 acquires the first unacquired page of the read document image data as document image page data. Here, an unacquired page is a page whose processing has not been completed. Once the page is acquired, the auxiliary information area detection unit 205 detects the auxiliary information areas of the acquired page in S102. This can be done, for example, by analyzing the layout using the document's layout rules and ruled line information. Alternatively, an auxiliary information area may be detected from the position information or font information of character strings obtained from character recognition results, or when a character string contains a predetermined character or symbol. The position information of each detected auxiliary information area (area of interest) may additionally be used to select the areas that were detected correctly. Using processing of this kind, the auxiliary information area detection unit 205 detects auxiliary information areas located at the bottom of the page, such as the footer area, footnote area, and page number area.
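The position-based selection mentioned for S102 might look like the sketch below. It is only an illustration under stated assumptions: candidate areas are given as `(top, left, bottom, right)` boxes in pixels, an area is accepted when it starts within the lowest part of the page, and the function name `select_auxiliary_boxes` and the `bottom_fraction` parameter (20% by default) are hypothetical values not given in the text.

```python
def select_auxiliary_boxes(candidate_boxes, page_height, bottom_fraction=0.2):
    """Keep only the candidate areas that lie in the bottom part of the page.

    candidate_boxes: list of (top, left, bottom, right) boxes in pixels (assumed format).
    page_height:     page height in pixels.
    bottom_fraction: share of the page height regarded as "the bottom of the page"
                     (hypothetical parameter).
    """
    # Rows at or below this y-coordinate count as the bottom of the page.
    limit = page_height * (1 - bottom_fraction)
    return [box for box in candidate_boxes if box[0] >= limit]


candidates = [
    (900, 0, 950, 500),   # footnote near the bottom edge
    (100, 0, 150, 500),   # ruled line in the upper part of the page
    (970, 450, 990, 500), # page number
]
print(select_auxiliary_boxes(candidates, page_height=1000))
# [(900, 0, 950, 500), (970, 450, 990, 500)]
```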

Next, in S103, the blank area detection unit 202 registers the auxiliary information areas detected by the auxiliary information area detection unit 205 as blank areas. That is, when the document image data is monochrome image data, for example, the blank area detection unit 202 performs control so that non-white pixels in an area detected as an auxiliary information area are recognized as white pixels. The pixel values of the document image data need not actually be changed; the area is simply set to be recognized and processed as a blank area. For example, the coordinate position of each auxiliary information area is recorded and used in S104, described below.

In S104, the blank area detection unit 202 analyzes the document image page data by image processing and detects the blank area at the bottom of the page (the page-bottom blank area). If any of the auxiliary information areas registered in S103 overlap or adjoin the page-bottom blank area, the blank area detection unit 202 then merges them with it to create the final page-bottom blank area.
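One way to realize S103 and S104 without altering pixel values is to skip the pixels inside the registered auxiliary information areas while measuring the blank height. The sketch below takes that approach as an assumption; it does not reproduce the exact merge procedure of the embodiment. Pages are lists of rows of 0/1 pixels, boxes are half-open `(top, left, bottom, right)` ranges, and the function name is hypothetical.

```python
def bottom_blank_height_ignoring(page, ignored_boxes):
    """Count blank rows from the bottom, treating pixels inside ignored boxes as white.

    page:          list of rows of 0 (white) / 1 (black) pixels (assumed format).
    ignored_boxes: auxiliary information areas as (top, left, bottom, right),
                   half-open ranges in pixel coordinates (assumed format).
    """
    def is_ignored(y, x):
        return any(t <= y < b and l <= x < r for (t, l, b, r) in ignored_boxes)

    height = 0
    for y in range(len(page) - 1, -1, -1):
        # Black pixels that are NOT covered by an auxiliary information area.
        black = sum(1 for x, px in enumerate(page[y]) if px and not is_ignored(y, x))
        if black:
            break
        height += 1
    return height


page = [
    [1, 1, 1, 1],  # body text
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],  # footnote
    [0, 0, 0, 0],
    [0, 0, 0, 1],  # page number
]
footnote = (3, 0, 4, 4)
page_number = (5, 3, 6, 4)

print(bottom_blank_height_ignoring(page, []))                       # 0
print(bottom_blank_height_ignoring(page, [footnote, page_number]))  # 5
```

Without the masks the page number on the last row hides the blank area entirely; with them, everything below the body text counts as blank.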

Next, in S105, the boundary page identification unit 203 determines whether the current page satisfies the condition for a document boundary candidate and branches according to the result. In this embodiment, the condition is defined by the position and size of the page-bottom blank area: the page is a candidate for a document boundary page (a sign of a boundary page) when the page-bottom blank area touches the bottom edge of the page and its height is at least a predetermined height H. The page-bottom blank area touching the bottom edge of the page means that no non-blank area exists below it. The value of H is set in advance. Alternatively, H may be set dynamically: for example, H may be set to the result of statistically processing the page-bottom blank heights of the pages up to the previous page (for example, their average), or to the page-bottom blank height of the previous page. If the current page satisfies the condition, processing proceeds to S106; otherwise, processing returns to S111.
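The S105 condition and the dynamic choice of H can be sketched as follows. The function names and the fallback value used when no earlier page exists are assumptions for illustration; the text only gives the average of earlier heights as one example of statistical processing.

```python
from statistics import mean


def is_boundary_candidate(blank_height, touches_bottom, threshold_h):
    """S105: the page is a boundary candidate when the page-bottom blank area
    touches the bottom edge and is at least `threshold_h` high."""
    return touches_bottom and blank_height >= threshold_h


def dynamic_threshold(previous_heights, default_h):
    """Derive H from the blank heights of the pages seen so far.

    Uses the average, which the description names as one example; `default_h`
    (hypothetical) is returned when there is no earlier page.
    """
    return mean(previous_heights) if previous_heights else default_h


heights_so_far = [100, 200]
h = dynamic_threshold(heights_so_far, default_h=150)
print(h)                                     # 150
print(is_boundary_candidate(180, True, h))   # True
print(is_boundary_candidate(120, True, h))   # False
print(is_boundary_candidate(180, False, h))  # False
```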

Next, in S106, the CPU 101 acquires the next document image page data from the document image data read in S100 so that the additional information confirmation unit 206 can check whether additional information appears on the next page. In S107, the additional information confirmation unit 206 determines whether the document image page data acquired in S106 contains additional information relating to the preceding page, that is, the boundary candidate. The determination method has already been described with reference to FIGS. 4 and 5, so the details are omitted. Processing branches according to the result. If the acquired document image page data is determined to contain additional information, processing moves to S102, and S102 through S105 are then executed to search for the candidate last page of the additional information (a document boundary candidate). If the acquired document image page data is not determined to contain additional information, or if no next document image page data could be acquired in S106, processing moves to S108.

In S108, the CPU 101 acquires the next document image page data from the document image data read in S100 in order to check whether the next page is a blank page. In S109, the boundary page identification unit 203 determines whether the document image page data acquired in S108 is a blank page and branches according to the result. In this embodiment, a blank page is determined by detecting a blank area in the document image page data: the page is judged to be blank when the detected blank area touches both the top edge and the bottom edge of the page. If the acquired document image page data is judged to be a blank page, processing returns to S108; if it is not, or if no next document image page data could be acquired in S108, processing moves to S110.

In S110, the boundary page identification unit 203 sets a document boundary, treating the most recently processed document image page data as the boundary page of the document, and returns processing to S111. When it is subsequently determined in S111 that all pages have been processed, processing proceeds to S112, where the electronic document creation unit 204 creates electronic documents according to the one or more document boundaries identified by the boundary page identification unit 203, and the processing ends.
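The overall control flow of S101 through S110 can be sketched as a single loop over pages. This is one interpretation of the flowchart, not a definitive implementation: each page is assumed to have been analyzed already and is reduced to three hypothetical fields (`blank_height`, `has_additional_info`, `is_blank`), the page fetched in S106 and S108 is treated as the same look-ahead page, and the boundary is recorded as the zero-based index of the last page of a document.

```python
from dataclasses import dataclass


@dataclass
class Page:
    blank_height: int = 0              # page-bottom blank height (S102 to S104)
    has_additional_info: bool = False  # continuation keyword or figure/table on top (S107)
    is_blank: bool = False             # blank page for page alignment (S109)


def find_boundaries(pages, threshold_h):
    """Return the zero-based indices of the pages that end a document."""
    boundaries = []
    i = 0
    while i < len(pages):                                    # S111
        if pages[i].blank_height < threshold_h:              # S105: not a candidate
            i += 1
            continue
        j = i + 1                                            # S106: look at the next page
        if j < len(pages) and pages[j].has_additional_info:  # S107
            i = j                                            # candidate rejected; continue from page j
            continue
        while j < len(pages) and pages[j].is_blank:          # S108/S109: skip blank pages
            j += 1
        boundaries.append(j - 1)                             # S110: last processed page
        i = j
    return boundaries


# The four pages of FIG. 4: page 2 starts with a figure, page 4 starts with body text.
fig4 = [
    Page(blank_height=300),
    Page(blank_height=0, has_additional_info=True),
    Page(blank_height=250),
    Page(blank_height=0),
]
print(find_boundaries(fig4, threshold_h=100))  # [2]
```

The result `[2]` corresponds to the boundary 440 between the third and fourth pages. In this sketch, trailing blank pages are attached to the document they follow.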

As described above, in the document creation method according to this embodiment, a document is acquired as image data page by page, and a sign of a boundary page (a blank area) that satisfies a predetermined condition is detected for each page from the acquired image data. The method then identifies one or more boundary pages that form document boundaries, based on the page containing the detected blank area and information on the pages adjacent to it, and creates document data divided at the positions of the identified boundary pages. As a result, document boundaries can be correctly identified even when a footnote or page number appears at the very bottom of a boundary page, when figures or tables belonging to the document are placed on the following pages as appended tables, or when information supplementing the document's content is added on the following pages. Document boundaries can also be correctly identified when a blank page has been added at the end of a document for reasons such as bookbinding. Thus, according to this embodiment, the boundaries of documents of various configurations can be accurately determined from the image data obtained by reading those documents.

<Modification>
The present invention is not limited to the above embodiment, and various modifications are possible. In the above embodiment, the blank area detection unit 202 detects a blank area that satisfies a predetermined condition, and the page boundary is specified by analyzing the document image page data containing that blank area together with a predetermined number of adjacent page data. However, the configuration is not limited to this: when a sign other than a blank area is detected, the page boundary may be determined by analyzing the page containing that sign and a predetermined number of pages. For example, the blank area detection unit may be replaced by a boundary page sign detection unit that detects a predetermined character string in the document image page data. The predetermined character string is, for example, "〆" or "以上" (a closing phrase marking the end of a document). Alternatively, the boundary page sign detection unit may be configured to detect a blank area in addition to the predetermined character string. In that case, the page boundary can be specified more accurately by analyzing the document image page data containing both the blank area and the character string together with a predetermined number of adjacent page data.
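The character-string variant can be sketched as follows. The sketch assumes the page text is available as a list of recognized lines and that the closing string must occupy the whole of the last non-empty line, which is a stricter rule than the source states. The function names and the `require_both` switch are hypothetical.

```python
# Example closing strings named in the text; any other entries would be assumptions.
CLOSING_STRINGS = ("〆", "以上")


def has_closing_string(text_lines, closing_strings=CLOSING_STRINGS):
    """Return True if the last non-empty line consists solely of a closing string."""
    for line in reversed(text_lines):
        # Drop all whitespace, including full-width spaces, so "以　上" matches "以上".
        compact = "".join(line.split())
        if compact:
            return compact in closing_strings
    return False


def has_boundary_sign(text_lines, has_bottom_blank, require_both=False):
    """Combine the character-string sign with a blank-area sign detected elsewhere."""
    found = has_closing_string(text_lines)
    if require_both:
        return found and has_bottom_blank
    return found or has_bottom_blank
```

Requiring the whole line to match avoids treating ordinary sentences that merely end in "以上" (as in "10個以上") as a closing mark; whether the patented method makes this distinction is not stated.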

<Other embodiments>
The present invention can also be realized by supplying a program that implements one or more functions of the above embodiment to a system or apparatus via a network or a storage medium, and having one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.

The invention is not limited to the embodiments described above, and various changes and modifications are possible without departing from the spirit and scope of the invention. Accordingly, the claims are appended to make the scope of the invention public.

101: CPU, 102: RAM, 103: ROM, 104: storage device, 105: scanner, 106: communication I/F, 201: page acquisition unit, 202: blank area detection unit, 203: boundary page specifying unit, 204: electronic document creation unit

Claims

1. A document creation method comprising:
an acquisition step of acquiring a document as image data page by page;
a detection step of detecting a sign of a boundary page that satisfies a predetermined condition for each page from the acquired image data;
an identifying step of identifying one or more boundary pages serving as document boundaries based on the page containing the sign of the boundary page detected in the detection step and information on adjacent pages adjacent to the page;
and a creating step of creating document data divided at the positions of the one or more boundary pages identified in the identifying step.

2. The document creation method according to claim 1, wherein said detection step detects a blank area that satisfies said predetermined condition as said boundary page sign.

3. The document creation method according to claim 2, further comprising a confirmation step of confirming, in the adjacent page, additional information indicating content continued from the page preceding the adjacent page, wherein, in said identifying step, said one or more boundary pages are identified using a confirmation result obtained in said confirmation step.

4. The document creation method according to claim 3, wherein, in said identifying step, if said adjacent page contains additional information, said boundary page is identified from adjacent pages after said adjacent page.

5. The document creation method according to claim 3, wherein, in said confirmation step, when a predetermined character string or chart is included at the head of said adjacent page, it is determined that said additional information is included.

6. The document creation method according to any one of claims 2 to 5, wherein, in said detection step, an auxiliary information area containing auxiliary information of said document is detected, and said detected auxiliary information area is treated as a blank area.

7. The document creation method according to claim 6, wherein the auxiliary information area includes at least one of a footer area, a footnote area, and a page number area.

8. The document creation method according to claim 6, wherein in said detecting step, said auxiliary information area is detected based on positional information of an area of interest on a page.

9. The document creation method according to claim 8, wherein in said detecting step, said auxiliary information area is detected based on font information of characters included in said area of interest.

10. The document creation method according to claim 8, wherein, in said detection step, if said area of interest contains a predetermined symbol, said area of interest is detected as said auxiliary information area.

11. The document creation method according to any one of claims 2 to 10, wherein in said detecting step, at least one of a position on the page and a size of said detected blank area is detected.

12. The document creation method according to any one of claims 2 to 11, wherein, in said detecting step, said blank area is detected on condition that no area different from said blank area exists below said blank area on the page.

13. The document creation method according to claim 1, wherein, in said detecting step, a page including a predetermined character string at the bottom of the page is detected, as said predetermined condition, as a sign of said boundary page.

14. The document creation method according to any one of claims 1 to 13, wherein in said acquisition step, image data read from a document by a scanner is acquired.

15. The document creation method according to any one of claims 1 to 13, wherein in said acquisition step, image data read from a document is acquired from an external device.

16. A document creation device comprising:
acquisition means for acquiring a document as image data page by page;
detection means for detecting a sign of a boundary page that satisfies a predetermined condition for each page from the acquired image data;
identifying means for identifying one or more boundary pages serving as document boundaries based on the page containing the sign of the boundary page detected by the detection means and information on adjacent pages adjacent to the page;
and creating means for creating document data divided at the positions of the one or more boundary pages identified by the identifying means.