JP2020170445A

JP2020170445A - Data extraction method and system from digital document

Info

Publication number: JP2020170445A
Application number: JP2019072770A
Authority: JP
Inventors: 明男島田; Akio Shimada; 光雄早坂; Mitsuo Hayasaka
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-04-05
Filing date: 2019-04-05
Publication date: 2020-10-15
Anticipated expiration: 2039-04-05
Also published as: JP7252818B2

Abstract

To improve the quality of data analysis for digital documents. SOLUTION: A system for extracting data from digital documents according to one preferred aspect of the present invention includes: a conversion processing unit that converts a digital document, which contains objects in document pages and coordinate information on those objects, into image data and text data with coordinate information; a text data analysis unit that extracts, from the text data, a caption related to an object the user wants to extract; and an image analysis unit that detects the object the user wants to extract from the image data on the basis of the extracted caption information. SELECTED DRAWING: Figure 3

Description

The present invention relates to a technique for extracting data such as graphs and tables from a digital document.

Services that use data analysis based on artificial intelligence (AI) and similar techniques to analyze newspaper articles, papers, corporate financial reports, and other material published as digital documents, and that provide users with valuable information, have become common. As preparation for data analysis, a large number of digital documents collected from the Web and other sources must be converted into a format that is easy to use. Digital documents are provided in various file formats, such as Hyper Text Markup Language (HTML) and Extensible Markup Language (XML), and documents in these formats contain, in addition to the data itself, meta information that indicates the structure of the data. As data preparation, a conversion process is required that analyzes the structure of such a digital document, extracts only the data to be analyzed, such as the title, body text, author, publication date, tables, and graphs, and stores it in a storage device such as a database (DB).

Many digital documents are now distributed as Portable Document Format (PDF) files, so data analysis is required to handle PDF files. As data preparation, the structure of the PDF file must be analyzed and the data to be analyzed must be extracted. Tables and graphs are likely to hold important information, so there is a strong demand for extracting table and graph data.

A PDF file has a complicated format in which, for example, text and binary data are mixed, and its data structure is difficult to analyze as it is. In addition, a PDF file consists of the objects in each document page (text, images, and so on) and coordinate information indicating where each object is located on the page; no information indicating what kind of data each object represents is embedded. For example, the PDF file does not record whether a piece of text on a page is body text or table data. Combined with the complexity of the format, this makes it difficult to identify and extract data in a PDF document.

One conventional approach converts the text data of a PDF file into a format whose data structure is easy to analyze and then extracts the table data. Table data is extracted from a PDF file by table detection, layout recognition, and extraction of the text in the table. Table detection finds a table in a document page of the PDF file. Layout recognition determines which row and column each cell of the table belongs to. Extraction of the text in the table extracts the text inside the detected table. The extracted text is saved in a storage device in a form such as an HTML table representation that reproduces the structure (layout) of the original table, or a DB table.

One conventional technique converts the PDF file into an image file and analyzes the data structure. Image information such as the ruled lines of a table is analyzed to detect the table and recognize its layout. The text in the table is extracted by OCR (Optical Character Recognition/Reader). Patent Document 1 discloses a technique that extracts a table as image data and extracts the table data by image analysis.

The other technique converts the PDF file into a text file with coordinate information and analyzes the data structure. A text extraction tool such as PDFMiner (https://github.com/euske/pdfminer/) produces a text file in which the text of the PDF document is annotated with its coordinates on the page. Coordinates are assigned to reasonably coherent chunks of text (per unit, per sentence, and so on). Tables are detected and their layout is recognized by analyzing the text with coordinate information and the positional relationships between the pieces of text. Non-Patent Document 1 is an example.

US20150093021A1

Martha et al., “TAO: System for Table Detection and Extraction from PDF Documents”, FLAIRS Conference, 2016

The method of Patent Document 1, which extracts table data by image analysis, has the problem that an object on the page that is not a table may be erroneously detected as a table. In layout recognition, it also cannot accurately recognize the layout of a table that lacks image information, such as a table that contains only horizontal ruled lines. Furthermore, OCR may misrecognize text when the text in the table is extracted.

The method of Non-Patent Document 1, which extracts table data by analyzing text with coordinate information, cannot use image information such as the ruled lines of the table, so it is difficult to recognize the extent of the table correctly during table detection.

An object of the present invention is therefore to improve the quality of data analysis for digital documents.

One preferred aspect of the present invention is a system for extracting data from a digital document, including: a conversion processing unit that converts a digital document, which contains objects in document pages and coordinate information on those objects, into image data and text data with coordinate information; a text data analysis unit that extracts, from the text data, a caption related to the object to be extracted; and an image analysis unit that detects the object to be extracted from the image data on the basis of the extracted caption information.

Another preferred aspect of the present invention is a method for extracting a desired object from a digital document, including: a text data generation process that generates text data with coordinate information from the digital document; a caption detection process that detects the caption of the desired object in the text data by keyword search; and an object extraction process that extracts the desired object from the image data of the page of the digital document on which the caption was detected.

The quality of data analysis for digital documents is improved. More specific problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

A block diagram showing the hardware configuration of the data extraction device according to the embodiment.
A logical block diagram showing the logical configuration of the data extraction program according to the embodiment.
A flowchart showing the process of detecting table captions.
A table showing an example of the text file with coordinate information obtained by converting a PDF file.
A table showing examples of the sentence patterns used to distinguish table captions from body text when detecting table captions.
A table showing an example of the table caption detection results sent to the image analysis unit.
A flowchart showing the process of table detection and layout recognition.
An explanatory diagram showing how the range of image analysis is limited during table detection.
A flowchart showing the process of layout recognition for a table whose cells are drawn as a grid by ruled lines.
A flowchart showing the process of layout recognition for a table that lacks ruled-line information.
A flowchart showing the process of row and column recognition for a table that contains merged cells and lacks ruled-line information.
A table showing an example of a table subjected to histogram analysis.
A table showing an example of the result of cell recognition by histogram analysis.
An explanatory diagram showing an example in which the table layout is recognized correctly.
An explanatory diagram showing an example in which the table layout is recognized incorrectly.
An explanatory diagram showing a method of recognizing the table layout by analyzing the body text of the PDF document in the text file with coordinate information.
A table showing the structure of the table coordinate information extracted by the data extraction program.
A flowchart showing the process in which the data extraction program extracts and outputs the text in a table.
An explanatory diagram showing an example of the format in which the data extraction program outputs table data.
A flowchart showing the process in which the data extraction program extracts, from the text file with coordinate information, information indicating the nature of the extracted table data.
A table showing examples of the sentence patterns used to extract, from the text file with coordinate information, information indicating the nature of the extracted table data.
A table showing examples of the sentence patterns used to extract, from the text file with coordinate information, information indicating the nature of the extracted graph data.
A flowchart showing the process in which the data extraction program detects and outputs graphs.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below do not limit the invention according to the claims, and not all of the elements and combinations thereof described in the embodiments are necessarily essential to the solution provided by the invention.

In the configurations of the invention described below, the same reference numerals are used across different drawings for identical parts or parts having similar functions, and duplicate descriptions may be omitted.

When there are multiple elements having the same or similar functions, they may be described with different subscripts added to the same reference numeral. When the elements do not need to be distinguished, the subscripts may be omitted.

Notations such as "first", "second", and "third" in this specification are used to identify constituent elements and do not necessarily limit their number, order, or content. Numbers for identifying constituent elements are used per context, and a number used in one context does not necessarily denote the same configuration in another context. A constituent element identified by one number is not precluded from also serving the function of a constituent element identified by another number.

To facilitate understanding of the invention, the position, size, shape, range, and so on of each component shown in the drawings may not represent the actual position, size, shape, or range. The present invention is therefore not necessarily limited to the positions, sizes, shapes, and ranges disclosed in the drawings.

Constituent elements expressed in the singular in this specification include the plural unless the context clearly indicates otherwise.

In the following description, processing may be described with a "program" as the subject. A program is executed by a processor (for example, a Central Processing Unit (CPU)) and performs the defined processing while using storage resources (for example, memory) and/or communication interface devices (for example, ports) as appropriate, so the program may be treated as the subject of the processing. Processing described with a program as the subject may also be regarded as processing performed by the processor or by a computer having that processor.

The embodiments described in detail below extract desired objects such as tables and graphs from a digital document. An outline follows. In the system of one embodiment, a PDF file is converted into a text file with coordinate information, and table captions are detected in that text file. Each document page of the PDF file on which a caption was detected is then converted into an image file. Image analysis is performed on the converted page to detect the table. At this point, the range of the image subjected to image analysis is limited on the basis of the coordinates of the table caption, which are obtained from the text file with coordinate information.

Although the embodiments are described using PDF as an example, the invention is not limited to PDF and applies to any digital document that contains the objects in its document pages (text, images, and so on) and coordinate information on those objects. A text file with coordinate information and an image file can easily be generated from such a digital document.

Performing image analysis only on pages that contain a table caption avoids false detections in which something that is not a table is detected as a table on a page that contains no table. Limiting the range of image analysis according to the position of the table caption reduces the information that acts as noise in image analysis, and so reduces the likelihood of recognizing a non-table object as a table. Because table detection is performed by image analysis, information such as ruled lines can be used, and the extent of the table can be recognized accurately. A table generally presents data such as characters and numbers in cells separated into a grid by vertical and horizontal ruled lines, but some or all of the ruled lines may be omitted, and cells may be partially merged or split.

Layout recognition is performed on the table detected by image analysis. For a table that lacks image information, such as a table that contains only horizontal ruled lines, body text that explains the structure of the table can be extracted from the text file with coordinate information and used to infer the table layout. This makes it possible to recognize the table layout accurately.

The coordinate information of each cell is extracted from the table that has been detected and whose layout has been recognized by image analysis. This coordinate information is collated with the contents of the text file with coordinate information, and the text of each cell is extracted from that text file. This avoids text misrecognition by OCR. In addition, the text file with coordinate information can be collated with predefined sentence patterns so that body text explaining the contents of the table is extracted as information indicating the nature of the table data.
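The collation of cell coordinates with coordinate-annotated text can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `text_in_cell`, the dictionary keys, and the rule that a piece of text belongs to a cell when its center point lies inside the cell rectangle are all hypothetical, and both coordinate sets are assumed to already be in the same coordinate system.

```python
def text_in_cell(cell, tokens):
    """Return the text of all tokens whose center lies inside the cell.

    cell   : (x0, y0, x1, y1) rectangle obtained from image analysis
             (hypothetical layout: left, top, right, bottom).
    tokens : list of dicts with upper-left x/y, width, height, and text,
             as read from the text file with coordinate information.
    """
    x0, y0, x1, y1 = cell
    picked = []
    for tok in tokens:
        # Assumption: a token belongs to the cell that contains its center.
        cx = tok["x"] + tok["width"] / 2
        cy = tok["y"] + tok["height"] / 2
        if x0 <= cx < x1 and y0 <= cy < y1:
            picked.append(tok)
    # Read tokens top to bottom, then left to right.
    picked.sort(key=lambda t: (t["y"], t["x"]))
    return " ".join(t["text"] for t in picked)


# Hypothetical sample data: two cells in one row and three tokens.
cells = {"r0c0": (0, 0, 100, 20), "r0c1": (100, 0, 200, 20)}
tokens = [
    {"x": 10, "y": 4, "width": 30, "height": 10, "text": "Item"},
    {"x": 150, "y": 4, "width": 30, "height": 10, "text": "price"},
    {"x": 110, "y": 4, "width": 30, "height": 10, "text": "Unit"},
]
cell_text = {name: text_in_cell(rect, tokens) for name, rect in cells.items()}
```

Because the characters come from the text layer of the document rather than from recognized pixels, no OCR step is involved in this lookup.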

In another embodiment, a PDF file is converted into a text file with coordinate information, and graph captions are detected in that text file. Each document page of the PDF file on which a caption was detected is then converted into an image file. Image analysis is performed on the converted page to detect the graph. At this point, the range of the image subjected to image analysis is limited on the basis of the coordinates of the graph caption, which are obtained from the text file with coordinate information. Performing image analysis only on pages that contain a graph caption avoids false detections in which something that is not a graph is detected as a graph on a page that contains no graph. Limiting the range of image analysis according to the position of the graph caption reduces the information that acts as noise in image analysis, and so reduces the likelihood of detecting a non-graph object as a graph. A graph is generally a figure that shows the relationship between two or more mutually related quantities.

In addition, the text file with coordinate information can be collated with predefined sentence patterns so that body text explaining the contents of the graph is extracted from that text file as information describing the graph data.

An outline of the computer system of the data extraction device according to Embodiment 1 is described with reference to FIG. 1. FIG. 1 shows the hardware configuration of the computer system according to the embodiment. The data extraction device 100 executes a data extraction program and extracts table data from a PDF file. The CPU 102 is connected via the bus 101 to the memory 103, which is the main storage device, and executes the data extraction program in the memory 103.

The CPU 102 and the memory 103 are connected via the bus 101 to the disk device 104, which is a secondary storage device. The data extraction program executed on the CPU 102 can read a PDF file stored in the disk device 104 into the memory 103 and can write data to a PDF file stored in the disk device 104.

The above hardware configuration can be implemented with a general-purpose server. Although not shown, the device may include the various input and output devices that a general-purpose server has. The configuration may consist of a single server, or any part of the input device, output device, CPU, and disk device may be provided by another server connected via a network. Functions equivalent to those implemented in software in this embodiment can also be realized by hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).

Although this embodiment uses a PDF file as the target file, the target may be a digital document in another format as long as it contains the objects in its document pages (text, images, and so on) and coordinate information indicating where each object is located on the page.

FIG. 2 shows the functional block configuration of the data extraction program 200 according to Embodiment 1. The conversion processing unit 201 reads a PDF file stored in the disk device 104 and converts it into a text file with coordinate information. It also reads a PDF file stored in the disk device 104 and converts it into an image file.

The text data analysis unit 202 analyzes the text file with coordinate information and extracts table captions. It also analyzes the text file with coordinate information to extract body text that explains the structure of a table and body text that indicates the nature of a table. Here, a caption is a heading or title attached to a photograph, illustration, table, graph, or the like. In Embodiment 1, captions attached to tables (for example, "Table 1: Data of ○○○") are extracted.

The image analysis unit 203 analyzes the image file on the basis of the table caption information received from the text data analysis unit 202 and detects the table. It also performs table layout recognition on the basis of the body text explaining the structure of the table, which the text data analysis unit 202 obtained by analyzing the text file with coordinate information. It then extracts the coordinate information of the table.

The data storage unit 204 collates the table coordinate information extracted by the image analysis unit 203 with the text file with coordinate information and extracts the text in the table. It then outputs the extracted text as a file or a DB table.

FIG. 3 shows process 300, in which table captions are extracted, within the series of processes by which the data extraction program 200 according to Embodiment 1 extracts table data. First, the conversion processing unit 201 converts the PDF file from which table data is to be extracted into a text file with coordinate information (process 310).

File 400 in FIG. 4 shows an example of a text file with coordinate information. As shown in FIG. 4, a tag is attached to each document page of the converted PDF file, with the horizontal and vertical size of the page (page size) and the page number attached as meta information. A tag is also attached to each piece of text in the document page, with the coordinate information of the text attached as meta information. In the example of FIG. 4, the horizontal and vertical coordinates of the upper-left corner of the text and the width and height of the text are attached as coordinate information. In the example, a PAGE tag is attached to each page and a TOKEN tag to each piece of text, but other tag names may be used. The conversion into a text file with coordinate information can be performed with an existing text extraction tool such as PDFMiner, or a custom conversion process may be developed and incorporated.
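Reading such a file can be sketched in Python as follows. This is a minimal sketch that assumes the file is well-formed XML. The PAGE and TOKEN tag names come from the description above, while the DOCUMENT root element, the attribute names (`number`, `width`, `height`, `x`, `y`), and the sample values are hypothetical, since FIG. 4 is not reproduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample in the spirit of file 400; attribute names are assumed.
SAMPLE = """<DOCUMENT>
  <PAGE number="1" width="612" height="792">
    <TOKEN x="72" y="100" width="180" height="12">Table 1: Results of the experiment</TOKEN>
    <TOKEN x="72" y="130" width="40" height="12">Item</TOKEN>
  </PAGE>
</DOCUMENT>"""


@dataclass
class Token:
    page: int        # page number from the enclosing PAGE tag
    x: float         # horizontal coordinate of the upper-left corner
    y: float         # vertical coordinate of the upper-left corner
    width: float
    height: float
    text: str


def parse_tokens(xml_text):
    """Collect every TOKEN together with the number of its page."""
    root = ET.fromstring(xml_text)
    tokens = []
    for page in root.iter("PAGE"):
        page_no = int(page.get("number"))
        for tok in page.iter("TOKEN"):
            tokens.append(Token(
                page=page_no,
                x=float(tok.get("x")),
                y=float(tok.get("y")),
                width=float(tok.get("width")),
                height=float(tok.get("height")),
                text=(tok.text or "").strip(),
            ))
    return tokens


tokens = parse_tokens(SAMPLE)
```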

Next, the text data analysis unit 202 extracts table captions from the text file with coordinate information. First, it searches the text file with coordinate information for text containing keywords that are likely to appear in a table caption, such as "表" or "Table" (process 311). The search is performed on the text carrying a TOKEN tag. When text containing a keyword is found, that text is extracted.
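The keyword search of process 311 can be sketched in Python as follows. The two keywords come from the description above. The word-boundary rule for "Table", the dictionary layout of a token, and the sample texts are assumptions made for illustration.

```python
import re

# Keywords named in the description; matching "Table" only as a whole word
# is an assumption, not something the description specifies.
KEYWORD_RE = re.compile(r"表|\bTable\b")


def find_keyword_tokens(tokens):
    """Return the tokens whose text contains a table-caption keyword."""
    return [tok for tok in tokens if KEYWORD_RE.search(tok["text"])]


# Hypothetical tokens as they might be read from the TOKEN tags.
tokens = [
    {"page": 1, "x": 72, "y": 100, "text": "Table 1: Results"},
    {"page": 1, "x": 72, "y": 300, "text": "The method is stable."},
    {"page": 2, "x": 72, "y": 80, "text": "表2：売上データ"},
]
hits = find_keyword_tokens(tokens)
```

Every hit is only a caption candidate at this stage, because body text that refers to a table contains the same keywords.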

Next, on the basis of the coordinate information in the text file, the text surrounding the text that contains the keyword is also extracted (process 312). The range from which text is extracted is set in advance. The whole of the extracted text is then collated with preset sentence patterns (process 313).

FIG. 5A is an example of a table that stores the preset sentence patterns to be collated with the extracted text. These sentences are expressions used in body text to cite or refer to a table, and they are unlikely to be used as a table caption. Sentence patterns such as those shown in table 500 are defined in advance and stored in the disk device 104 or elsewhere.

If the text as a whole forms a continuous sentence that matches one of the sentence patterns in table 500, such as "Table ○○ shows ××" (where ○○ and ×× are arbitrary words or characters), the text containing the keyword is judged not to be a table caption (process 314: No).

If the text does not form such a sentence, the keyword-containing text is judged to be a table caption (process 314: Yes). The page number and page size of the page containing the text, the text itself, and the coordinate information of the text are then extracted as table caption information and sent to the image analysis unit 203 (process 315). The series of processes 312 to 315 is repeated once for each text obtained by the keyword search (process 316).
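The caption decision in processes 311 to 314 can be sketched roughly as follows. This is only an illustration: the keyword list and the reference patterns are hypothetical stand-ins for table 500, whose actual contents are not reproduced here, and the sketch uses a substring search where the description speaks of matching the text as a whole.

```python
import re

# Hypothetical keywords and in-text reference patterns (stand-ins for table 500).
CAPTION_KEYWORDS = ("Table", "表")
REFERENCE_PATTERNS = [
    re.compile(r"Table\s+\S+\s+(shows|indicates|lists)\b"),
    re.compile(r"(shown|listed) in Table\s+\S+"),
]


def is_table_caption(keyword_text: str, surrounding_text: str) -> bool:
    """Return True when the keyword text looks like a caption, not a reference."""
    # Process 311: the text must contain a caption keyword.
    if not any(keyword in keyword_text for keyword in CAPTION_KEYWORDS):
        return False
    # Processes 313 and 314: reject text that reads like a sentence citing a table.
    return not any(p.search(surrounding_text) for p in REFERENCE_PATTERNS)
```

With these assumed patterns, "Table 1 Reaction yields" would be kept as a caption, while "Table 1 shows the yields of each step." would be rejected as a reference in the body text.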

Table 510 in FIG. 5B shows an example of the table caption information sent to the image analysis unit 203. The page number and page size are recorded for each page, and for each page the detected captions and their coordinates (coordinates in the text file with coordinate information) are recorded.

FIG. 6 shows process 600, in which the data extraction program 200 detects tables and recognizes their layout as part of the series of processes for extracting table data. The table detection and layout recognition process 600 is executed for each page on which a table caption was detected in process 300.

First, the image analysis unit 203 refers to table 510, which the text data analysis unit 202 created in process 300, and checks the page numbers of the pages containing table captions (process 610). It then converts each such page of the PDF file into an image file (process 611). The conversion can be performed with an existing image conversion tool such as pdftoppm (https://www.xpdfreader.com/pdftoppm-man.html), or a custom conversion process may be developed and incorporated.

Next, the coordinate information of the table captions extracted by the text data analysis unit 202 in process 300 (see table 510 in FIG. 5B) is scaled to the coordinate system of the image file (process 612). Based on the size of the document page in the image file, the caption coordinates are converted from coordinates in the text file with coordinate information to coordinates in the image file. The conversion can be performed with the following formulas.

Horizontal coordinate of table caption (image file) =
Horizontal coordinate of table caption (text file) × (horizontal size of document page (image file) ÷ horizontal size of document page (text file))

Vertical coordinate of table caption (image file) =
Vertical coordinate of table caption (text file) × (vertical size of document page (image file) ÷ vertical size of document page (text file))

The size of the document page in the image file can be obtained with an image processing library such as OpenCV (https://opencv.org).
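As a minimal sketch of the scaling in process 612, the two formulas can be written as one small function. The function name and the (width, height) tuple layout are assumptions made for illustration, not names taken from the description.

```python
def text_to_image_coords(x, y, text_page_size, image_page_size):
    """Scale a point from text-file coordinates to image-file coordinates.

    text_page_size and image_page_size are (width, height) tuples.
    """
    text_w, text_h = text_page_size
    image_w, image_h = image_page_size
    # Horizontal and vertical coordinates are scaled independently.
    return x * (image_w / text_w), y * (image_h / text_h)
```

For example, for a 612 × 792 text page rendered as a 1224 × 1584 image, the point (100, 200) maps to (200.0, 400.0).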

Next, the image analysis unit 203 limits the image range in which table detection is executed (process 613). The range is limited based on the coordinates of the table caption. In the flow of FIG. 6, only the pages from which captions were extracted are converted to images, but all pages may instead be converted in advance.

FIG. 7 shows examples of how the image analysis unit 203 limits the image range in which table detection is executed. In image 700, on the premise that the table caption is placed at the upper left of the table, the range in which table detection is executed is limited to region 701. In image 710, on the same premise, the range is limited to regions 711 and 712.

As shown in image 710, when a page contains multiple table captions, the image range is limited, based on the positional relationship between the captions, so that no single range in which table detection is executed contains more than one table.
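One possible way to limit the range when a page holds several captions is sketched below. It assumes, purely for illustration, that each caption sits above its table and that each search region is a full-width horizontal band running from one caption down to the next caption or the page bottom. The exact geometry of regions 701, 711 and 712 is not given in the text, so this is not necessarily how they are drawn.

```python
def detection_regions(caption_ys, page_height):
    """Return one (top, bottom) band per caption, ordered from top to bottom.

    caption_ys are the vertical image coordinates of the captions on one page.
    """
    ordered = sorted(caption_ys)
    # Each band ends where the next caption begins; the last ends at the page bottom.
    bottoms = ordered[1:] + [page_height]
    return list(zip(ordered, bottoms))
```

Because each band stops at the next caption, no band can contain two captioned tables under the stated assumption.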

The example assumes that the table caption is located at the upper left of the table, but the caption position differs from document to document. When limiting the image range for table detection, the limiting process must therefore be adapted to the positional relationship between the caption and the table.

In this case, for example, the system may be configured so that the user specifies, before process 600 starts, the positional relationship between the table caption and the image range in which table detection is executed. Alternatively, process 613 may be omitted and the entire page containing the table caption may be used as the image range. However, using the positional relationship between the caption and the detection range can be expected to reduce the processing load and improve accuracy.

Next, the image analysis unit 203 executes table detection within the image range limited in process 613 (process 614). Any method may be used for table detection. For example, rectangles may be detected with an image processing library or the like, and each detected rectangle treated as a table. Alternatively, straight lines may be detected with an image processing library or the like; the uppermost and lowermost lines in the vertical direction are taken as the top and bottom edges of the table, and the range enclosed between them is detected as the table.
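The second method, which takes the uppermost and lowermost straight lines as the table edges, can be illustrated without any image library. The sketch below assumes the limited image range has already been binarized into rows of 0 (background) and 1 (ink), and it treats any row that is mostly ink as a horizontal ruled line. The `min_fraction` threshold is a hypothetical parameter; a real implementation would more likely use a line detector from an image processing library.

```python
def find_table_extent(binary_rows, min_fraction=0.8):
    """Return (top_row, bottom_row) of the table, or None if fewer than two lines.

    binary_rows is a list of equal-length lists containing 0 or 1.
    """
    width = len(binary_rows[0])
    # A row counts as a ruled line when at least min_fraction of it is ink.
    line_rows = [
        index for index, row in enumerate(binary_rows)
        if sum(row) >= min_fraction * width
    ]
    if len(line_rows) < 2:
        return None
    return line_rows[0], line_rows[-1]
```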

Next, the image analysis unit 203 recognizes the layout of the tables detected in process 614 (process 615). This process is executed for each table detected in process 614. Any method may be used for table layout recognition.

FIG. 8A is a flowchart of one example of the table layout recognition process. Process 615A in FIG. 8A shows an example of layout recognition for a table in which the extent of each cell is delimited by vertical and horizontal ruled lines.

In this example, rectangle detection is first executed on the table range detected in process 614, using an image processing library or the like (process 8110). Each detected rectangle is recognized as a cell of the table. The coordinate information of each cell, for example the horizontal and vertical coordinates of its upper-left point and its width and height, is then extracted using an image processing library or the like (process 8111). Based on this coordinate information, the cells adjacent to each cell in the vertical and horizontal directions are determined, and the cells belonging to the same column and the same row are recognized (process 8112). The image analysis unit 203 executes the above processing by image analysis.
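A simplified sketch of process 8112 is shown below. It assumes the cells have already been extracted as (x, y, width, height) tuples and groups them by nearly equal top edges (rows) or left edges (columns). This is a stand-in for the adjacency test described above, it ignores merged cells, and the tolerance `tol` is a hypothetical parameter.

```python
def cluster_by(cells, key_index, tol=3):
    """Group cell indices whose coordinate at key_index is within tol of the group's first cell.

    key_index 0 groups by left edge (columns); key_index 1 groups by top edge (rows).
    """
    order = sorted(range(len(cells)), key=lambda i: cells[i][key_index])
    groups = []
    for i in order:
        if groups and abs(cells[i][key_index] - cells[groups[-1][0]][key_index]) <= tol:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


cells = [(0, 0, 50, 20), (50, 0, 50, 20), (0, 20, 50, 20), (51, 21, 50, 20)]
rows = cluster_by(cells, key_index=1)     # cells sharing a top edge
columns = cluster_by(cells, key_index=0)  # cells sharing a left edge
```

The tolerance absorbs the one-pixel jitter in the last cell, so the four cells are grouped as a 2 × 2 grid.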

Process 615B in FIG. 8B shows an example of layout recognition for a table that contains only horizontal ruled lines. FIG. 8C is a flowchart showing the details of process 8212 within process 615B.

In the example of process 615B, the image analysis unit 203 first performs histogram analysis on the table range detected in process 614, using an image processing library or the like (process 8210). It counts the number of colored pixels along the vertical and horizontal directions of the table image to create histograms. The valleys of the histograms are then taken as cell boundaries, and each cell of the table is recognized.

FIGS. 9A and 9B show an example of recognizing cells by histogram analysis in a table that contains only horizontal ruled lines. Table 9110 in FIG. 9A is an example of a table to be analyzed, and table 9120 in FIG. 9B shows the result of histogram analysis on table 9110. In table 9120, each region enclosed by a dotted line is recognized as a cell, and the coordinate information of each cell, for example the horizontal and vertical coordinates of its upper-left point and its width and height, is extracted using an image processing library or the like (process 8211).
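The histogram analysis of process 8210 amounts to a projection profile. The sketch below assumes a binarized table image given as rows of 0 and 1, and reports the runs where the ink count drops to a threshold or below, which would serve as candidate cell boundaries. The function names and the `threshold` parameter are illustrative assumptions.

```python
def projection_profile(binary_rows, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in binary_rows]
    return [sum(column) for column in zip(*binary_rows)]


def valley_spans(profile, threshold=0):
    """Return (start, end) index ranges, end exclusive, where profile <= threshold."""
    spans, start = [], None
    for index, value in enumerate(profile):
        if value <= threshold and start is None:
            start = index
        elif value > threshold and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans
```

Valleys in the column profile separate cells horizontally, and valleys in the row profile separate them vertically.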

Next, the cells belonging to the same column and the same row are recognized (process 8212). FIG. 8C shows the details of process 8212, which recognizes cells belonging to the same column and the same row in a table that contains only horizontal ruled lines. First, for each cell, the coordinate distance to every other cell is calculated (process 8310). Then the cells closest to each other in the vertical direction are assigned to the same column, and the cells closest to each other in the horizontal direction are assigned to the same row (process 8311). With this approach alone, however, rows and columns may be misrecognized.

Table 9210 in FIG. 9C shows the case where the columns of table 9110 are recognized correctly. Columns 9213 and 9214 connect to merged cell 9217 and are recognized as the same column 9211. Similarly, columns 9215 and 9216 connect to merged cell 9218 and are recognized as the same column 9212.

Table 9220 in FIG. 9D shows the case where the columns of table 9110 are misrecognized. Column 9224 should connect to merged cell 9227 and, like column 9223, belong to column 9221. Instead, column 9224 has been connected through merged cell 9228 to columns 9225 and 9226 and recognized as part of the same column 9222. This happens because merged cell 9228 was judged to be the cell closest in coordinate distance to cell 9229.

To correct such misrecognition of columns and rows, in this embodiment the data storage unit 204 performs a process of inferring the cell layout from the body text of the PDF document. First, the text in each cell is extracted from the text file with coordinate information (process 8312). To make this extraction possible, the image analysis unit 203 converts the coordinate information of each cell from coordinates in the image file to coordinates in the text file with coordinate information. The conversion uses the following formulas.

Horizontal coordinate of cell (text file) =
Horizontal coordinate of cell (image file) × (horizontal size of document page (text file) ÷ horizontal size of document page (image file))

Vertical coordinate of cell (text file) =
Vertical coordinate of cell (image file) × (vertical size of document page (text file) ÷ vertical size of document page (image file))

Cell width (text file) =
Cell width (image file) × (horizontal size of document page (text file) ÷ horizontal size of document page (image file))

Cell height (text file) =
Cell height (image file) × (vertical size of document page (text file) ÷ vertical size of document page (image file))

The data storage unit 204 compares this coordinate information with the coordinate information in the text file with coordinate information, and extracts the text lying within the coordinate range of a cell as the text in that cell. Taking file 400 as an example, it refers to the coordinate information of each text marked with a TOKEN tag and obtains the center coordinates of that text with the following formula.

Center coordinates (X, Y) = (horizontal coordinate + (width / 2), vertical coordinate + (height / 2))

If the center coordinates fall within the coordinate range of a cell, the text is judged to be text in that cell.
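The cell conversion and the center-point test of process 8312 can be sketched together as follows. The tuple layouts and function names are assumptions for illustration. In this sketch the cell height is scaled by the vertical page ratio, in the same way as the vertical coordinate.

```python
def image_cell_to_text(cell, image_page_size, text_page_size):
    """Convert a cell (x, y, width, height) from image to text-file coordinates."""
    x, y, w, h = cell
    scale_x = text_page_size[0] / image_page_size[0]
    scale_y = text_page_size[1] / image_page_size[1]
    return (x * scale_x, y * scale_y, w * scale_x, h * scale_y)


def tokens_in_cell(cell, tokens):
    """Return the texts of tokens whose center point lies inside the cell.

    tokens is a list of (text, (x, y, width, height)) pairs in text-file coordinates.
    """
    cell_x, cell_y, cell_w, cell_h = cell
    found = []
    for text, (x, y, w, h) in tokens:
        center_x, center_y = x + w / 2, y + h / 2
        if cell_x <= center_x <= cell_x + cell_w and cell_y <= center_y <= cell_y + cell_h:
            found.append(text)
    return found
```

Using the center point rather than the full bounding box means a token that slightly overlaps a cell border is still assigned to exactly one side of that border, unless its center lies on the border itself.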

Next, the text data analysis unit 202 extracts and refers to the description of the table from the text file with coordinate information (process 8313) and corrects the rows and columns (process 8314). First, all text marked with TOKEN tags is extracted from the text file, and sentences are formed based on the coordinate information of each text. This is done by recognizing text that is contiguous in coordinates as one continuous sentence. The sentences formed in this way are checked for sentence patterns that describe the structure of a table, and any sentence containing such a pattern is judged to be a description of the table and extracted.

FIG. 9E shows an example of a sentence pattern 9300 that describes the structure of a table. Sentence pattern 9300 is stored in advance in the disk device 104 or the like. The figure shows a single sentence pattern, but multiple patterns may be prepared in advance, and a sentence containing any of them may be judged to be a description of the table.

Next, the description of the table is analyzed. Description 9310 in FIG. 9E is an example in which the description of table 9110 was extracted using sentence pattern 9300. Sentence pattern 9300 indicates that a cell whose text contains "××" and a cell whose text contains "△△" are connected by a merged cell whose text contains "○○" and form the same column. Referring to description 9310 therefore shows that the cell whose text contains "Compound" and the cell whose text contains "Yield" each connect to the merged cell whose text contains "Step".

Because the text corresponding to each cell has been extracted in process 8312, the connection relationships between cells can be inferred from description 9310 by referring to the text of each extracted cell. For example, suppose that row and column recognition based on the coordinate distance between cells in the image analysis unit produces a misrecognition in which merged cell 9228 and cell 9229 belong to the same column 9222, as shown in table 9220 of FIG. 9D. Matching description 9310 against the cell text reveals that this result contradicts the cell connection relationships derived from description 9310. The recognition can therefore be corrected to the right one, in which merged cell 9217 and cell 9119 form the same column 9211, as shown in table 9210 of FIG. 9C.

When the table layout recognition process is finished, the coordinate information of the table is output (process 616).

Coordinate information 1000 in FIG. 10 shows an example of the output coordinate information for a table whose layout was recognized by the table layout recognition process 615. A unique ID is assigned to each cell of the table, and the coordinate information of each cell is output (in the example of coordinate information 1000, the horizontal and vertical coordinates of the cell's upper-left corner, the cell width, and the cell height). These are coordinates in the image file. For each cell, the IDs of the adjacent cells recognized by the table layout recognition process are also output. The adjacent cells are the cells closest in horizontal coordinate distance among the cells recognized as being in the same row, and the cells closest in vertical coordinate distance among the cells recognized as being in the same column. Adjacent cells in the same row exist on the left and right, and adjacent cells in the same column exist above and below. A merged cell may be adjacent to multiple cells on its left, right, top, or bottom. The corresponding caption from table 510 is also attached to the coordinate information 1000 of the extracted table. Coordinate information 1000 is sent from the image analysis unit 203 to the data storage unit 204.

Once processes 615 and 616 have been executed for all tables detected in process 614, the table layout recognition process ends (process 617: Yes). Otherwise, processes 615 and 616 continue (process 617: No).

Once processes 610 to 617 have been executed for all pages on which a table caption was detected, the table detection and layout recognition process ends (process 618: Yes). Otherwise, processes 610 to 617 continue (process 618: No).

FIG. 11 shows the process in which the data extraction program 200 according to Embodiment 1 extracts and outputs the text in a table as part of the series of processes for extracting table data. This process is performed once for each table detected in process 614.

Based on the table coordinate information 1000 received from the image analysis unit 203, the data storage unit 204 extracts the text in each cell from the text file with coordinate information (process 1110). The text is extracted in the same way as in process 8312.

Next, the extracted text is shaped into data that reproduces the structure of the original table and is output (process 1111). The cells adjacent to each cell are identified from the table coordinate information 1000 received from the image analysis unit 203, the structure of the original table is reproduced, and the data is output. Possible output formats include an HTML table and a DB table structure.

HTML file 1210 in FIG. 12 represents table 1200 as an HTML table. As shown in HTML file 1210, each cell is given a th tag or a td tag. The cells of the same row are enclosed in a tr tag, and the whole is enclosed in a table tag, which expresses the structure of the table. A horizontally merged cell is expressed by adding a colspan attribute, as shown in HTML file 1210, and a cell merged in the column direction is expressed by adding a rowspan attribute.

The extracted table data can be output as an HTML table by giving the text of each cell extracted in process 1110 a th or td tag, referring to the adjacent cells of each cell recorded in the table coordinate information 1000, and enclosing the cells of the same row in a tr tag. A merged cell that has multiple adjacent cells in the column direction (above or below it) is given a colspan attribute, and a merged cell that has multiple adjacent cells in the row direction (to its left or right) is given a rowspan attribute.
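The HTML output step can be sketched as below. The input format, a list of rows in which each cell is a dict with `text` and optional `header`, `colspan` and `rowspan` keys, is a hypothetical intermediate representation chosen for illustration. Deriving it from the adjacency IDs in coordinate information 1000 is not shown.

```python
from html import escape


def render_table(rows):
    """Render rows of cell dicts as an HTML table string."""
    lines = ["<table>"]
    for row in rows:
        lines.append("  <tr>")
        for cell in row:
            tag = "th" if cell.get("header") else "td"
            # Emit span attributes only for merged cells.
            attrs = "".join(
                f' {name}="{cell[name]}"'
                for name in ("colspan", "rowspan")
                if cell.get(name, 1) > 1
            )
            lines.append(f"    <{tag}{attrs}>{escape(cell['text'])}</{tag}>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


html_text = render_table([
    [{"text": "Step", "header": True, "colspan": 2}],
    [{"text": "Compound"}, {"text": "Yield"}],
])
```

Here the merged "Step" header spans the two cells beneath it, which corresponds to the merged-cell relationship discussed for table 9110.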

The output format need not be an HTML table; as already mentioned, it may be a DB table structure or the like. On output, the table caption is attached as metadata. When outputting as an HTML table, one possible method is to attach the caption as attribute information of the table tag. For a DB table, one possible method is to use the caption as the table name. The output data is saved on the disk device 104 in the form of a file, a DB, or the like.

FIG. 13 shows the process in which the data extraction program 200 according to Embodiment 1 extracts and outputs information indicating the nature of the table data as part of the series of processes for extracting table data.

First, the text data analysis unit 202 extracts all text marked with TOKEN tags from the text file with coordinate information and forms sentences based on the coordinate information of each text. This is done by recognizing text that is contiguous in coordinates as one continuous sentence. The sentences formed in this way are then checked for sentence patterns that describe the nature of a table (process 1310).

If such a pattern is found, the sentence containing it is extracted as information indicating the nature of the table and passed to the data storage unit. The data storage unit associates the received information (text) with the table data output in FIG. 11 and saves it on the disk device 104 in the form of a file, a DB, or the like (process 1311).

FIG. 14 is a table showing sentence pattern 1400, an example of a sentence pattern for extracting body text that describes the nature of a table.

By matching the table caption, which is the metadata of the table data output in process 1111, against the information extracted in process 1311, an application that performs data analysis can learn the nature of the table data output in process 1111 at analysis time.

Embodiment 1 showed an example of extracting tables from a digital document. The objects to be extracted are not limited to tables and may be graphs, drawings, and the like. Embodiment 2 describes an example of extracting graphs from a digital document.

An overview of the computer system of the data extraction device according to Embodiment 2 follows. The hardware configuration of the computer system according to Embodiment 2 may be the same as in FIG. 1. Because Embodiment 2 is an example of extracting graph data from PDF files, the data extraction device 100 executes a data extraction program for graph extraction and extracts graph data from the PDF files.

The functional block configuration of the data extraction program 200 stored in the memory 103 for graph extraction may be the same as in FIG. 2. However, the specific processing executed by functional blocks 201 to 204 differs in part from Embodiment 1. It is described below with reference to FIG. 2.

The conversion processing unit 201 reads a PDF file stored in the disk device 104 and converts it into a text file with coordinate information. It also reads the PDF file stored in the disk device 104 and converts it into an image file. This processing may be the same as in Embodiment 1.

The text data analysis unit 202 analyzes the text file with coordinate information and extracts graph captions. It also analyzes the text file with coordinate information to extract body text that indicates the nature of a graph.

The process of extracting graph captions follows the same flow as FIG. 3 of Embodiment 1, except that the keyword search process 311 searches the text file with coordinate information for text containing keywords likely to appear in a graph caption, such as "グラフ" or "Graph". In addition, process 313, which matches text against sentence patterns, uses sentence patterns for extracting graphs instead of the table-oriented sentence patterns shown in FIG. 5A. Specific examples are sentence patterns that describe the content of a "グラフ" or "Graph".

If the text around the extracted keyword does not form a sentence matching a sentence pattern, the keyword-containing text is judged to be a graph caption, and the same kind of information as in FIG. 5B (with the table-related data replaced by graph-related data) is sent to the image analysis unit 203 (process 315). The series of processes 312 to 315 is repeated once for each text obtained by the keyword search (process 316).

The process of extracting body text that indicates the nature of a graph is also the same as in FIG. 13, except that sentence patterns for extracting body text indicating the nature of a graph are used instead of the sentence patterns shown in FIG. 14 for tables.

FIG. 15 is a table showing sentence pattern 2300, an example of a sentence pattern for extracting body text that indicates the nature of a graph.

The image analysis unit 203 analyzes the image file based on the graph caption information received from the text data analysis unit 202 and detects graphs.

The data storage unit 204 outputs the graph extracted by the image analysis unit 203 in the form of a file or the like.

FIG. 16 shows process 2000, in which the data extraction program 200 detects and outputs graphs as part of the series of processes for extracting graph data. The graph detection and output process 2000 is executed for each page on which a graph caption was detected in process 300.

First, the image analysis unit 203 refers to the table 510 created by the text data analysis unit 202 in process 300 (with table captions replaced by graph captions) and checks the page number of each page containing a graph caption (process 2010). It then converts that document page of the PDF file into an image file (process 2011). The conversion to an image file is the same as in the first embodiment.

Next, the coordinate information of the graph caption extracted by the text data analysis unit 202 in process 300 is adjusted to the coordinate scale of the image file (process 2012). Based on the size of the document page in the image file, the coordinate information of the graph caption is converted from coordinates in the text file with coordinate information to coordinates in the image file. This conversion can be performed in the same way as for a table caption by replacing "table caption" with "graph caption" in the conversion formula of the first embodiment.
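
As an illustration of the coordinate adjustment in process 2012, the following sketch assumes that the conversion is a simple proportional scaling between the page size in the text file with coordinate information and the page size in the image file, with both coordinate systems sharing the same origin. The conversion formula of the first embodiment is not reproduced in this passage, so the function name `scale_caption_box` and the `(x, y, width, height)` layout are assumptions.

```python
def scale_caption_box(box, text_page_size, image_page_size):
    """Map a caption box from text-file coordinates to image coordinates.

    box:             (x, y, width, height) in text-file coordinates (assumed layout)
    text_page_size:  (page_width, page_height) in the text file
    image_page_size: (page_width, page_height) of the rendered image in pixels
    """
    x, y, w, h = box
    # Assumed proportional scaling; one factor per axis.
    sx = image_page_size[0] / text_page_size[0]
    sy = image_page_size[1] / text_page_size[1]
    return (x * sx, y * sy, w * sx, h * sy)
```

For example, a caption at (100, 200) on a 600 x 800 page maps to (200, 400) when the page is rendered as a 1200 x 1600 image.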

Next, the image analysis unit 203 limits the image range in which graph detection is performed (process 2013). The image range is limited based on the coordinates of the graph caption. The method by which the image analysis unit 203 limits the image range is the same as in the first embodiment.
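
One possible way of limiting the image range from the caption coordinates is sketched below. The rule used in the first embodiment is not restated in this passage, so the sketch rests on assumptions: the origin is at the top-left of the image, the caption's `y` is its top edge, and the graph is searched for in a full-width band directly above the caption (or below it when `above=False`). The function name `limit_search_region` and the `max_height` parameter are hypothetical.

```python
def limit_search_region(caption_box, image_size, max_height=None, above=True):
    """Return the (left, top, right, bottom) band in which to search for a graph.

    caption_box: (x, y, width, height) in image coordinates, y = top edge
    image_size:  (width, height) of the page image in pixels
    max_height:  optional cap on the height of the search band
    above:       search above the caption if True, below it otherwise
    """
    x, y, w, h = caption_box
    img_w, img_h = image_size
    if above:
        top = 0 if max_height is None else max(0, y - max_height)
        bottom = y
    else:
        top = y + h
        bottom = img_h if max_height is None else min(img_h, y + h + max_height)
    # The band spans the full page width in this sketch.
    return (0, top, img_w, bottom)
```

Restricting detection to such a band reduces the chance of picking up unrelated rectangles or lines elsewhere on the page.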

Next, the image analysis unit 203 performs graph detection within the image range limited in process 2013 (process 2014). Any method may be used for graph detection. For example, rectangle detection may be performed using an image processing library or the like, and a detected rectangle treated as a graph. Alternatively, straight-line detection may be performed using an image processing library or the like, two orthogonal lines recognized as the vertical and horizontal axes of the graph, and the area bounded by the two lines detected as the graph. Once a graph is detected, the image analysis unit 203 sends the image of the detected graph to the data storage unit 204.
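
The axis-based variant of graph detection can be illustrated without an image processing library by the following simplified sketch. It is a stand-in for library-based straight-line detection, not the method of the embodiment, and it assumes a binarized image given as a list of rows (1 = dark pixel) in which the two axes are the longest horizontal and vertical runs of dark pixels. The names `longest_run`, `detect_graph_by_axes`, and `min_len` are hypothetical.

```python
def longest_run(cells):
    """Return (start, length) of the longest run of truthy values."""
    best_start, best_len, start = 0, 0, None
    for i, v in enumerate(list(cells) + [0]):  # sentinel closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


def detect_graph_by_axes(binary, min_len=3):
    """Return (left, top, right, bottom) of the area bounded by two axes, or None.

    binary: list of equally long rows, 1 = dark pixel, 0 = background.
    """
    if not binary or not binary[0]:
        return None
    # Horizontal axis candidate: the row with the longest horizontal run.
    h_runs = [(longest_run(row), y) for y, row in enumerate(binary)]
    (x_start, x_len), axis_y = max(h_runs, key=lambda r: r[0][1])
    # Vertical axis candidate: the column with the longest vertical run.
    v_runs = [(longest_run(col), x) for x, col in enumerate(zip(*binary))]
    (y_start, y_len), axis_x = max(v_runs, key=lambda r: r[0][1])
    if x_len < min_len or y_len < min_len:
        return None
    # The two lines must intersect to be treated as a pair of axes.
    crosses = (x_start <= axis_x < x_start + x_len
               and y_start <= axis_y < y_start + y_len)
    if not crosses:
        return None
    return (x_start, y_start, x_start + x_len - 1, y_start + y_len - 1)
```

A practical implementation would more likely rely on a library routine for line or contour detection and would need to tolerate noise, gaps in the axes, and multiple graphs in one region.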

On receiving the graph image, the data storage unit 204 saves it to the disk device 104 in the form of a file or the like. On output, the graph caption is attached as metadata. Any method may be used to attach the metadata; one option is to use the graph caption as the file name.
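
Using the caption as the file name requires removing characters that file systems do not accept. The following sketch shows one way this might be done; the function name `caption_to_filename`, the `.png` extension, the length limit, and the fallback name `graph` are assumptions made for illustration.

```python
import re


def caption_to_filename(caption, ext=".png", max_len=80):
    """Turn a graph caption into a file name (hypothetical helper)."""
    # Replace characters that are invalid or awkward in file names,
    # together with whitespace, by a single underscore.
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", caption.strip())
    # Trim leading/trailing dots and underscores, cap the length,
    # and fall back to a default name if nothing is left.
    name = name.strip("._")[:max_len] or "graph"
    return name + ext
```

Because two graphs may share the same caption text, a practical implementation would also need to handle file name collisions, for example by appending the page number.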

In the second embodiment as well, as in FIG. 13 of the first embodiment, the data extraction program 200 can extract and output information indicating the properties of the graph data as part of the series of processes for extracting graph data.

First, the text data analysis unit 202 extracts all text tagged with a TOKEN tag in the text file with coordinate information and forms sentences based on the coordinate information of each text. This is done by recognizing text that is contiguous in coordinates as a continuous passage. The sentences formed in this way are checked for whether they contain a sentence pattern describing the properties of a graph (see FIG. 15) (process 1310). If they do, the single sentence containing that pattern is extracted as information indicating the properties of the graph and passed to the data storage unit 204. The data storage unit 204 saves the received information (text) to the disk device 104 in the form of a file, a DB, or the like (process 1311).
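
The sentence formation and pattern check described above might be sketched as follows. The token layout `(text, x, y)`, the line tolerance, the joining of tokens with spaces, and the regular expressions in `GRAPH_PROPERTY_PATTERNS` are all assumptions; the actual sentence patterns are those of FIG. 15, which are not reproduced here.

```python
import re

# Hypothetical patterns for sentences that describe the properties of a graph.
GRAPH_PROPERTY_PATTERNS = [
    re.compile(r"(グラフ|Graph)\s*\d+\s+(shows|plots|compares)\b"),
    re.compile(r"(グラフ|Graph)\s*\d+\s*は.+を示す"),
]


def form_sentences(tokens, line_tolerance=2.0):
    """Rebuild sentences from (text, x, y) tokens, y increasing down the page."""
    lines = []
    # Group tokens into lines: tokens whose y is close to the line's y belong together.
    for text, x, y in sorted(tokens, key=lambda t: t[2]):
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["tokens"].append((x, text))
        else:
            lines.append({"y": y, "tokens": [(x, text)]})
    # Within each line, order tokens left to right, then join everything.
    words = [text for line in lines for _, text in sorted(line["tokens"])]
    joined = " ".join(words)
    # Naive sentence split after "." or "。" (this also splits decimals such as 3.5).
    return [s.strip() for s in re.split(r"(?<=[.。])\s*", joined) if s.strip()]


def extract_graph_properties(tokens):
    """Return the sentences that match a graph-property pattern (process 1310)."""
    return [s for s in form_sentences(tokens)
            if any(p.search(s) for p in GRAPH_PROPERTY_PATTERNS)]
```

The returned sentences correspond to the text that would be passed to the data storage unit 204 for saving.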

By collating the graph caption, which is the metadata of the graph data output in process 2015, with the information extracted in process 2211, an application performing data analysis can learn the properties of the graph data output in process 2015 at analysis time.
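
The specification does not state how the caption is collated with the extracted information. One conceivable approach, shown here only as a sketch, is to compare the graph number appearing in the caption with the graph numbers mentioned in the extracted sentences. The names `LABEL`, `graph_number`, and `properties_for_caption` are hypothetical.

```python
import re

# Matches a graph label such as "Graph 1" or "グラフ2" and captures its number.
LABEL = re.compile(r"(?:グラフ|Graph)\s*(\d+)")


def graph_number(text):
    """Return the first graph number found in the text, or None."""
    m = LABEL.search(text)
    return m.group(1) if m else None


def properties_for_caption(caption, property_sentences):
    """Return the property sentences that mention the same graph number as the caption."""
    number = graph_number(caption)
    if number is None:
        return []
    return [s for s in property_sentences if number in LABEL.findall(s)]
```

With this approach, a file saved under the caption "Graph 1: Annual sales" can be associated with an extracted sentence such as "Graph 1 shows annual sales."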

According to the embodiments described in detail above, using both image information and text with coordinate information makes it possible to accurately extract table data and graph data from, for example, a PDF file. It also becomes possible to extract, together with the table data and graph data, information indicating the properties of the tables and graphs. Together, these improve the quality of data analysis.

The data extraction method described in this specification is applicable not only to PDF files but to digital documents in general that are composed of objects in document pages and coordinate information of those objects. The techniques described in the embodiments improve the quality of data analysis for such digital documents.

Data extraction device 100, bus 101, CPU 102, memory 103, disk device 104, data extraction program 200, conversion processing unit 201, text data analysis unit 202, image analysis unit 203, data storage unit 204

Claims

A data extraction system for digital documents, comprising:
a conversion processing unit that converts a digital document, which contains objects on document pages and coordinate information of the objects, into image data and text data with coordinate information;
a text data analysis unit that extracts, from the text data, a caption related to an object to be extracted; and
an image analysis unit that detects the object to be extracted from the image data based on information on the extracted caption.

The data extraction system for digital documents according to claim 1, wherein
the text data analysis unit searches for text that is a caption candidate by collating the text data with a keyword, and identifies whether the caption candidate is a caption or part of the body text by collating the caption candidate with a preset sentence pattern.

The data extraction system for digital documents according to claim 1, wherein
the conversion processing unit converts into image data the page of the digital document from which the caption was extracted, and
the image analysis unit performs object detection on the image data of the page from which the caption was extracted.

The data extraction system for digital documents according to claim 1, wherein
the conversion processing unit converts into image data the page of the digital document from which the caption was extracted, and
the image analysis unit limits, based on the coordinate information of the caption, the image range in which image analysis is performed on the image data of the page from which the caption was extracted, and detects the object within that range.

The data extraction system for digital documents according to claim 1, wherein
the object is table data or graph data.

The data extraction system for digital documents according to claim 1, wherein
the object to be extracted is a table, and
the system further comprises a data storage unit that collates the coordinate information of the detected table with the text data with coordinate information and extracts the text in the table.

The data extraction system for digital documents according to claim 6, wherein
the image analysis unit extracts each cell of the detected table, and
the data storage unit collates the coordinate information of each cell with the text data with coordinate information and extracts the text in each cell.

The data extraction system for digital documents according to claim 7, wherein
the text data analysis unit extracts text explaining the structure of the table by collating the text data with a preset sentence pattern, and
the data storage unit recognizes the layout of the table based on the text in the cells and the text explaining the structure of the table.

The data extraction system for digital documents according to claim 1, wherein
the text data analysis unit extracts information indicating the properties of the object from a text file with coordinate information, based on a predefined sentence pattern.

The data extraction system for digital documents according to claim 1, wherein
the information on the caption is at least one of the page on which the caption was detected and the coordinates at which the caption was detected.

A method for extracting data from a digital document, in which a desired object is extracted from the digital document, the method comprising:
a text data generation process of generating text data with coordinate information from the digital document;
a caption detection process of detecting a caption of the desired object from the text data by keyword search; and
an object extraction process of extracting the desired object from image data of the page of the digital document on which the caption was detected.

The method for extracting data from a digital document according to claim 11, wherein
in the object extraction process, the analysis range for extracting the desired object in the image data is limited based on the coordinate information of the caption.

The method for extracting data from a digital document according to claim 11, further comprising
an image data conversion process of converting into image data only those pages of the digital document on which the caption was detected.

The method for extracting data from a digital document according to claim 11, wherein,
when a table is extracted as the object, the object extraction process includes:
a cell detection process of detecting the positions of cells from the image data; and
a text extraction process of extracting text corresponding to the cell positions from the text data.

The method for extracting data from a digital document according to claim 14, wherein
the object extraction process includes:
a layout estimation process of estimating the layout of the table based on the detected positions of the cells;
a table description reference process of extracting a description of the table from the text data and referring to it; and
a layout modification process of modifying the layout by collating the description of the table with the text corresponding to the cell positions.