JP6719862B2

JP6719862B2 - PDF data retrieval system and program for PDF data retrieval system

Info

Publication number: JP6719862B2
Application number: JP2015057056A
Authority: JP
Inventors: 和人若林
Original assignee: Shimadzu Corp
Current assignee: Shimadzu Corp
Priority date: 2015-03-20
Filing date: 2015-03-20
Publication date: 2020-07-08
Anticipated expiration: 2035-03-20
Also published as: JP2016177524A

Description

The present invention relates to a system for correctly extracting data from a PDF (Portable Document Format) file, which is a document file. In particular, it relates to a system for correctly extracting data from a PDF file, such as an analysis report, that contains various data obtained by the analyzers used in instrumental analysis (for example, chromatographs, spectrophotometers, and mass spectrometers).

Nowadays, control of analyzers and processing of the analysis data they produce are generally performed by analysis control software running on a general-purpose computer.
Much of this analysis control software has a report-creation function (see, for example, Non-Patent Document 1), that is, a function that arranges multiple pieces of data related to a given analysis side by side and finally outputs them as an analysis report.
Analysis reports are usually output in PDF format, because PDF files are hard to tamper with and easy to view.

Such an analysis report often contains not only the conclusion reached as the analysis result but also various data identifying the analyzed sample and the data supporting that conclusion. It may also be required to record the analysis conditions under which the data were obtained. In other words, reproducibility must be guaranteed: analyzing the same sample under the same analysis conditions must yield the same data.

"LabSolutions Multi-Data Report", Shimadzu Corporation, Internet <http://www.an.shimadzu.co.jp/data-net/labsolutions/dbcs/mdr.htm>, [retrieved March 12, 2015]

When the data identifying the sample, the analysis-condition data, the analysis-result data, and so on are expected to be used again later, they are generally saved in the format of spreadsheet software such as Excel (registered trademark). When the analyzer and the analysis data analyzer come from the same manufacturer, their data storage formats match or are known, so the data can be used easily. When the manufacturers differ, however, data in such a spreadsheet format generally cannot be used as is. In some cases such data are not obtained (not output) in the first place.

In contrast, each analyzer or analysis data analyzer issues a report of its analysis results and, as noted above, most of these reports are in PDF format. The analysis result report can therefore be used even when the manufacturers differ. Because the report often includes the analysis-condition data and the analysis-result data, using it makes it possible to perform a reproduction analysis, an analysis with some conditions changed, and so on.

Analysis-condition data and analysis-result data are generally presented as tables in the PDF file. Software called middleware, such as Acrobat (registered trademark), which creates PDF files from word-processing or spreadsheet software, also has a function for extracting data from PDF files. Data can therefore be extracted from a table in a PDF file by using the middleware.

However, currently available middleware such as Acrobat has the following problems. For example, when the table portion of a PDF file containing a table, as shown in FIG. 1(a), is extracted as text data, any blank cell in the table causes the data that follow it to be shifted forward, as shown in FIG. 1(b). In addition, the text before the table and the text after it are extracted in one continuous run with the data inside the table, so the table alone cannot be reconstructed. Furthermore, middleware generally cannot obtain the ruled-line data that make up a table in a PDF file, which makes it difficult to reproduce the table accurately. Consequently, an attempt to repeat an analysis using a PDF file that records the results of an earlier analysis cannot be carried out automatically: the person performing the later analysis has to reconstruct the condition data of the earlier analysis, or has to re-enter the earlier result data by hand before analyzing them further.

The present invention has been made to solve the above problems, and its main object is to provide a system and a program capable of correctly extracting data from a table contained in a PDF file.

A PDF data extraction system according to the present invention, made to solve the above problems, is a PDF data extraction system that uses PDF middleware to acquire the character strings constituting a table contained in a PDF file and outputs those character strings in a format from which the table can be reproduced, the system comprising:
a) an extraction unit setting unit that sets the extraction of character strings from the PDF file by the PDF middleware to line units based on character string attributes and a horizontal threshold, which is a threshold on the amount of horizontal movement between character strings;
b) a horizontal threshold setting unit that sets the horizontal threshold to a predetermined value;
c) a character string data acquisition unit that extracts character strings line by line from a specified PDF file in accordance with the settings of the extraction unit setting unit and the horizontal threshold setting unit;
d) a table forming unit that, from the coordinate values of each character string extracted line by line by the character string data acquisition unit, determines the position of the character string in the row direction from its x coordinate value and its position in the column direction from its y coordinate value, and forms a reconstructed table by arranging the character strings in tabular form; and
e) an output unit that outputs the reconstructed table in a predetermined data format.

In the present invention, a "character string" means one or more characters, digits, symbols, and the like arranged in sequence, and may include spaces. When the middleware extracts character strings from a PDF file, the unit of extraction can be an object, a line, a block, and so on. In the PDF data extraction system according to the present invention, the extraction unit setting unit sets this unit to the line. When extracting by line, the middleware judges a run of consecutive characters, digits, symbols, and the like to be one line on the basis of criteria such as use of the same font or the same font size; if, however, the amount of horizontal movement between them is larger than a predetermined horizontal threshold, they are treated as separate lines. Therefore, if the horizontal threshold setting unit sets this threshold appropriately, then when the specified PDF file contains a table, the data (characters, digits, symbols, and the like) in each cell of each row of the table are extracted as separate lines.
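The effect of the horizontal threshold can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the middleware's actual behavior: the Glyph type, its fields, and the function name split_by_horizontal_threshold are hypothetical, glyphs are assumed to share one baseline and one font, and a gap equal to the threshold is treated as a split (the text uses both "larger than" and "equal to or greater than").

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: one character with its left edge and width."""
    text: str
    x: float
    width: float


def split_by_horizontal_threshold(glyphs, threshold):
    """Split the glyphs of one visual line into separate strings wherever the
    horizontal gap between neighbouring glyphs reaches the threshold.

    Returns a list of (string, x_of_first_glyph) pairs.
    """
    segments = []
    current = []
    prev_right = None
    for g in sorted(glyphs, key=lambda g: g.x):
        # A gap of at least `threshold` starts a new extracted "line".
        if prev_right is not None and g.x - prev_right >= threshold:
            segments.append(current)
            current = []
        current.append(g)
        prev_right = g.x + g.width
    if current:
        segments.append(current)
    return [("".join(g.text for g in seg), seg[0].x) for seg in segments]


# Two table cells, "311" and "353", drawn on the same baseline.
row = [Glyph("3", 10, 1), Glyph("1", 11, 1), Glyph("1", 12, 1),
       Glyph("3", 30, 1), Glyph("5", 31, 1), Glyph("3", 32, 1)]
print(split_by_horizontal_threshold(row, threshold=2))
# -> [('311', 10), ('353', 30)]
```

With a threshold larger than the gap between the cells, the same glyphs come back as the single string "311353", which is the collapsed form the threshold setting is meant to avoid.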

When the middleware extracts character strings from a PDF file, it attaches coordinate values to each character string. From the coordinate values attached to the character string of each extracted line, it is therefore possible to determine whether that line is a character string that was split off by a horizontal movement amount equal to or greater than the horizontal threshold. The table forming unit inspects the coordinate values of the extracted lines in order and, when several consecutive lines consist of such horizontally separated character strings, arranges them side by side to reconstruct one row of the original table (hereinafter called an "original row"). The position (column position) of each cell can also be reconstructed from the x coordinate value attached to each character string. When several such original rows occur consecutively, the table can likewise be reconstructed, with the x coordinate value locating the cell of each column within a row and the y coordinate value locating the cells of each row. When the consecutive rows include an empty row, the rows before and after it may be regarded as continuous so that together they form a single table, or the table may be split at the empty row into separate tables. Which behavior applies can be set in advance.
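Recombining horizontally separated strings into original rows can be sketched as follows. This Python fragment is illustrative only: the (text, x, y) tuple layout and the name group_original_rows are assumptions, y is assumed to increase down the page, and strings are grouped on exactly equal y values, whereas real PDF coordinates would normally need a tolerance.

```python
from collections import defaultdict


def group_original_rows(segments):
    """Group (text, x, y) strings that share a y coordinate and order each
    group left to right, giving one (y, [texts]) entry per candidate row."""
    by_y = defaultdict(list)
    for text, x, y in segments:
        by_y[y].append((x, text))
    # Sorting the (x, text) pairs orders the cells by x within each row.
    return [(y, [text for _, text in sorted(by_y[y])]) for y in sorted(by_y)]


# Strings may arrive in any order; the coordinates restore the layout.
segments = [("353", 30, 7), ("Sample 1", 2, 5), ("311", 10, 7), ("Sample 2", 25, 5)]
for y, cells in group_original_rows(segments):
    print(y, cells)
```

A group holding two or more strings is a candidate original row; a group holding a single string may be ordinary text or a table row with only one filled cell, so it cannot be classified from this grouping alone.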

However, when the table in the target PDF has a cell containing multiple lines of text, the step between the y coordinates of those lines is smaller (one half, or one over an integer, of the normal step). Using this together with the x coordinate values, the table forming unit reconstructs a cell containing such multiple lines.

In this way, the table forming unit can identify a table contained in the target PDF file with high probability. Even if the table contains a cell with no character string, that cell can be recognized as a blank cell, so the table can be reconstructed correctly.

The output unit outputs the table formed by the table forming unit in a predetermined format, for example CSV, tab-delimited, XML, or Excel format.
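As a minimal sketch of such an output step, the following Python function serializes a reconstructed table with the standard csv module. The function name table_to_delimited and the representation of the table as a list of lists, with None marking a blank cell, are assumptions made for illustration.

```python
import csv
import io


def table_to_delimited(table, delimiter=","):
    """Serialize a list-of-lists table to delimited text.

    Blank cells (None) become empty fields, so the column positions of the
    remaining cells are preserved in the output.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


table = [["Sample 1", "Sample 2"], ["311", None], ["a,b", "x"]]
print(table_to_delimited(table))        # CSV
print(table_to_delimited(table, "\t"))  # tab-delimited
```

The csv module quotes a field that itself contains the delimiter, which addresses the case of a character string containing a comma.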

As a preferred embodiment of the present invention, the system may further include a delimiter input unit that lets the user enter the text immediately before or immediately after the table, and the table forming unit may determine the start or end of the table based on the text so entered.

The PDF data extraction system according to the present invention can detect a table in a PDF file that contains one and acquire the character strings in it. Moreover, even if the table has blank cells containing no character string, the table can be reconstructed in a form that reflects those blank cells.

Therefore, even when an analysis report obtained through control of an analyzer or as the result of an analysis is output in PDF format, the tables in that report can be reconstructed very simply and accurately, which greatly reduces the user's effort when a reanalysis or additional analysis must be performed under conditions identical or similar to those in the report. In addition, the PDF data extraction system according to the present invention greatly reduces manual entry by the user and, as a result, reduces errors and mistakes.

FIG. 1: An example (a) of a PDF file as displayed (output), and an example (b) of the result obtained when the text data of the table in that PDF file is copied and pasted (prior art).
FIG. 2: A schematic configuration diagram of one embodiment of the PDF data extraction system according to the present invention.
FIG. 3: A flowchart showing an example of the operation of the PDF data extraction system.
FIG. 4: An example of a file output as a result of the operation of the PDF data extraction system.
FIG. 5: Another example of a screen during operation of the PDF data extraction system.

An example embodiment of the PDF data extraction system according to the present invention is described in detail below with reference to the drawings.

FIG. 2 shows one embodiment of the PDF data extraction system 1 according to the present invention. The PDF data extraction system 1 is in substance a computer in which a CPU (Central Processing Unit) 10, which is the central processing unit, is interconnected with a memory 12, a monitor (display unit) 14 such as an LCD (Liquid Crystal Display), an input unit 16 consisting of a keyboard, a mouse, and the like, and a storage unit 20 consisting of a mass storage device such as a hard disk or an SSD (Solid State Drive). The storage unit 20 stores the PDF data extraction system program 21 according to the present invention as well as PDF file middleware (PDF middleware) 22 such as Adobe's Acrobat. The storage unit 20 also stores an OS (Operating System) 29. Any commercially available product from another vendor may be used as the PDF middleware 22, provided it has a function for extracting character strings from PDF files.

The PDF data extraction system 1 of this embodiment includes an interface (I/F) 18 that handles direct connections to external devices and connections to external devices over a network such as a LAN (Local Area Network). Through the I/F 18 and a network cable NW (or a wireless LAN), the system is connected to an analyzer A1, which is a chromatograph mass spectrometer. In the PDF data extraction system according to the present invention, the number of analyzers connected externally through the I/F 18 is not limited to one; there may be several. The PDF data extraction system may also be integrated with the analyzer.

As shown in FIG. 2, the PDF data extraction system program 21 includes an extraction unit setting unit 31, a horizontal threshold setting unit 32, a character string data acquisition unit 33, a table forming unit 34, and an output unit 35. All of these are functional means that are basically realized in software when the CPU 10 executes the PDF data extraction system program 21. The PDF data extraction system program 21 need not be a standalone program; it may, for example, be a function built into part of a program for controlling the analyzer, and its form is not particularly limited.

The operation of the PDF data extraction process is described concretely below with reference to the flowchart of FIG. 3.

First, the user starts the PDF data extraction system program 21 (hereinafter simply called the program). At this point the user specifies the analysis report 5, which is a PDF file, as the target of the operation. In this embodiment, the analysis report 5 is assumed to be the analysis report shown in FIG. 1(a).

[Step S21] Once the analysis report 5 has been specified, the extraction unit setting unit 31, the horizontal threshold setting unit 32, and the character string data acquisition unit 33 cooperate with the PDF middleware 22 to extract character strings from the analysis report 5 line by line.
Specifically, the extraction unit setting unit 31 sets the character string extraction unit of the PDF middleware 22 to the line, and the horizontal threshold setting unit 32 sets the horizontal threshold used in that mode. When the distance between characters within the character string of one line is equal to or greater than this horizontal threshold, the PDF middleware 22 judges the character strings before and after the gap to belong to different lines. The horizontal threshold is normally set to a preset value (for example, two or more half-width spaces at an ordinary size such as 11 points). After the extraction unit setting unit 31 and the horizontal threshold setting unit 32 have configured the PDF middleware 22, the character string data acquisition unit 33 uses the PDF middleware 22 to acquire each character string, line by line, from the PDF file of the analysis report 5. The data output by the PDF middleware 22 contain the coordinate values of the extracted character strings in addition to the strings themselves. In this example, the character strings and coordinate values (written "x:y") shown below are obtained for lines <L1> to <L31>. In the example below, character strings are delimited by commas (,), but if a character string may itself contain a comma, another symbol or a tab can be used instead. Alternatively, the coordinate value (x:y) itself, or a combination of the coordinate value and a comma, may serve as the delimiter between character strings.

<L1> Analysis Report (40:1)
<L2> Analysis result (0:3)
<L3> Sample 1 (2:5)
<L4> Sample 2 (25:5)
<L5> Sample 3 (50:5)
<L6> Sample 4 (70:5)
<L7> 311 (10:7)
<L8> 353 (30:7)
<L9> 322 (60:7)
<L10> 399 (80:7)
<L11> 2598 (7:9)
<L12> 2283 (26:9)
<L13> 2033 (77:9)
<L14>
<L15> 11.3 (26:13)
<L16> 13.6 (60:13)
<L17> 0 (85:13)
<L18> 115 (10:15)
<L19> 203 (26:15)
<L20> 268 (60:15)
<L21> 183 (80:16)
<L22> *1 (10:17)
<L23> *2 (26:17)
<L24> *3 (60:17)
<L25> 7973 (7:19)
<L26> 6088 (77:19)
<L27> 15 (60:21)
<L28> 3 (85:21)
<L29> Comment (0:23)
<L30> Of samples 1 to 4 analyzed this time, only sample 4 (0:25)
<L31> showed a pH abnormality. This is because sample 4's (0:27)
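The notation used in the listing above, a line label followed by a character string and its "x:y" coordinates, can be read mechanically. The Python sketch below is illustrative only and is not a format defined by any PDF middleware: the function name parse_listing and the regular expression are assumptions, and a bare label such as <L14> is taken to denote an empty line.

```python
import re

# "<L7> 311 (10:7)" -> label 7, string "311", x=10, y=7.
# The coordinate part is optional so that a bare "<L14>" is accepted.
LINE_RE = re.compile(r"^<L(\d+)>\s*(?:(.*?)\s*\((\d+):(\d+)\))?$")


def parse_listing(text):
    """Parse listing lines into (line_no, string, x, y) tuples.

    An empty line such as "<L14>" yields (14, None, None, None).
    """
    records = []
    for raw in text.strip().splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            raise ValueError(f"unrecognized line: {raw!r}")
        no, s, x, y = m.groups()
        if s is None:
            records.append((int(no), None, None, None))
        else:
            records.append((int(no), s, int(x), int(y)))
    return records


sample = """<L7> 311 (10:7)
<L8> 353 (30:7)
<L14>
<L29> Comment (0:23)"""
for record in parse_listing(sample):
    print(record)
```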

[Step S22] Next, the table forming unit 34 examines the acquired per-line character strings and coordinate values to see whether several consecutive lines were separated by a horizontal movement amount equal to or greater than the horizontal threshold. Character strings that share the same y coordinate across multiple extracted lines can be judged to be such lines. In this example, lines <L3> to <L6>, lines <L7> to <L10>, and so on qualify. Those lines are then recombined into a single row (the "original row" described above) based on their x coordinate values. When such original rows span multiple rows, the table forming unit 34 judges them to be data making up one table. In the example above, lines <L3> to <L28>, including the empty line <L14>, are judged to be the data making up one table. Lines <L1>, <L2>, <L30>, and <L31>, on the other hand, each consist of a single character string and are therefore judged to be ordinary text.
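One possible way to separate the table from the surrounding text in this step is sketched below in Python. It is an assumption-laden illustration, not the claimed method: the name split_table_and_text is hypothetical, and the rule used (the table runs from the first to the last y coordinate shared by two or more strings, and everything in between belongs to it) is one reading of the description, chosen so that a lone string such as "183" at y=16 stays inside the table. Only a subset of the example strings is used.

```python
from collections import Counter


def split_table_and_text(segments):
    """Split (text, x, y) strings into (table_strings, other_strings).

    Assumed rule: the table spans from the first to the last y coordinate
    that is shared by two or more strings; strings outside that span are
    treated as ordinary text.
    """
    counts = Counter(y for _, _, y in segments)
    shared = [y for y, n in counts.items() if n >= 2]
    if not shared:
        return [], list(segments)
    top, bottom = min(shared), max(shared)
    table = [s for s in segments if top <= s[2] <= bottom]
    other = [s for s in segments if not top <= s[2] <= bottom]
    return table, other


# A subset of the example strings, as (text, x, y).
SEGMENTS = [
    ("Analysis Report", 40, 1), ("Analysis result", 0, 3),
    ("Sample 1", 2, 5), ("Sample 2", 25, 5), ("Sample 3", 50, 5), ("Sample 4", 70, 5),
    ("115", 10, 15), ("203", 26, 15), ("268", 60, 15), ("183", 80, 16),
    ("15", 60, 21), ("3", 85, 21),
    ("Comment", 0, 23),
]
table, other = split_table_and_text(SEGMENTS)
print(len(table), [text for text, _, _ in other])
```

Under this rule a heading that happened to share its y coordinate with another string would be pulled into the table, which is one reason a user-supplied start or end string can be useful.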

The table forming unit 34 detects that the x coordinates of the character strings from line <L3> to line <L28> fall into four groups at roughly equal intervals, that is, that there are four columns. It also detects that the step in the y direction between the y coordinate of line <L21> and those of lines <L18> to <L20>, and between it and those of lines <L22> to <L24>, is smaller than the step between other original rows, and judges that lines <L18> to <L24> make up a single cell in the y direction.

Based on the above judgments, namely:
- Line <L3> is the first line of the table and line <L28> is the last.
- The table has four columns.
- Line <L14> contains no character string and is an empty row.
- Lines <L18> to <L24> form one cell in the y direction.
- The table has 8 rows and 4 columns.
the table forming unit 34 forms the table so that each character string is placed at its correct position in the table.
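The judgments listed above can be reproduced on the example data with the following Python sketch. It is a hedged illustration rather than the patented procedure: the name build_table, the use of the most common y step as the row pitch, and the assignment of each string to the column whose header left edge lies at or to its left are all assumptions, and the first row is assumed to be a complete header row.

```python
from bisect import bisect_right
from collections import Counter

# Table strings from lines <L3> to <L28> of the example, as (text, x, y).
TABLE_SEGMENTS = [
    ("Sample 1", 2, 5), ("Sample 2", 25, 5), ("Sample 3", 50, 5), ("Sample 4", 70, 5),
    ("311", 10, 7), ("353", 30, 7), ("322", 60, 7), ("399", 80, 7),
    ("2598", 7, 9), ("2283", 26, 9), ("2033", 77, 9),
    ("11.3", 26, 13), ("13.6", 60, 13), ("0", 85, 13),
    ("115", 10, 15), ("203", 26, 15), ("268", 60, 15), ("183", 80, 16),
    ("*1", 10, 17), ("*2", 26, 17), ("*3", 60, 17),
    ("7973", 7, 19), ("6088", 77, 19),
    ("15", 60, 21), ("3", 85, 21),
]


def build_table(segments):
    """Place each string in a grid; None marks a blank cell."""
    ys = sorted({y for _, _, y in segments})
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    # Assumption: the most common y step is the normal row pitch.
    pitch = Counter(diffs).most_common(1)[0][0] if diffs else 1
    row_of = {ys[0]: 0}
    row = 0
    for prev, cur in zip(ys, ys[1:]):
        step = cur - prev
        if step >= pitch:
            # A step of k pitches skips k - 1 empty rows.
            row += round(step / pitch)
        # A smaller step stays in the same row (multi-line cell).
        row_of[cur] = row
    # Assumption: the first row is a full header row; its x values mark columns.
    col_edges = sorted(x for _, x, y in segments if y == ys[0])
    table = [[None] * len(col_edges) for _ in range(row + 1)]
    for text, x, y in sorted(segments, key=lambda s: (s[2], s[1])):
        r = row_of[y]
        c = max(bisect_right(col_edges, x) - 1, 0)
        table[r][c] = text if table[r][c] is None else table[r][c] + "\n" + text
    return table


for cells in build_table(TABLE_SEGMENTS):
    print(cells)
```

On this data the sketch yields an 8-row, 4-column grid in which the fourth row is entirely blank and the strings at y=15, 16, and 17 are merged into one row, consistent with the judgments listed above. Whether this matches the layout of FIG. 4 cell for cell cannot be confirmed from the text alone.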

[Step S23] Finally, the output unit 35 outputs the table formed by the table forming unit 34 in a predetermined file format that is either specified in advance or specified by the user at this point (FIG. 4, which shows an example of output in CSV format). This completes the PDF data extraction process.

FIG. 5 shows an example of a screen during operation of another embodiment of the PDF data extraction system 1 according to the present invention. In this embodiment, a title input unit 36 and a delimiter input unit 37 are additionally provided as functional means realized in software when the CPU 10 executes the PDF data extraction system program 21.

When the user enters "Sample 1, Sample 2, Sample 3, Sample 4" as the column title list on this input screen (title input unit 36), the table forming unit 34 can establish that these character strings are the elements of the first row (original row) of the table and that the table has four columns. In this example the row title list is left blank, but entering character strings in that field makes it possible to fix the number of rows in the table as well.

On this input screen (delimiter input unit 37), the user also specifies, as the terminating character string, "Comment", which is the character string immediately after the last character string of the table. This lets the table forming unit 34 establish that the table ends with "3", the character string immediately before "Comment". In the example of FIG. 5 only a terminating character string can be entered, but the screen may also let the user enter, as the start character string of the table, the character string that immediately precedes the first character string of the table in the PDF file. This lets the table forming unit 34 determine where the table starts.
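How a user-supplied column title list and terminating character string could bound the table is sketched below in Python. The function name bound_table and the representation of the extracted strings as a flat list in extraction order are assumptions for illustration; the embodiment itself does not specify an implementation.

```python
def bound_table(strings, column_titles, end_text):
    """Return (number_of_columns, table_strings).

    The table is taken to start at the first consecutive occurrence of
    column_titles and to end just before end_text. Raises ValueError if
    either marker is missing.
    """
    n = len(column_titles)
    start = next(
        (i for i in range(len(strings) - n + 1) if strings[i:i + n] == column_titles),
        None,
    )
    if start is None:
        raise ValueError("column titles not found")
    # Search for the terminator only after the header row.
    end = strings.index(end_text, start + n)
    return n, strings[start:end]


strings = ["Analysis Report", "Analysis result",
           "Sample 1", "Sample 2", "Sample 3", "Sample 4",
           "311", "353", "15", "3",
           "Comment", "Of samples 1 to 4 analyzed this time, only sample 4"]
n_cols, table_strings = bound_table(
    strings, ["Sample 1", "Sample 2", "Sample 3", "Sample 4"], "Comment")
print(n_cols, table_strings[0], table_strings[-1])
# -> 4 Sample 1 3
```

A start character string could be handled the same way by searching for it and beginning the slice at the following element.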

1 ... PDF data extraction system
5 ... Analysis report (PDF file)
10 ... CPU
12 ... Memory
14 ... Monitor
16 ... Input unit
18 ... I/F
20 ... Storage unit
21 ... PDF data extraction system program
22 ... PDF middleware
29 ... OS
31 ... Extraction unit setting unit
32 ... Horizontal threshold setting unit
33 ... Character string data acquisition unit
34 ... Table forming unit
35 ... Output unit
36 ... Title input unit
37 ... Delimiter input unit
A1 ... Analyzer

Claims

A PDF data extraction system that uses PDF middleware to acquire the character strings constituting a table contained in a PDF file and outputs those character strings in a format from which the table can be reproduced, the system comprising:
a) an extraction unit setting unit that sets the extraction of character strings from the PDF file by the PDF middleware to line units based on character string attributes and a horizontal threshold, which is a threshold on the amount of horizontal movement between character strings;
b) a horizontal threshold setting unit that sets the horizontal threshold to a predetermined value;
c) a character string data acquisition unit that extracts character strings line by line from a specified PDF file in accordance with the settings of the extraction unit setting unit and the horizontal threshold setting unit;
d) a table forming unit that, from the coordinate values of each character string extracted line by line by the character string data acquisition unit, determines the position of the character string in the row direction from its x coordinate value and its position in the column direction from its y coordinate value, and forms a reconstructed table by arranging the character strings in tabular form; and
e) an output unit that outputs the reconstructed table in a predetermined data format.

The PDF data extraction system according to claim 1, further comprising:
f) a delimiter input unit that allows a user to input text immediately before or after the table, wherein the table forming unit determines the start or end of the table based on the text.

The PDF data extraction system according to claim 1 or 2, further comprising:
g) a title input unit that allows a user to input titles of the columns or rows of the table, wherein the table forming unit determines the number of columns or rows of the table based on the titles.

A program for a PDF data extraction system, used in a computer for a PDF data extraction system that uses PDF middleware to acquire the character strings constituting a table contained in a PDF file and outputs those character strings in a format from which the table can be reproduced, the program causing the computer to function as:
a) an extraction unit setting unit that sets the extraction of character strings from the PDF file by the PDF middleware to line units based on character string attributes and a horizontal threshold, which is a threshold on the amount of horizontal movement between character strings;
b) a horizontal threshold setting unit that sets the horizontal threshold to a predetermined value;
c) a character string data acquisition unit that extracts character strings line by line from a specified PDF file in accordance with the settings of the extraction unit setting unit and the horizontal threshold setting unit;
d) a table forming unit that, from the coordinate values of each character string extracted line by line by the character string data acquisition unit, determines the position of the character string in the row direction from its x coordinate value and its position in the column direction from its y coordinate value, and forms a reconstructed table by arranging the character strings in tabular form; and
e) an output unit that outputs the reconstructed table in a predetermined data format.