JP5465279B2 - Information processing apparatus and program

Info

Publication number: JP5465279B2
Application number: JP2012138105A
Authority: JP
Inventors: 進風間
Original assignee: 株式会社ポーラ・メソッド
Priority date: 2011-12-02
Filing date: 2012-06-19
Publication date: 2014-04-09
Anticipated expiration: 2032-06-19
Also published as: JP2013137733A

Description

Embodiments of the present invention relate to an information processing apparatus and a program for processing electronic document files.

A computer manages document data as electronic document files in a predetermined format. For example, when the electronic document file is a PDF (Portable Document Format) file, the computer can display the content of the PDF file on the screen of a display device, or print it on a printing device, without performing data conversion.

A PDF file is used as a single unit for display, for printing, and for writing to and reading from a storage device by a computer.

A PDF file includes, for example, one or more pieces of page expression data and resource data. The resource data includes, for example, at least one of character data and image data used by the page expression data. Because a PDF file may be created on the assumption that it will be output on a high-quality printer, the character data or image data in the PDF file may have excessively high quality.

When a large number of pieces of document data are stored as a large number of individual PDF files, each PDF file contains its own resource data. Consequently, the storage capacity required for all of these PDF files is larger than the storage capacity required when the same document data is stored in a single PDF file.

In addition, when a large number (for example, several million or more) of PDF files are stored in a computer and a required PDF file is searched for, the file control program bundled with, for example, an existing operating system is not designed to read a very large number of very small files. As a result, the larger the number of PDF files, the longer the data read time and the greater the processing load on the computer.

A PDF file can contain a single piece of resource data and use it on a plurality of pages. A PDF file can also refer to and use resource data outside the file. However, when resource data outside the file is referenced in this way, update management of the resource data becomes complicated.

Japanese Patent No. 4784361

An object of the embodiments of the present invention is to provide an information processing apparatus and a program that make management of electronic document files of a predetermined format more efficient.

According to an embodiment, an information processing apparatus includes index extraction means, page appending means, resource data changing means, merging means, index embedding means, and search means. The index extraction means extracts desired index data from a plurality of PDF files stored in a first storage device. The page appending means appends characters or code data representing the index data at a desired position in the page expression data included in the plurality of PDF files. The appended characters or code data are visible when a search result is viewed. The resource data changing means performs data size reduction processing on the resource data included in the plurality of PDF files. The merging means merges the page expression data included in the plurality of PDF files to which the characters or code data have been appended by the page appending means, and generates a merged PDF file that includes the merged page expression data and the resource data whose data size has been reduced by the resource data changing means. The index embedding means generates a reorganized PDF file in which the index data is embedded at a desired position in the merged PDF file, and stores the reorganized PDF file in a second storage device. The search means executes search processing on the reorganized PDF file stored in the second storage device based on the index data.

According to the embodiments of the present invention, management of electronic document files of a predetermined format can be made more efficient.

FIG. 1 is a block diagram showing an example of the configuration of an information processing apparatus according to a first embodiment.
FIG. 2 is a block diagram showing an example of the structure of page expression data and document data included in a PDF file.
FIG. 3 is a block diagram showing a first example of processing by the index extraction unit according to the first embodiment.
FIG. 4 is a block diagram showing a second example of processing by the index extraction unit according to the first embodiment.
FIG. 5 is a block diagram showing an example of processing by the page appending unit according to the first embodiment.
FIG. 6 is a block diagram showing an example of processing by the character data changing unit according to the first embodiment.
FIG. 7 is a block diagram showing an example of the index embedding unit according to the first embodiment.
FIG. 8 is a flowchart showing an example of processing of the information processing apparatus according to the first embodiment.
FIG. 9 is a data structure diagram showing an example of a reorganized PDF file according to a second embodiment.
FIG. 10 is a flowchart showing an example of batch search processing according to the second embodiment.
FIG. 11 is a flowchart showing an example of real-time search processing according to the second embodiment.

Embodiments of the present invention will be described below with reference to the drawings. In the following description, functions and components that are identical or substantially identical are denoted by the same reference numerals and are described only as necessary.

(First embodiment)
In the present embodiment, the case where the electronic document file to be managed is a PDF file is described; however, the file may be in another electronic document format.

FIG. 1 is a block diagram showing an example of the configuration of the information processing apparatus according to the present embodiment.

The information processing apparatus 1 according to the present embodiment has a configuration for realizing index search over a plurality of PDF files corresponding to a large amount of document data, and a configuration for reducing the data size of the PDF files.

To realize index search of PDF files, the information processing apparatus 1 extracts index data from the page expression data of a plurality of PDF files. The page expression data is, for example, print layout data including printed character strings, barcodes, and the like. The information processing apparatus 1 merges the plurality of PDF files and generates reorganized PDF files that are fewer in number than the original PDF files. The index data is embedded in each reorganized PDF file in a format that can be easily referenced by, for example, viewer software or search software. Furthermore, characters or code data representing the index data are appended to (embedded in) the page expression data included in the reorganized PDF file in a form that can be easily viewed with, for example, viewer software. Storing the index data extracted from the page expression data of the plurality of PDF files back into the reorganized PDF files in an easily searchable format in this way improves the update manageability and portability of the PDF files; this is the first characteristic feature of the present embodiment.

To reduce the data size of PDF files, the information processing apparatus 1 merges the plurality of PDF files into a smaller number (at least one) of reorganized PDF files. The number of pieces of document data contained in a reorganized PDF file is larger than the number contained in each original PDF file. Merging the plurality of PDF files allows resource data to be shared. The information processing apparatus 1 switches the character data embedded in the PDF files to a non-embedded format wherever possible. The information processing apparatus 1 simplifies the glyph shape information of the character data included in the PDF files to the extent that the resulting degradation in display quality is acceptable, thereby reducing the data size. The information processing apparatus 1 lowers the resolution of the image data, reducing its data size within an acceptable range. In this way, the data size of the resource data is changed after the resource data has been included in the plurality of PDF files. Performing these data size reductions in the course of generating the reorganized PDF files is the second characteristic feature of the present embodiment.

That is, after PDF files have been created, the information processing apparatus 1 according to the present embodiment reduces the data size of the character data and of the image data included in those PDF files, and includes the changed character data and image data in the reorganized PDF files.

The information processing apparatus 1 includes storage devices 2a and 2b, a document editing unit 3, a reorganization unit 4, a document management database system 5, and a search unit 6. The information processing apparatus 1 is connected to an input device 7, a display device 8, and a printing device 9. The information processing apparatus 1 may be a single computer, or a computer system in which a plurality of computers are connected so as to be able to exchange data. The document editing unit 3, the reorganization unit 4, and the search unit 6 are realized by, for example, a processor operating in accordance with a program stored in a storage medium. The storage devices 2a and 2b are, for example, a hard disk, a main storage device, or an internal memory provided in the information processing apparatus 1; they may be combined or separated as desired. The storage devices 2a and 2b may also be used as working memory.

The storage device 2a stores document data 10 such as text data; character data 11 including, for example, character IDs (identification information), fixed character codes, and character shape data; and image data 12.

The document editing unit 3 generates a plurality of PDF files 13₁ to 13ₙ based on the document data 10, the character data 11, and the image data 12 stored in the storage device 2a, and stores the generated PDF files 13₁ to 13ₙ in the storage device 2a. The document editing unit 3 may generate the PDF files 13₁ to 13ₙ, for example, in response to an operator's instructions or based on a preset template.

In the present embodiment, the PDF files 13₁ to 13ₙ include resource data 14₁ to 14ₙ and page expression data 15₁ to 15ₙ, respectively. The resource data 14₁ to 14ₙ in turn include character data 16₁ to 16ₙ and image data 17₁ to 17ₙ, respectively.

The reorganization unit 4 refers to the original PDF files 13₁ to 13ₙ stored in the storage device 2a, reorganizes them to generate reorganized PDF files 18₁ to 18ₖ, and stores the reorganized PDF files in the document management database system 5. In the present embodiment, the PDF files 13₁ to 13ₙ are merged to generate the PDF files 18₁ to 18ₖ, so the number of PDF files 18₁ to 18ₖ is smaller than the number of PDF files 13₁ to 13ₙ.

In the present embodiment, the PDF files 18₁ to 18ₖ include resource data 19₁ to 19ₖ, page expression data 20₁ to 20ₖ, and index data 21₁ to 21ₖ, respectively. The resource data 19₁ to 19ₖ in turn include character data 22₁ to 22ₖ and image data 23₁ to 23ₖ, respectively.

The reorganization unit 4 includes, for example, an index extraction unit 24, a page appending unit 25, a character data changing unit 26, an image data changing unit 27, a merge unit 28, and an index embedding unit 29.

The storage device 2b stores index designation data 30, page appending designation data 31, character change designation data 32, image change designation data 33, merge designation data 34, and index embedding designation data 35. The index designation data 30, the page appending designation data 31, the character change designation data 32, the image change designation data 33, the merge designation data 34, and the index embedding designation data 35 may instead be incorporated in the index extraction unit 24, the page appending unit 25, the character data changing unit 26, the image data changing unit 27, the merge unit 28, and the index embedding unit 29, respectively.

The index designation data 30 specifies, for example, the position from which index data is extracted from a PDF file and the extraction rules.

The page appending designation data 31 specifies various rules, such as the position at which the characters, character string, or code data representing the index data are appended to the page expression data included in a PDF file, and relational data for converting the index data into the corresponding characters, character string, or code data.

The character change designation data 32 specifies, for example, rules for reducing the data size of the character data included in a PDF file.

The image change designation data 33 specifies, for example, rules for reducing the data size of the image data included in a PDF file.

The merge designation data 34 specifies, for example, rules for merging a plurality of PDF files.

The index embedding designation data 35 specifies, for example, the area (position) and rules used when index data is embedded in a merged PDF file.

Based on the index designation data 30, the index extraction unit 24 extracts the desired index data 36₁ to 36ₙ designated by the index designation data 30 from the PDF files 13₁ to 13ₙ, and stores the index data 36₁ to 36ₙ, which correspond to the PDF files 13₁ to 13ₙ respectively, in the storage device 2b.

The page appending unit 25 appends characters or code data representing the index data 36₁ to 36ₙ at the desired positions, designated by the page appending designation data 31, in the page expression data 15₁ to 15ₙ included in the PDF files 13₁ to 13ₙ, in a form that is, for example, easy to see when displayed. These characters or code data can be viewed, for example, at search time, so the user can easily confirm the content of the index data.

The character data changing unit 26 performs data size reduction processing on the character data 16₁ to 16ₙ included in the PDF files 13₁ to 13ₙ in accordance with the character change designation data 32. For example, the character data changing unit 26 puts at least part of the character data 16₁ to 16ₙ included in the PDF files 13₁ to 13ₙ into a non-embedded format (font un-embedded). As another example, the character data changing unit 26 reduces the number of coordinates that make up the characters in the character data 16₁ to 16ₙ. In the present embodiment, the character data changing unit 26 changes the character data 16₁ to 16ₙ before the index embedding unit 29 embeds the index data 21₁ to 21ₖ in the merged PDF files.

The image data changing unit 27 performs data size reduction processing on the image data 17₁ to 17ₙ included in the PDF files 13₁ to 13ₙ in accordance with the image change designation data 33. For example, among the image data 17₁ to 17ₙ, the image data changing unit 27 lowers the resolution of image data whose data size exceeds a predetermined value. The resolution is lowered, for example, by thinning out pixels through averaging of the individual pixels of the image data. In the present embodiment, the image data changing unit 27 changes the image data 17₁ to 17ₙ before the index embedding unit 29 embeds the index data 21₁ to 21ₖ in the merged PDF files 48₁ to 48ₖ.
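The resolution reduction by pixel averaging described above could look roughly like the following. This is only a minimal sketch under stated assumptions: the image is a grayscale grid held as a list of rows, the pixel count stands in for the "data size" threshold, and the names `downsample_by_averaging`, `reduce_if_large`, `factor`, and `max_pixels` are hypothetical and do not appear in the specification.

```python
def downsample_by_averaging(pixels, factor):
    """Replace each factor x factor block of grayscale pixels by its average."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the block, clipping at the right and bottom edges.
            block = [
                pixels[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            row.append(sum(block) // len(block))
        result.append(row)
    return result


def reduce_if_large(pixels, max_pixels, factor=2):
    """Downsample only images above the threshold (assumed: pixel count)."""
    size = len(pixels) * (len(pixels[0]) if pixels else 0)
    if size > max_pixels:
        return downsample_by_averaging(pixels, factor)
    return pixels
```

With `factor=2` each output pixel stands for four input pixels, so the raw pixel data shrinks to about a quarter; images at or below the threshold are returned unchanged.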

The merge unit 28 executes processing for merging the plurality of PDF files 13₁ to 13ₙ. For example, the merge unit 28 generates merged PDF files 48₁ to 48ₖ such that the file size after merging falls within a predetermined range, and stores them in the storage device 2b. The merged PDF files 48₁ to 48ₖ include the size-reduced character data 22₁ to 22ₖ, the size-reduced image data 23₁ to 23ₖ, and the page expression data 20₁ to 20ₖ obtained by merging the page expression data 15₁ to 15ₙ. More precisely, each of the merged PDF files 48₁ to 48ₖ includes the corresponding resource data 19₁ to 19ₖ, which contains the size-reduced character data 22₁ to 22ₖ and the size-reduced image data 23₁ to 23ₖ, together with the corresponding page expression data 20₁ to 20ₖ. Duplicates are eliminated within each of the resource data 19₁ to 19ₖ. Characters or code data representing the index data 36₁ to 36ₙ have been appended to the merged page expression data 20₁ to 20ₖ.
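One way to picture the merge step is the sketch below, which packs source files into merged files up to a size limit and stores each distinct resource only once per merged file. It is an illustration under assumptions, not the method of the specification: pages and resources are modeled as plain `bytes`, duplicates are detected by SHA-256 digest, and the names `SourcePdf`, `MergedPdf`, `merge`, and `max_size` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SourcePdf:
    name: str
    pages: list       # page expression data, one bytes object per page
    resources: list   # resource payloads (fonts, images) as bytes


@dataclass
class MergedPdf:
    sources: list = field(default_factory=list)
    pages: list = field(default_factory=list)
    resources: dict = field(default_factory=dict)  # digest -> payload, stored once

    def size(self):
        return sum(map(len, self.pages)) + sum(map(len, self.resources.values()))


def merge(sources, max_size):
    """Pack sources into merged files, sharing identical resources."""
    merged = [MergedPdf()]
    for src in sources:
        current = merged[-1]
        by_digest = {hashlib.sha256(r).hexdigest(): r for r in src.resources}
        # Only resources not yet present in the current file add to its size.
        extra = sum(map(len, src.pages)) + sum(
            len(r) for digest, r in by_digest.items()
            if digest not in current.resources
        )
        if current.sources and current.size() + extra > max_size:
            current = MergedPdf()
            merged.append(current)
        current.sources.append(src.name)
        current.pages.extend(src.pages)
        for digest, payload in by_digest.items():
            current.resources.setdefault(digest, payload)
    return merged
```

In this sketch two source files that embed the same font end up sharing one copy of it, which is where the storage saving from merging comes from.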

In accordance with the index embedding designation data 35, the index embedding unit 29 generates index data 21₁ to 21ₖ by assigning the index data 36₁ to 36ₙ to the merged PDF files 48₁ to 48ₖ as appropriate, embeds the assigned index data 21₁ to 21ₖ in the merged PDF files 48₁ to 48ₖ to generate reorganized PDF files 18₁ to 18ₖ, and stores the reorganized PDF files 18₁ to 18ₖ in the document management database system 5. The reorganized PDF files 18₁ to 18ₖ include the deduplicated resource data 19₁ to 19ₖ, the page expression data 20₁ to 20ₖ, and the index data 21₁ to 21ₖ, respectively.
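The assignment of per-source index data to a merged file can be sketched as follows. The specification does not spell out the embedded format, so this is an assumption-laden illustration: each source contributes one key, and its pages are renumbered to a 1-based range inside the merged file. `IndexEntry` and `build_embedded_index` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    key: str          # e.g. a postal code or customer ID
    start_page: int   # 1-based, inclusive, within the merged file
    end_page: int     # inclusive


def build_embedded_index(keys, page_counts):
    """Map each source's key to the page range it occupies after merging.

    keys[i] is the index key of the i-th source in merge order and
    page_counts[i] is its number of pages (assumed to be at least 1).
    """
    entries = []
    next_page = 1
    for key, count in zip(keys, page_counts):
        entries.append(IndexEntry(key, next_page, next_page + count - 1))
        next_page += count
    return entries
```

Recording the page range alongside the key lets a viewer jump straight to the relevant pages of a large merged file.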

As described above, the reorganization unit 4 stores in the document management database system 5 the reorganized PDF files 18₁ to 18ₖ, which are generated by extracting index data, appending to the page expression data characters, character strings, or code data that represent the index data in an easily referenced form, reducing the character data size, reducing the image data size, merging, and embedding the index data. The document management database system 5 manages various files and data.

The input device 7 receives an instruction, a command, a search keyword, or the like from the user and provides it to the search unit 6.

Based on an instruction, a command, a search keyword, index data, or the like from the user, the search unit 6 retrieves the corresponding PDF file from the document management database system 5. The search unit 6 then provides the retrieved PDF file to the display device 8 or the printing device 9.
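A lookup against embedded index data might be sketched as below. This assumes, purely for illustration, that the index of each reorganized file has already been read into a list of `(key, start_page, end_page)` tuples keyed by file name; the function name `search` and this in-memory layout are hypothetical and are not taken from the specification.

```python
def search(reorganized_files, key):
    """Return (file name, start page, end page) for each entry matching key.

    reorganized_files maps a file name to its embedded index, given as a
    list of (key, start_page, end_page) tuples.
    """
    hits = []
    for file_name, index in reorganized_files.items():
        for entry_key, start_page, end_page in index:
            if entry_key == key:
                hits.append((file_name, start_page, end_page))
    return hits
```

Because the lookup touches only the index of a small number of large files, it avoids opening millions of small files individually.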

The display device 8 displays the PDF file retrieved by the search unit 6 on its screen.

The printing device 9 prints the PDF file retrieved by the search unit 6.

The information processing apparatus 1 stores the reorganized PDF files 18₁ to 18ₖ in the document management database system 5, and makes a required PDF file available by searching for it and displaying it on the screen or printing it. The information processing apparatus 1 is used, for example, in call center operations that answer telephone inquiries about mailings from insurance companies, securities firms, banks, local governments, and the like. The information processing apparatus 1 is also used, for example, for viewing and partially copying electronic documents at document centers and libraries, and for viewing documents in the browser of a tablet terminal.

FIG. 2 is a block diagram showing an example of the structure of the page expression data and document data included in a PDF file.

The page expression data 38 included in the PDF file 37 includes instruction strings 38₁ and 38₂, each of which represents one line of characters.

The instruction string 38₁ includes the coordinates a1 (horizontal position x1, vertical position y1) of the head position of the line on the page, line attributes b1 (for example, line direction, character size, and character color), and identification information (tags) c-1, c-2, c-3, c-4, and c-5 that designate character data.

The next instruction string 38₂ includes the coordinates a2 of the head position of the line on the page, line attributes b2, and identification information c-1, c-2, c-10, c-20, and c-30 that designate character data.

The character data 39₁ included in the PDF file 37 includes the identification information c-1 designated in the instruction strings, a fixed character code d1 for the target character, and glyph information e1 (for example, data representing the character shape as closed vector curves made up of straight lines and arc-approximating functions such as Bézier curves and splines).

The character data 39₂ included in the PDF file 37 includes the identification information c-2 designated in the instruction strings, a fixed character code d2 for the target character, and glyph information e2.

For example, if c-1 corresponds to "あ" (a), c-2 to "い" (i), c-3 to "う" (u), c-4 to "え" (e), and c-5 to "お" (o), and c-10 corresponds to "く" (ku), c-20 to "け" (ke), and c-30 to "こ" (ko), then the page layout description of the PDF file 37 lays out on the page one line containing the characters "あいうえお" and one line containing the characters "あいくけこ".
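The relationship between instruction strings and character data in FIG. 2 can be modeled with a small data structure. The sketch below is illustrative only: the class and function names (`CharacterData`, `InstructionString`, `render_line`) are hypothetical, the fixed character code is represented by the character itself, and glyph outlines are left empty.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterData:
    tag: str                 # identification information, e.g. "c-1"
    char_code: str           # fixed character code (here: the character itself)
    glyph: list = field(default_factory=list)  # outline data, omitted here


@dataclass
class InstructionString:
    origin: tuple            # (x, y) of the head position of the line
    attributes: dict         # line direction, character size, color, ...
    tags: list               # tags designating the characters of the line


def render_line(instruction, char_table):
    """Resolve each tag to its character and join them into one line."""
    return "".join(char_table[tag].char_code for tag in instruction.tags)


chars = "あいうえお"
char_table = {f"c-{i + 1}": CharacterData(f"c-{i + 1}", ch) for i, ch in enumerate(chars)}
char_table.update({
    "c-10": CharacterData("c-10", "く"),
    "c-20": CharacterData("c-20", "け"),
    "c-30": CharacterData("c-30", "こ"),
})
line_1 = InstructionString((10, 10), {"size": 12}, ["c-1", "c-2", "c-3", "c-4", "c-5"])
line_2 = InstructionString((10, 30), {"size": 12}, ["c-1", "c-2", "c-10", "c-20", "c-30"])
```

Note that both lines reuse the entries for c-1 and c-2, so each glyph needs to be stored only once however often the character appears.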

FIG. 3 is a block diagram showing a first example of processing by the index extraction unit 24 according to the present embodiment. FIG. 3 shows an example in which printed characters included in the PDF file 40 are extracted.

The index designation data 30 specifies the region of the page expression data of the PDF file 40 from which index data is to be extracted. The index designation data 30 further specifies the key character string to be extracted and its range.

In the example of FIG. 3, based on the index designation data 30 and the page expression data of the PDF file 40, the index extraction unit 24 extracts, as index data 41, the postal code and a predetermined number of digits following the postal code from a region that has a vertical extent a and a horizontal extent b measured from the reference coordinates (x, y).
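The region-based extraction in FIG. 3 can be sketched as follows, assuming the page's text has already been obtained as positioned items. Everything named here is hypothetical (`TextItem`, `extract_index`, the parameter names), the y axis is assumed to grow downward, and the default pattern assumes a seven-digit postal code written as NNN-NNNN.

```python
import re
from dataclasses import dataclass


@dataclass
class TextItem:
    x: float
    y: float
    text: str


def extract_index(items, x, y, width_b, height_a, pattern=r"\d{3}-\d{4}"):
    """Return the first pattern match among text items inside the region.

    The region starts at (x, y) and spans width_b horizontally and
    height_a vertically.
    """
    in_region = [
        item for item in items
        if x <= item.x <= x + width_b and y <= item.y <= y + height_a
    ]
    # Read the region top to bottom, then left to right.
    in_region.sort(key=lambda item: (item.y, item.x))
    text = " ".join(item.text for item in in_region)
    match = re.search(pattern, text)
    return match.group(0) if match else None
```

Restricting the match to the designated region keeps similar-looking strings elsewhere on the page from being picked up as index data.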

FIG. 4 is a block diagram showing a second example of processing by the index extraction unit 24 according to the present embodiment. FIG. 4 shows an example in which a barcode contained in the PDF file 40 as printed characters is decoded to generate the index data 41.

The index designation data 30 specifies that a barcode located in a predetermined region of the page expression data of the PDF file 40 is to be extracted.

Based on the index designation data 30 and the page expression data of the PDF file 40, the index extraction unit 24 extracts the barcode from a region that has a vertical extent a and a horizontal extent b measured from the reference coordinates (x, y), and generates index data 41 containing the information "ABCD" represented by the barcode.

More specifically, the index extraction unit 24 decodes the PDF file 40 into an image and extracts the imaged characters, one-dimensional barcodes, and two-dimensional barcodes. A barcode may be laid out in its entirety as image information, or individual barcodes may be laid out as character information. The index extraction unit 24 programmatically scans the region specified by the index designation data 30 to recognize key character strings and barcodes and, for example, converts the recognition results into text.

The index data 41 includes, for example, a postal code, a customer ID, a customer name, a start page number, and an end page number.

FIG. 5 is a block diagram showing an example of processing by the page appending unit 25 according to the present embodiment. FIG. 5 shows an example in which a postal barcode is derived from the postal code and address included as printed characters in the PDF file 40 and is appended to the page.

The page appending designation data 31 specifies that a barcode corresponding to the postal code is to be appended, together with the position at which it is to be appended (embedded).

The page appending unit 25 converts the postal code in the index data 41 into a barcode based on the page appending designation data 31. The page appending unit 25 then appends the barcode to the page expression data of the PDF file 40 at the position specified by the page appending designation data 31, generating a PDF file 42 whose page expression data contains the barcode. The appended barcode is part of the page layout of the PDF file 42 and is visible when the PDF file 42 is displayed.

FIG. 6 is a block diagram showing an example of processing by the character data changing unit 26 according to the present embodiment. FIG. 6 shows an example in which highly redundant coordinates are deleted from the coordinate information describing character shapes in order to reduce the amount of data.

The character change designation data 32 specifies that the character type ST1 is to be changed to a non-embedded format and that the glyph shapes of the other character types are to be changed.

To reduce the data size of the character data 43 included in a PDF file, the character data changing unit 26, for example, changes the character type ST1 specified by the character change designation data 32 to the non-embedded format. For example, if the character type ST1 is already registered in the information processing apparatus 1 that views the reorganized PDF file, that apparatus can display or print by referring to its own registered copy of ST1, so the character type ST1 is deleted from the character data 43 included in the reorganized PDF file.

Furthermore, the character data changing unit 26 changes the glyph shapes of the other character types in the character data 43 based on, for example, the character change designation data 32, and thereby reduces the data size of the character data 44. Changing the glyph shape of a character type includes, for example, deleting coordinate points that make up a character, merging coordinate points that lie at nearly the same position, and deleting coordinate points at which the incoming angle and the outgoing angle differ only slightly.
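The two reductions named above (merging nearby points and deleting points where the outline barely turns) can be sketched as follows. This is an illustration under assumptions, not the patent's algorithm: an outline is treated as an open list of (x, y) tuples, "merging" is simplified to dropping the later of two close points, and the thresholds `min_dist` and `max_turn_deg` are hypothetical parameters.

```python
import math


def merge_close_points(points, min_dist):
    """Drop a point when it lies within min_dist of the previously kept point."""
    kept = [points[0]]
    for p in points[1:]:
        if math.dist(p, kept[-1]) >= min_dist:
            kept.append(p)
    return kept


def drop_straight_points(points, max_turn_deg):
    """Drop interior points where the outline turns by less than max_turn_deg."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for cur, nxt in zip(points[1:], points[2:]):
        prev = kept[-1]
        angle_in = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
        angle_out = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
        turn = abs(math.degrees(angle_out - angle_in))
        turn = min(turn, 360.0 - turn)  # wrap the difference into [0, 180]
        if turn >= max_turn_deg:
            kept.append(cur)
    kept.append(points[-1])
    return kept


outline = [(0, 0), (5, 0.1), (10, 0), (10, 10)]
print(drop_straight_points(outline, max_turn_deg=5.0))
print(merge_close_points([(0, 0), (0.2, 0), (5, 0)], min_dist=1.0))
```

Real glyph outlines are closed curves with Bézier control points, so a production version would also need to handle curve segments and the wrap-around at the start point.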

FIG. 7 is a block diagram showing an example of the index embedding unit 29 according to the present embodiment.

A PDF file can hold data in areas that PDF viewing software skips over, such as between the "start" and "end" of a comment command, or in areas that PDF viewing software does not recognize.

The index embedding designation data 35 specifies that the index data is to be placed in one of these areas skipped by PDF viewing software. As the method of embedding the index data, for example, the PDF "bookmark" feature or the "metadata" insertion feature defined by the PDF specification is used.

In accordance with the index embedding designation data 35, the index embedding unit 29 embeds the index data 41 in an area 46 of the merged PDF file 45 that PDF viewing software skips over, and generates a reorganized PDF file 47.
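One way to picture an area that viewers skip is a comment line: in PDF syntax, a `%` outside strings and streams starts a comment that runs to the end of the line. The sketch below only shows how an index record might be encoded into, and recovered from, such a line. The marker `%IDX ` and the JSON payload are assumptions made for illustration, and the sketch does not produce a valid PDF (a real writer must also keep the cross-reference offsets consistent).

```python
import json

MARKER = "%IDX "  # hypothetical marker; '%' begins a comment in PDF syntax


def encode_index_comment(index):
    """Serialize one index record as a single comment line."""
    # json.dumps without indent yields one line; newlines in values are escaped.
    return MARKER + json.dumps(index, ensure_ascii=True, sort_keys=True)


def decode_index_comments(text):
    """Collect every index record found in marker comment lines."""
    records = []
    for line in text.splitlines():
        if line.startswith(MARKER):
            records.append(json.loads(line[len(MARKER):]))
    return records


record = {"customer_id": "C001", "start_page": 1, "end_page": 2}
body = "%PDF-1.7\n" + encode_index_comment(record) + "\n1 0 obj\n"
print(decode_index_comments(body))
```

The bookmark and metadata features mentioned in the text are the more conventional carriers; the comment form is shown only because it is the simplest to read.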

FIG. 8 is a flowchart showing an example of processing by the information processing apparatus 1 according to the present embodiment. The order of the steps below may be changed as desired.

In step S1, the information processing apparatus 1 extracts index data 361 to 36n from the page expression data 151 to 15n included in the plurality of PDF files 131 to 13n.

In step S2, the information processing apparatus 1 appends characters, character strings, or code data representing the index data 361 to 36n to the page expression data 151 to 15n.

In step S3, the information processing apparatus 1 merges the page expression data to which the characters, character strings, or code data representing the index data 361 to 36n have been appended, and generates page expression data 201 to 20k.

In step S4, the information processing apparatus 1 reduces the data size of the character data 161 to 16n included in the plurality of PDF files 131 to 13n, and generates character data 221 to 22k.

In step S5, the information processing apparatus 1 reduces the data size of the image data 171 to 17n included in the plurality of PDF files 131 to 13n, and generates image data 231 to 23k.

In step S6, the information processing apparatus 1 generates merged PDF files 181 to 18k that include the page expression data 201 to 20k to which the characters or code data have been appended, the character data 221 to 22k with reduced data size, and the image data 231 to 23k with reduced data size.

In step S7, the information processing apparatus 1 embeds index data 211 to 21k, which corresponds to the index data 361 to 36n, in the merged PDF files 181 to 18k.
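The relationship between the per-file index data and the index data embedded after merging can be sketched as below. It assumes, for illustration only, that each source file contributes one key and a list of pages, and that the embedded index records 1-based start and end page numbers in the merged file; the function name and record layout are hypothetical.

```python
def merge_with_index(documents):
    """Merge page lists and record where each source document landed.

    documents: list of (key, pages) pairs, one per source PDF file.
    Returns (merged_pages, index), where index holds 1-based page ranges.
    """
    merged_pages = []
    index = []
    for key, pages in documents:
        start = len(merged_pages) + 1  # first page of this document after merging
        merged_pages.extend(pages)
        index.append({"key": key, "start_page": start, "end_page": len(merged_pages)})
    return merged_pages, index


merged, index = merge_with_index([("C001", ["a", "b"]), ("C002", ["c"])])
print(merged)
print(index)
```

Steps S2, S4, and S5 (appending visible codes and shrinking resource data) are left out so that only the merge and page-range bookkeeping of steps S3, S6, and S7 is visible.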

In the present embodiment described above, index data 361 to 36n for extracting a search-target PDF file is generated quickly and automatically from the large number of PDF files 131 to 13n stored in the storage device 2a of the information processing apparatus 1. In addition, reorganized PDF files 181 to 18k that combine the plurality of PDF files 131 to 13n are formed. This prevents the number of PDF files from growing, allows the resource data included in the PDF files to be shared, and reduces the data size of the PDF files.

In the present embodiment, the index data 211 to 21k, obtained by assigning the index data 361 to 36n to the reorganized PDF files 181 to 18k, is embedded in the reorganized PDF files 181 to 18k, so the index data 211 to 21k and the PDF files 181 to 18k are not held separately. As a result, the reorganized PDF files 181 to 18k can be searched quickly and are simpler to handle.

In the present embodiment, the index data 361 to 36n is embedded in the reorganized PDF files 181 to 18k in an easily searchable form, which improves the portability of the information used for searching.

In the present embodiment, the plurality of PDF files 131 to 13n are combined to generate the reorganized PDF files 181 to 18k, and the plurality of resource data 141 to 14n are combined, stripped of duplicates, and reorganized into resource data 191 to 19k. This reduces the data size.
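Duplicate elimination across files can be sketched with content hashing. This is one possible technique, not necessarily the one used in the patent: each resource (for example a font or an image) is stored once under its SHA-256 digest, and every file keeps only references. The data layout is hypothetical.

```python
import hashlib


def deduplicate_resources(files):
    """Store each distinct resource once and give every file references to it.

    files: list of dicts mapping a resource name to its bytes.
    Returns (store, refs): store maps digest -> bytes, refs mirrors files
    but maps each resource name to a digest.
    """
    store = {}
    refs = []
    for resources in files:
        file_refs = {}
        for name, data in resources.items():
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)  # keep only the first copy
            file_refs[name] = digest
        refs.append(file_refs)
    return store, refs


files = [{"font": b"AAAA", "logo": b"LOGO"}, {"font": b"AAAA"}]
store, refs = deduplicate_resources(files)
before = sum(len(d) for f in files for d in f.values())
after = sum(len(d) for d in store.values())
print(before, after)
```

In this toy input the shared font is stored once, so 12 bytes of resource data shrink to 8.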

In the present embodiment, when the character data contains a character type that does not need to be embedded, for example because the information processing apparatus 1 that views and prints the PDF file already holds the same character type, that character type is changed to the non-embedded format. Because resource data accounts for a large share of a PDF file's data size, reducing the total volume of character data reduces the data size of the PDF file.

In the present embodiment, the data size of character data that cannot be changed to the non-embedded format is also reduced, which further reduces the data size of the PDF file. Character data represents a character as, for example, a closed vector outline built from Bézier curves, spline functions, and straight lines. Typical character data is designed for high-quality output and carries more coordinate points than necessary. In the present embodiment, coordinate points that are not needed for, say, screen display or printing on a low-resolution printer are removed, and the data size of the character data is reduced.

In the present embodiment, image data with a large data size, such as photographs, is processed in bulk to reduce its resolution to a predetermined level. Image data is represented by gradation information expressing color and density. Lowering the resolution, which determines image quality, also lowers the data size of the image data. This reduces the data size of the PDF file while maintaining a predetermined level of image quality.
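A minimal sketch of resolution reduction follows, assuming a grayscale image held as a list of rows of integer gray levels. Halving by averaging 2x2 blocks is just one simple choice; the patent does not specify the resampling method, and `downsample_half` is a hypothetical name.

```python
def downsample_half(pixels):
    """Halve width and height by averaging each 2x2 block of gray levels.

    A trailing odd row or column is dropped.
    """
    out = []
    for r in range(0, len(pixels) - 1, 2):
        row = []
        for c in range(0, len(pixels[r]) - 1, 2):
            total = (pixels[r][c] + pixels[r][c + 1]
                     + pixels[r + 1][c] + pixels[r + 1][c + 1])
            row.append(total // 4)  # integer mean of the four samples
        out.append(row)
    return out


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 50, 50],
    [100, 100, 50, 50],
]
print(downsample_half(image))
```

Each halving cuts the number of stored samples to a quarter, which is why resolution has such a strong effect on the data size of image-heavy files.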

In the present embodiment, the extracted index data 361 to 36n is processed, and characters or code data such as barcodes corresponding to the index data 361 to 36n are appended to the page expression data 201 to 20k of the reorganized PDF files 181 to 18k so as to be visible. A user can therefore easily check the content of the index data simply by viewing the reorganized PDF files 181 to 18k.

The components of the information processing apparatus 1 according to the present embodiment can be freely combined and freely separated. For example, the character data changing unit 26 and the image data changing unit 27 may be combined, and the storage device 2a and the storage device 2b may be combined.

Furthermore, the processing order of the components of the information processing apparatus 1 according to the present embodiment can be changed as appropriate, as long as the reorganized PDF files 181 to 18k can still be generated. For example, the data size reduction of the resource data may be performed either before or after the PDF files 131 to 13n are merged. The index data may be extracted after the PDF files 131 to 13n are merged. The characters or code data corresponding to the index data may be appended to the page expression data after the PDF files 131 to 13n are merged. The merge may be performed after the index data 361 to 36n has been embedded in the PDF files 131 to 13n. The index data may be extracted after the page expression data is merged. The characters or code data representing the index data may be appended after the page expression data is merged.

(Second Embodiment)
This embodiment describes a method for quickly retrieving one or more specific pieces of document data from the reorganized PDF files 181 to 18k of the first embodiment, which correspond to a large number of pieces of document data.

The reorganized PDF files 181 to 18k include index data 211 to 21k used for high-speed searching. The data volume of the resource data 191 to 19k included in the reorganized PDF files 181 to 18k has been reduced. Each of the reorganized PDF files 181 to 18k is assigned either a single piece of page expression data or a combination of pieces of page expression data, as appropriate.

A feature of the information processing apparatus 1 is that it generates the reorganized PDF files 181 to 18k, with reduced data volume, from the already generated PDF files 131 to 13n.

The information processing apparatus 1 generates the index data 211 to 21k for finding the required document data and embeds the index data 211 to 21k in the reorganized PDF files 181 to 18k. This improves data portability and the efficiency of data management.

Furthermore, to reduce the data size of the reorganized PDF files 181 to 18k, the information processing apparatus 1 lowers the image resolution of the image data 231 to 23k and thins out the coordinate information that describes the shapes in the character data 221 to 22k.

FIG. 9 is a schematic diagram showing an example of the data structure of the reorganized PDF file 181. The other reorganized PDF files 182 to 18k can have the same data structure as in FIG. 9.

The reorganized PDF file 181 includes character data 221, image data 231, page expression data 201, and index data 211.

In general, even for a PDF file covering a large number of pages, the character data and image data used by those pages do not grow in proportion to the number of pages. Therefore, storing a large number of pages in a single PDF file, as in the first embodiment, requires less data volume than storing the same pages split across many files.
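The saving can be made concrete with a simple model. The symbols below are assumptions introduced for illustration: n source files, page expression data of size P_i in file i, and an identical block of resource data of size R used by every file.

```latex
S_{\text{separate}} = \sum_{i=1}^{n} (P_i + R) = \sum_{i=1}^{n} P_i + nR,
\qquad
S_{\text{merged}} = \sum_{i=1}^{n} P_i + R,
\qquad
S_{\text{separate}} - S_{\text{merged}} = (n - 1)\,R
```

Under this model, merging n = 1000 files that each carry R = 200 KB of identical fonts and images saves 999 x 200 KB, or about 200 MB. When resources are only partly shared, the saving is correspondingly smaller.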

In addition to the features above, this embodiment describes how search processing by the search unit 6 is made faster. In this embodiment, document data search is assumed to be used in electronic libraries, electronic archives, information processing systems of insurance companies, call centers, and the like, but it can also be used in other fields.

The first type of search is batch search processing.

FIG. 10 is a flowchart showing an example of batch search processing.

In step T1, the search unit 6 receives a specification of the index data to be searched for.

In step T2, the search unit 6 extracts the character data 221 to 22k, the image data 231 to 23k, and the index data 211 to 21k from the reorganized PDF files 181 to 18k.

In step T3, the search unit 6 consults the index data 211 to 21k using the index data being searched for, and determines the target page expression data (the target page range).

In step T4, the search unit 6 outputs the determined target page expression data, together with the character data and image data used by that page expression data, to, for example, the storage device 2b, the display device 8, or the printing device 9.
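Steps T1 to T4 can be sketched as a single pass over the files. The dictionary layout (`name`, `pages`, `index` with 1-based `start_page` and `end_page`) is a hypothetical stand-in for a parsed reorganized PDF file, and `batch_search` is an illustrative name.

```python
def batch_search(reorganized_files, field, wanted):
    """Return (file name, pages) for every index entry whose field matches."""
    hits = []
    for file in reorganized_files:           # T2: read each file's index
        for entry in file["index"]:
            if entry.get(field) == wanted:   # T3: match and find the page range
                first = entry["start_page"] - 1
                pages = file["pages"][first:entry["end_page"]]
                hits.append((file["name"], pages))  # T4: collect for output
    return hits


files = [{
    "name": "reorg-1.pdf",
    "pages": ["p1", "p2", "p3"],
    "index": [
        {"customer_id": "C001", "start_page": 1, "end_page": 2},
        {"customer_id": "C002", "start_page": 3, "end_page": 3},
    ],
}]
print(batch_search(files, "customer_id", "C001"))
```

Because the index is read from the files on every call, the cost of each search includes that extraction, which is the overhead the real-time variant avoids.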

The second type of search is real-time search processing.

FIG. 11 is a flowchart showing an example of real-time search processing.

In real-time search processing, so that successive search requests can be answered, the search program, the resource data 191 to 19k (character data 221 to 22k and image data 231 to 23k), and the index data 211 to 21k are kept resident in, for example, the storage device 2b, a working store that can be accessed at high speed (for example, internal memory). The search program is executed by the processor of the information processing apparatus 1 and implements the function of the search unit 6. In real-time search processing, a search request, such as the index data to be searched for, is issued by, for example, another program.

Real-time search processing can be significantly faster than batch search processing.

In step U1, the information processing apparatus 1 makes the character data 221 to 22k, the image data 231 to 23k, and the index data 211 to 21k of the reorganized PDF files 181 to 18k, together with the search program that causes the processor to function as the search unit 6, resident in the storage device 2b, and starts the search program.

In step U2, the search unit 6 watches for a specification of index data to be searched for, issued by another program.

In step U3, when the search unit 6 receives a specification of index data to be searched for, it consults the index data 211 to 21k resident in the storage device 2b using the specified index data, and determines the target page expression data.

In step U4, the search unit 6 outputs the determined target page expression data, together with the character data and image data that it uses and that are resident in the storage device 2b, to, for example, the storage device 2b, the display device 8, or the printing device 9.

The information processing apparatus 1 repeats the processing from step U2 onward until the real-time search processing ends (step U5).
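The U1 to U5 loop can be sketched as a resident lookup table that is built once and then answers requests as they arrive. The class name, the use of `queue.Queue` to stand in for requests from another program, and the `None` sentinel that ends the loop are all assumptions made for illustration.

```python
import queue


class ResidentSearcher:
    """Keeps the index in memory so each request is a dictionary lookup."""

    def __init__(self, reorganized_files, field):
        # U1: load the index once and keep it resident.
        self._lookup = {}
        for file in reorganized_files:
            for entry in file["index"]:
                first = entry["start_page"] - 1
                pages = file["pages"][first:entry["end_page"]]
                self._lookup.setdefault(entry[field], []).extend(pages)

    def serve(self, requests, results):
        """Answer requests until the sentinel None arrives (U2 to U5)."""
        while True:
            wanted = requests.get()          # U2: wait for a request
            if wanted is None:               # U5: end of real-time search
                break
            results.append(self._lookup.get(wanted, []))  # U3 and U4


files = [{
    "pages": ["p1", "p2", "p3"],
    "index": [
        {"customer_id": "C001", "start_page": 1, "end_page": 2},
        {"customer_id": "C002", "start_page": 3, "end_page": 3},
    ],
}]
requests = queue.Queue()
for item in ("C002", "C999", None):
    requests.put(item)
results = []
ResidentSearcher(files, "customer_id").serve(requests, results)
print(results)
```

Compared with the batch flow, the per-request work shrinks to a single lookup because the extraction of step T2 is paid only once, at startup.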

As described above, this embodiment gives a concrete description of the search processing performed by the search unit 6. By executing real-time search processing, the search unit 6 can extract the desired data at high speed.

Several embodiments of the present invention have been described, but these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be carried out in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications fall within the scope and gist of the invention, and within the invention described in the claims and its equivalents.

Description of symbols: 1: information processing apparatus; 2a, 2b: storage devices; 3: document editing unit; 4: reorganization unit; 5: document management database system; 6: search unit; 7: input device; 8: display device; 9: printing device; 10: document data; 11, 221 to 22k: character data; 12, 231 to 23k: image data; 131 to 13n: PDF files; 141 to 14n, 191 to 19k: resource data; 151 to 15n, 201 to 20k: page expression data; 161 to 16n: character data; 171 to 17n: image data; 181 to 18k: reorganized PDF files; 211 to 21k: index data; 24: index extraction unit; 25: page appending unit; 26: character data changing unit; 27: image data changing unit; 28: merge unit; 29: index embedding unit; 30: index designation data; 31: page appending designation data; 32: character change designation data; 33: image change designation data; 34: merge designation data; 35: index embedding designation data; 361 to 36n: index data; 481 to 48k: merged PDF files.

Claims

An information processing apparatus comprising:
index extraction means for extracting desired index data from a plurality of PDF files stored in a first storage device;
page appending means for appending character or code data representing the index data to a desired position of page expression data included in the plurality of PDF files;
resource data changing means for performing data size reduction processing on resource data included in the plurality of PDF files;
merging means for merging the page expression data included in the plurality of PDF files to which the character or code data has been appended by the page appending means, and generating a merged PDF file that includes the merged page expression data and the resource data whose data size has been reduced by the resource data changing means;
index embedding means for generating a reorganized PDF file in which the index data is embedded at a desired position of the merged PDF file, and storing the reorganized PDF file in a second storage device; and
search means for executing, based on the index data, search processing on the reorganized PDF file stored in the second storage device.

The information processing apparatus according to claim 1,
wherein the resource data changing means performs at least one of: changing character data included in the resource data to a non-embedded format; reducing the coordinate points that represent character shapes in the character data included in the resource data; and lowering the resolution of image data included in the resource data to a predetermined level or below.

The information processing apparatus according to claim 1 or 2, wherein:
the index extraction means extracts, as the index data, at least one of a key character, a character having a predetermined relationship with the key character, and predetermined code data, based on index designation data that is stored in a third storage device and designates a region of the plurality of PDF files from which the index data is to be extracted;
the page appending means appends the character or the code data to the page expression data included in the plurality of PDF files, based on page appending designation data that is stored in a fourth storage device and designates a position at which the character or the code data is to be appended; and
the index embedding means embeds the index data in the merged PDF file, based on index embedding designation data that is stored in a fifth storage device and designates a position at which the index data is to be embedded.

The information processing apparatus according to any one of claims 1 to 3,
wherein the index embedding means places the index data in an area of the reorganized PDF file that is skipped by PDF file viewing software.

The information processing apparatus according to any one of claims 1 to 4, wherein:
the index extraction means converts the plurality of PDF files into images and extracts the index data from the imaged data; and
the page appending means places the character or code data representing the index data at the desired position when the plurality of PDF files are imaged.

A program for causing a computer to function as:
index extraction means for extracting desired index data from a plurality of PDF files stored in a first storage device;
page appending means for appending character or code data representing the index data to a desired position of page expression data included in the plurality of PDF files;
resource data changing means for performing data size reduction processing on resource data included in the plurality of PDF files;
merging means for merging the page expression data included in the plurality of PDF files to which the character or code data has been appended by the page appending means, and generating a merged PDF file that includes the merged page expression data and the resource data whose data size has been reduced by the resource data changing means;
index embedding means for generating a reorganized PDF file in which the index data is embedded at a desired position of the merged PDF file, and storing the reorganized PDF file in a second storage device; and
search means for executing, based on the index data, search processing on the reorganized PDF file stored in the second storage device.