WO2022222547A1 - Method and system for merging PDF files in a large batch - Google Patents

Method and system for merging PDF files in a large batch

Info

Publication number
WO2022222547A1
WO2022222547A1 PCT/CN2022/000057 CN2022000057W WO2022222547A1 WO 2022222547 A1 WO2022222547 A1 WO 2022222547A1 CN 2022000057 W CN2022000057 W CN 2022000057W WO 2022222547 A1 WO2022222547 A1 WO 2022222547A1
Authority
WO
WIPO (PCT)
Prior art keywords
merged
pdf
pdf file
information
page
Prior art date
Application number
PCT/CN2022/000057
Other languages
English (en)
French (fr)
Inventor
梁俊义
Original Assignee
福建福昕软件开发股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福建福昕软件开发股份有限公司
Priority to US18/035,161 (published as US20240005083A1)
Publication of WO2022222547A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/157 Transformation using dictionaries or tables

Definitions

  • The present invention relates to the field of computer technology, specifically to the processing of PDF files on a computer, and more specifically to a method and system for merging PDF files in a large batch.
  • PDF: Portable Document Format.
  • PS: the PostScript language.
  • Figure 1 is a schematic diagram of the structure of a PDF file.
  • A PDF file is usually composed of the following four elements: a file header (header), which identifies the version of the PDF specification that the file conforms to; a body, which contains the objects that make up the document contained in the file; a cross-reference table, which contains information about the indirect objects in the file; and a trailer, which gives the location of the cross-reference table and of certain special objects within the body of the file.
  • The existing PDF merging method first parses the PDF files, then clones (clone being the method by which a Java program copies objects) all the contents of the PDF files to be merged into a newly generated PDF file, and finally saves the newly generated PDF file.
  • While it runs, this merging method must keep the information for the entire merged PDF file in memory, so the program's memory use keeps growing. Especially when the number of PDF files to be merged is very large, the method occupies a large amount of computer memory, takes a long time to merge, executes inefficiently, and also affects the execution of other applications on the computer.
  • To solve the above problems, the present invention provides a method and system for merging PDF files in a large batch. The method obtains from the PDF files to be merged only the position of each object in its file, parses just a few dictionaries, calls a global object number generator, rewrites the object numbers in each PDF file to be merged, and outputs the result to a newly generated PDF file, so that a large batch of PDF files can be merged in a short time using little memory.
  • The invention provides a method for merging PDF files in a large batch, which comprises the following steps:
  • Step 1: determine and output the header information of the merged target PDF file, output the corresponding catalog dictionary information, and generate and record the object number of the corresponding PDF page object;
  • Step 2: parse the multiple PDF files to be merged in sequence, obtain the object numbers and offsets of all indirect objects of each PDF file to be merged, and obtain the catalog dictionary information of each PDF file to be merged;
  • Step 3: from the catalog dictionary information of each PDF file to be merged, parse in sequence the page object dictionary information of that file, and read in sequence the object number information of each page object from all the page object dictionary information;
  • Step 4: call the global object number generator to generate a new object number, and record the correspondence between the original object number information and the new object number in a map;
  • Step 5: call the output class for PDF indirect objects, output the page objects of each PDF file to be merged into the page object of the merged target PDF file, and record their start positions and lengths in the target PDF file;
  • Step 6: check whether all PDF files to be merged have been merged; if not, return to step 2; if so, combine the global information into the merged target PDF file according to the page object dictionary information of the target PDF file.
  • In an embodiment, the information parsed from the catalog dictionary information of each PDF file to be merged in step 3 further includes the interactive form information and bookmark information of that file.
  • In an embodiment, step 5 is specifically:
  • Step 501: store all indirect objects referenced in the page object dictionary information of each PDF file to be merged in a vector;
  • Step 502: output all indirect objects in the vector to the merged target PDF file in a loop; whenever the object being output is the parent dictionary of a page object of the PDF file to be merged, use the page object of the target PDF file in its place and end that output;
  • Step 503: determine whether all indirect objects have been output; if so, organize the page object dictionary information of each PDF file to be merged and record the start positions and lengths, in the merged target PDF file, of all indirect objects in the vector; if not, return to step 3.
  • In an embodiment, when the indirect object that is the parent of a page object of a PDF file to be merged is stored in step 501, it is changed to the page object of the merged target PDF file.
  • In an embodiment, each indirect object in step 502 is output only once.
  • In an embodiment, the global information combined in step 6 includes interactive form information and bookmark information.
  • The present invention also provides a system for merging PDF files in a large batch, which includes:
  • a PDFMerger module, used to manage the merged target PDF file, including the object numbers of all indirect objects output during the PDF merge, the offsets of all indirect objects, and the page object dictionary information of the target PDF file;
  • a MergePDFDocument module, used to manage and parse the PDF files to be merged; the parsed content includes the object numbers and offsets of all indirect objects, the catalog dictionary information of the PDF files to be merged, all page object dictionary information, and the interactive form dictionary information;
  • a MergePDFPage module, used to process all indirect objects in the page object dictionaries that the PDF files to be merged are to output;
  • a PDFObjnumGenerator module, used to generate the object numbers of the indirect objects of the merged target PDF file; it is a globally scoped class module.
  • Compared with the prior art, the method and system provided by the present invention take less time to merge a large batch of PDF files, occupy very little system memory throughout the process, merge more efficiently, and do not affect the use of other applications while the merge runs.
  • FIG. 1 is a schematic diagram of the PDF file structure.
  • FIG. 3 is a system architecture diagram of an embodiment of the present invention.
  • FIG. 4 is a time consumption comparison chart for merging 50 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 5 is a memory consumption comparison chart for merging 50 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 6 is a time consumption comparison chart for merging 200 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 7 is a memory consumption comparison chart for merging 200 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 8 is a time consumption comparison chart for merging 1000 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 9 is a memory consumption comparison chart for merging 1000 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 10 is a time consumption comparison chart for merging 2000 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 11 is a memory consumption comparison chart for merging 2000 PDF documents at a time according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of an embodiment of the present invention. As shown in FIG. 2, this embodiment provides a method for merging PDF files in a large batch, which includes the following steps:
  • Step 1: determine and output the header information of the merged target PDF file, output the corresponding catalog dictionary information, and generate and record the object number (objnum) of the corresponding PDF page object (pages);
  • The catalog dictionary is the root of the PDF document's object hierarchy and is located through the Root entry in the PDF file trailer. It acts as a table of contents and contains references to other objects that define the document's contents, outline, article threads, named destinations, and other attributes. The page object (pages) is a page tree node, the root node of the document's page tree, and an indirect object.
  • Step 2: parse the multiple PDF files to be merged in sequence, obtain the object number (objnum) and offset (offset) of all indirect objects of each PDF file to be merged, and obtain the catalog dictionary information of each PDF file to be merged;
  • Step 3: from the catalog dictionary information of each PDF file to be merged, parse in sequence the page object (page) dictionary information of that file, and read in sequence the object number (objnum) information of each page object (page) from all the page object (page) dictionary information;
  • In this embodiment, the information parsed from the catalog dictionary information of each PDF file to be merged in step 3 further includes the interactive form (AcroForm) information, bookmark information, and similar information of that file.
  • Step 4: call the global object number (objnum) generator to generate a new object number (objnumber), and record the correspondence between the original object number (objnum) information and the new object number (objnumber) in a map;
  • Step 5: call the output class for PDF indirect objects, output the page objects (page) of each PDF file to be merged into the page object (pages) of the merged target PDF file, and record their start positions and lengths in the target PDF file;
  • In this embodiment, step 5 is specifically:
  • Step 501: store all indirect objects referenced in the page object (page) dictionary information of each PDF file to be merged in a vector;
  • Step 502: output all indirect objects in the vector to the merged target PDF file in a loop; whenever the object being output is the parent dictionary of a page object (page) of the PDF file to be merged, use the page object (pages) of the target PDF file in its place and end that output;
  • In this embodiment, every indirect object in step 502 is output only once; during the loop, an indirect object that has already been output is not output again.
  • Step 503: determine whether all indirect objects have been output; if so, organize the page object (page) dictionary information of each PDF file to be merged and record the start positions and lengths, in the merged target PDF file, of all indirect objects in the vector; if not, return to step 3.
  • Step 6: check whether all PDF files to be merged have been merged; if not, return to step 2; if so, combine the global information into the merged target PDF file according to the page object (pages) dictionary information of the target PDF file.
  • In this embodiment, the global information combined in step 6 includes interactive form (AcroForm) information, bookmark information, and similar information.
  • FIG. 3 is a system architecture diagram of an embodiment of the present invention. As shown in FIG. 3, this embodiment provides a system (10) for merging PDF files in a large batch, which implements the method of Embodiment 1 and includes:
  • a PDFMerger module (101), used to manage the merged target PDF file, including the object numbers (objnum) of all indirect objects output during the PDF merge, the offsets (offset) of all indirect objects, and the page object (pages) dictionary information of the target PDF file;
  • a MergePDFDocument module (102), used to manage and parse the PDF files to be merged. In this embodiment, its main role is to parse the PDF files to be merged and obtain the object numbers (objnum) and offsets (offset) of all indirect objects in those files; it also parses the catalog dictionary of each PDF file to be merged to obtain the dictionary information of all page objects (page) and of the interactive form (AcroForm) of that file.
  • a MergePDFPage module (103), used to process all indirect objects in the page object (page) dictionaries that the PDF files to be merged are to output. In this embodiment, these indirect objects are not decompressed during output; they are written directly to the merged target PDF file using the compression they already have in the PDF file to be merged.
  • a PDFObjnumGenerator module (104), used to generate the indirect reference object numbers (objnum) of the merged target PDF file; it is a globally scoped class module. In this embodiment, the new object numbers (objnum) of all objects are generated by this one class module.
  • In this embodiment, a test environment was built according to Embodiment 1 and Embodiment 2, the performance of PDF merging was tested under different conditions, and the results were compared with the performance of Adobe Acrobat 11.0.0.379 merging the same PDF files, as follows:
  • Test environment: Windows 7 Professional 64-bit operating system, 4 GB memory.
  • Total number of PDF files: 8000.
  • Execution method: automated. The test file path, number of files per merge, test machine, and so on are configured; the documents are merged in batches; performance data are collected for each merge and compared with the data for Adobe Acrobat 11.0.0.379.
  • Test 1: performance data for merging 50 documents at a time.
  • FIG. 4 is the time consumption comparison chart and FIG. 5 is the memory consumption comparison chart for merging 50 PDF documents at a time.
  • The abscissa is the number of groups on which the merge was performed. In this embodiment, every 50 PDF documents form a group, and 265 groups were merged in total.
  • The ordinates are elapsed time and memory usage, respectively, as shown in FIG. 4 and FIG. 5.
  • When merging the same 50 PDF documents at a time, the present invention takes 11 seconds on average with an average memory usage of 112 MB, while Adobe Acrobat takes 23 seconds on average with an average memory usage of 142 MB. The average time of Adobe Acrobat is far higher than that of the present invention, and its memory usage is slightly higher.
  • Test 2: performance data for merging 200 documents at a time.
  • FIG. 6 is the time consumption comparison chart and FIG. 7 is the memory consumption comparison chart for merging 200 PDF documents at a time.
  • The abscissa is the number of groups on which the merge was performed. In this embodiment, every 200 PDF documents form a group, and 43 groups were merged in total.
  • The ordinates are elapsed time and memory usage, respectively, as shown in FIG. 6 and FIG. 7. When merging the same 200 PDF documents at a time, the present invention takes 48 seconds on average with an average memory usage of 116 MB, while Adobe Acrobat takes 75 seconds on average with an average memory usage of 189 MB. Both the average time and the memory usage of Adobe Acrobat are higher than those of the present invention.
  • Test 3: performance data for merging 1000 documents at a time.
  • FIG. 8 is the time consumption comparison chart and FIG. 9 is the memory consumption comparison chart for merging 1000 PDF documents at a time.
  • The abscissa is the number of groups on which the merge was performed. In this embodiment, every 1000 PDF documents form a group, and 8 groups were merged in total.
  • The ordinates are elapsed time and memory usage, respectively, as shown in FIG. 8 and FIG. 9.
  • When merging the same 1000 PDF documents at a time, the present invention takes 140 seconds on average with an average memory usage of 124 MB, while Adobe Acrobat takes 291 seconds on average with an average memory usage of 204 MB. Both the average time and the memory usage of Adobe Acrobat are far higher than those of the present invention.
  • Test 4: performance data for merging 2000 documents at a time.
  • FIG. 10 is the time consumption comparison chart and FIG. 11 is the memory consumption comparison chart for merging 2000 PDF documents at a time.
  • The abscissa is the number of groups on which the merge was performed. In this embodiment, every 2000 PDF documents form a group, and 3 groups were merged in total.
  • The ordinates are elapsed time and memory usage, respectively, as shown in FIG. 10 and FIG. 11.
  • When merging the same 2000 PDF documents at a time, the present invention takes 521 seconds on average with an average memory usage of 133 MB, while Adobe Acrobat takes 657 seconds on average with an average memory usage of 244 MB. The average time of Adobe Acrobat is slightly higher than that of the present invention, but its average memory usage is far higher.
  • The present invention performs well in elapsed time when merging different numbers of PDF documents, and its memory usage is relatively stable. In the comparison with Adobe Acrobat's performance data, the present invention is better than Adobe Acrobat in both elapsed time and memory usage.
  • Compared with the prior art, the method and system provided by the present invention take less time to merge a large batch of PDF files, occupy very little system memory throughout the process, merge more efficiently, and do not affect the use of other applications while the merge runs.
  • the modules in the apparatus in the embodiment may be distributed in the apparatus in the embodiment according to the description of the embodiment, and may also be located in one or more apparatuses different from this embodiment with corresponding changes.
  • the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

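The invention discloses a method and system for merging PDF files in a large batch. The method comprises: outputting the header information of a target PDF file, outputting catalog dictionary information, and generating and recording the object number of the PDF page object; parsing the PDF files to be merged in sequence, and obtaining the object numbers and offsets of all indirect objects as well as the catalog dictionary information; parsing, in sequence, the page object dictionary information of the corresponding PDF file to be merged from the catalog dictionary information, and reading the object number information of each page object in sequence; calling a global object number generator to generate new object numbers, and recording the correspondence between the original object number information and the new object numbers in a map; calling the output class for PDF indirect objects to output the page objects of the PDF files to be merged into the page object of the target PDF file, and recording their start positions and lengths in the target PDF file; and checking whether all PDF files to be merged have been merged.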

Description

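Method and system for merging PDF files in a large batch
Technical Field
The present invention relates to the field of computer technology, specifically to the processing of PDF files on a computer, and more specifically to a method and system for merging PDF files in a large batch.
Background Art
PDF (Portable Document Format) is a file format developed by Adobe Systems for exchanging documents in a manner independent of application software, operating system, and hardware. PDF files are based on the imaging model of the PostScript language (PS for short, a page description language and programming language used mainly in the electronics industry and desktop publishing), which guarantees accurate color and accurate printed output on any printer; that is, a PDF faithfully reproduces every character, color, and image of the original. FIG. 1 is a schematic diagram of the PDF file structure. As shown in FIG. 1, a PDF file usually consists of the following four elements: a header, which identifies the version of the PDF specification that the file conforms to; a body, which contains the objects that make up the document contained in the file; a cross-reference table, which contains information about the indirect objects in the file; and a trailer, which gives the location of the cross-reference table and of certain special objects within the body of the file.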
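The four-part layout described above can be illustrated with a minimal sketch. The helper names `build_minimal_pdf` and `read_xref` are hypothetical and do not come from the patent; the parser assumes a classic (uncompressed) cross-reference table with a single subsection, which is only one of the layouts real PDF files use.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF: header, body objects, xref table, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")                 # header
    offsets = []
    for num, body in enumerate(objects, start=1):  # body: indirect objects
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)                            # cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos   # trailer points at xref
    return bytes(out)


def read_xref(pdf: bytes) -> dict[int, int]:
    """Return {object number: byte offset} from a single-section xref table."""
    start = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    lines = pdf[start:].split(b"\n")
    assert lines[0] == b"xref"
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        off, _gen, kind = lines[2 + i].split()
        if kind == b"n":                           # "n" marks an in-use object
            table[first + i] = int(off)
    return table
```

Reading only this table gives the object numbers and offsets of every indirect object without loading any object body, which is the kind of lightweight parsing the rest of the description relies on.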
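When using PDF files, a user may need to merge several of them. The existing PDF merging method first parses the PDF files, then clones (clone being the method by which a Java program copies objects) all the contents of the PDF files to be merged into a newly generated PDF file, and finally saves the newly generated PDF file. While it runs, this merging method must keep the information for the entire merged PDF file in memory, so the program's memory use keeps growing. Especially when the number of PDF files to be merged is very large, the method occupies a large amount of computer memory, takes a long time to merge, executes inefficiently, and also affects the execution of other applications on the computer.
Summary of the Invention
To solve the above problems, the present invention provides a method and system for merging PDF files in a large batch. The method obtains from the PDF files to be merged only the position of each object in its file, parses just a few dictionaries, calls a global object number generator, rewrites the object numbers in each PDF file to be merged, and outputs the result to a newly generated PDF file, so that a large batch of PDF files can be merged in a short time using little memory.
To achieve the above object, the present invention provides a method for merging PDF files in a large batch, which comprises the following steps:
Step 1: determine and output the header information of the merged target PDF file, output the corresponding catalog dictionary information, and generate and record the object number of the corresponding PDF page object;
Step 2: parse the multiple PDF files to be merged in sequence, obtain the object numbers and offsets of all indirect objects of each PDF file to be merged, and obtain the catalog dictionary information of each PDF file to be merged;
Step 3: from the catalog dictionary information of each PDF file to be merged, parse in sequence the page object dictionary information of that file, and read in sequence the object number information of each page object from all the page object dictionary information;
Step 4: call the global object number generator to generate a new object number, and record the correspondence between the original object number information and the new object number in a map;
Step 5: call the output class for PDF indirect objects, output the page objects of each PDF file to be merged into the page object of the merged target PDF file, and record their start positions and lengths in the target PDF file;
Step 6: check whether all PDF files to be merged have been merged;
if not, return to step 2;
if so, combine the global information into the merged target PDF file according to the page object dictionary information of the target PDF file.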
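The output side of steps 1, 5, and 6 can be sketched as an append-only writer that keeps only object numbers and offsets in memory. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `TargetWriter` is hypothetical, object numbers are assumed to be contiguous from 1, and writing the page tree node last (once all page references are known) is an inference from the step order rather than something the text spells out.

```python
import io


class TargetWriter:
    """Hypothetical append-only writer for the merged target file."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = {}                     # object number -> byte offset
        self.stream.write(b"%PDF-1.7\n")      # step 1: header

    def write_object(self, objnum: int, body: bytes) -> tuple[int, int]:
        """Append one indirect object and return its (start, length)."""
        start = self.stream.tell()
        data = b"%d 0 obj\n%s\nendobj\n" % (objnum, body)
        self.stream.write(data)
        self.offsets[objnum] = start          # only bookkeeping stays in memory
        return start, len(data)

    def finish(self, root_objnum: int) -> None:
        """Write the xref table and trailer once every file has been merged."""
        xref_pos = self.stream.tell()
        size = max(self.offsets) + 1          # assumes numbers 1..size-1 all exist
        self.stream.write(b"xref\n0 %d\n" % size)
        self.stream.write(b"0000000000 65535 f \n")
        for num in range(1, size):
            self.stream.write(b"%010d 00000 n \n" % self.offsets[num])
        self.stream.write(
            b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_objnum)
        )
        self.stream.write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)


# Usage: the Pages node (object 2) is reserved first but written last.
buf = io.BytesIO()
writer = TargetWriter(buf)
writer.write_object(1, b"<< /Type /Catalog /Pages 2 0 R >>")
start, length = writer.write_object(3, b"<< /Type /Page /Parent 2 0 R >>")
writer.write_object(2, b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
writer.finish(root_objnum=1)
```

Because each object is flushed to the stream as soon as it is produced, the writer's memory footprint grows with the number of objects, not with their size.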
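In an embodiment of the present invention, the information parsed from the catalog dictionary information of each PDF file to be merged in step 3 further includes the interactive form information and bookmark information of that file.
In an embodiment of the present invention, step 5 is specifically:
Step 501: store all indirect objects referenced in the page object dictionary information of each PDF file to be merged in a vector;
Step 502: output all indirect objects in the vector to the merged target PDF file in a loop; whenever the object being output is the parent dictionary of a page object of the PDF file to be merged, use the page object of the target PDF file in its place and end that output;
Step 503: determine whether all indirect objects have been output;
if so, organize the page object dictionary information of each PDF file to be merged, and record the start positions and lengths, in the merged target PDF file, of all indirect objects in the vector;
if not, return to step 3.
In an embodiment of the present invention, when the indirect object that is the parent of a page object of a PDF file to be merged is stored in step 501, it is changed to the page object of the merged target PDF file.
In an embodiment of the present invention, each indirect object in step 502 is output only once.
In an embodiment of the present invention, the global information combined in step 6 includes interactive form information and bookmark information.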
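Steps 501 to 503 can be sketched as a worklist that gathers the objects reachable from a page, followed by an emit loop that writes each object once and redirects the page's parent to the target's page tree node. This is a simplified model with hypothetical names: source objects are held as a `dict` of object number to dictionary bytes, references are found with a naive regular expression that only handles generation 0 and would not be safe on stream contents, and renumbering is left out.

```python
import re

REF = re.compile(rb"(\d+) 0 R\b")
PARENT = re.compile(rb"/Parent\s+(\d+) 0 R\b")


def collect_page_objects(objects: dict[int, bytes], page_num: int) -> list[int]:
    """Step 501 (sketch): gather every indirect object reachable from a page,
    skipping /Parent, which the target's page tree node will replace."""
    vector, seen = [], set()
    pending = [page_num]
    while pending:
        num = pending.pop()
        if num in seen:
            continue
        seen.add(num)
        vector.append(num)
        body = objects[num]
        parent = PARENT.search(body)
        skip = int(parent.group(1)) if parent else None
        for match in REF.finditer(body):
            ref = int(match.group(1))
            if ref != skip:
                pending.append(ref)
    return vector


def emit_once(vector, objects, already_output: set[int], target_pages: int):
    """Steps 502-503 (sketch): yield each object once, /Parent redirected."""
    for num in vector:
        if num in already_output:
            continue                      # e.g. a font shared by several pages
        already_output.add(num)
        body = PARENT.sub(b"/Parent %d 0 R" % target_pages, objects[num])
        yield num, body
```

Sharing one `already_output` set across all pages of a source file is what keeps a resource referenced by many pages from being written more than once.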
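To achieve the above object, the present invention further provides a system for merging PDF files in a large batch, which comprises:
a PDFMerger module, configured to manage the merged target PDF file, including the object numbers of all indirect objects output during the PDF merge, the offsets of all indirect objects, and the page object dictionary information of the target PDF file;
a MergePDFDocument module, configured to manage and parse the PDF files to be merged, the parsed content including the object numbers and offsets of all indirect objects, the catalog dictionary information of the PDF files to be merged, all page object dictionary information, and the interactive form dictionary information;
a MergePDFPage module, configured to process all indirect objects in the page object dictionaries that the PDF files to be merged are to output;
a PDFObjnumGenerator module, configured to generate the object numbers of the indirect objects of the merged target PDF file, which is a globally scoped class module.
Compared with the prior art, the method and system provided by the present invention take less time to merge a large batch of PDF files, occupy very little system memory throughout the process, merge more efficiently, and do not affect the use of other applications while the merge runs.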
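Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the PDF file structure;
FIG. 2 is a flowchart of an embodiment of the present invention;
FIG. 3 is a system architecture diagram of an embodiment of the present invention;
FIG. 4 is a time consumption comparison chart for merging 50 PDF documents at a time according to an embodiment of the present invention;
FIG. 5 is a memory consumption comparison chart for merging 50 PDF documents at a time according to an embodiment of the present invention;
FIG. 6 is a time consumption comparison chart for merging 200 PDF documents at a time according to an embodiment of the present invention;
FIG. 7 is a memory consumption comparison chart for merging 200 PDF documents at a time according to an embodiment of the present invention;
FIG. 8 is a time consumption comparison chart for merging 1000 PDF documents at a time according to an embodiment of the present invention;
FIG. 9 is a memory consumption comparison chart for merging 1000 PDF documents at a time according to an embodiment of the present invention;
FIG. 10 is a time consumption comparison chart for merging 2000 PDF documents at a time according to an embodiment of the present invention;
FIG. 11 is a memory consumption comparison chart for merging 2000 PDF documents at a time according to an embodiment of the present invention.
Reference numerals: 10 - system for merging PDF files in a large batch; 101 - PDFMerger module; 102 - MergePDFDocument module; 103 - MergePDFPage module; 104 - PDFObjnumGenerator module.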
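Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment 1
FIG. 2 is a flowchart of an embodiment of the present invention. As shown in FIG. 2, this embodiment provides a method for merging PDF files in a large batch, which comprises the following steps:
Step 1: determine and output the header information of the merged target PDF file, output the corresponding catalog dictionary information, and generate and record the object number (objnum) of the corresponding PDF page object (pages);
Here, the catalog dictionary is the root of the PDF document's object hierarchy and is located through the Root entry in the PDF file trailer. It acts as a table of contents and contains references to other objects that define the document's contents, outline, article threads, named destinations, and other attributes. The page object (pages) is a page tree node, the root node of the document's page tree, and an indirect object.
Step 2: parse the multiple PDF files to be merged in sequence, obtain the object number (objnum) and offset (offset) of all indirect objects of each PDF file to be merged, and obtain the catalog dictionary information of each PDF file to be merged;
Step 3: from the catalog dictionary information of each PDF file to be merged, parse in sequence the page object (page) dictionary information of that file, and read in sequence the object number (objnum) information of each page object (page) from all the page object (page) dictionary information;
In this embodiment, the information parsed from the catalog dictionary information of each PDF file to be merged in step 3 further includes the interactive form (AcroForm) information, bookmark information, and similar information of that file.
Step 4: call the global object number (objnum) generator to generate a new object number (objnumber), and record the correspondence between the original object number (objnum) information and the new object number (objnumber) in a map;
Step 5: call the output class for PDF indirect objects, output the page objects (page) of each PDF file to be merged into the page object (pages) of the merged target PDF file, and record their start positions and lengths in the target PDF file;
In this embodiment, step 5 is specifically:
Step 501: store all indirect objects referenced in the page object (page) dictionary information of each PDF file to be merged in a vector;
In this embodiment, when the indirect object that is the parent of a page object (page) of a PDF file to be merged is stored in step 501, it is changed to the page object (pages) of the merged target PDF file.
Step 502: output all indirect objects in the vector to the merged target PDF file in a loop; whenever the object being output is the parent dictionary of a page object (page) of the PDF file to be merged, use the page object (pages) of the target PDF file in its place and end that output;
In this embodiment, every indirect object in step 502 is output only once; during the loop, an indirect object that has already been output is not output again.
Step 503: determine whether all indirect objects have been output;
if so, organize the page object (page) dictionary information of each PDF file to be merged, and record the start positions and lengths, in the merged target PDF file, of all indirect objects in the vector;
if not, return to step 3.
Step 6: check whether all PDF files to be merged have been merged;
if not, return to step 2;
if so, combine the global information into the merged target PDF file according to the page object (pages) dictionary information of the target PDF file.
In this embodiment, the global information combined in step 6 includes interactive form (AcroForm) information, bookmark information, and similar information.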
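Step 4 of this embodiment pairs a single global number generator with a per-file map from old to new object numbers. The sketch below shows one way that could look; `ObjnumGenerator` and `renumber` are hypothetical names, references are rewritten with a naive regular expression that assumes generation 0, and starting the counter at 3 assumes numbers 1 and 2 were already reserved for the target's catalog and page tree node.

```python
import itertools
import re


class ObjnumGenerator:
    """Single counter shared by every source file (sketch)."""

    def __init__(self, first: int = 1):
        self._counter = itertools.count(first)

    def allocate(self) -> int:
        return next(self._counter)


REF = re.compile(rb"(\d+) 0 R\b")


def renumber(body: bytes, mapping: dict[int, int], gen: ObjnumGenerator) -> bytes:
    """Rewrite each 'N 0 R' reference through the old -> new map,
    allocating a new number the first time an old number is seen."""

    def swap(match):
        old = int(match.group(1))
        if old not in mapping:
            mapping[old] = gen.allocate()
        return b"%d 0 R" % mapping[old]

    return REF.sub(swap, body)


# Usage: one generator for the whole merge, one map per source file.
gen = ObjnumGenerator(first=3)
map_a, map_b = {}, {}
page_a = renumber(b"<< /Contents 4 0 R /F 5 0 R >>", map_a, gen)
page_b = renumber(b"<< /Contents 4 0 R >>", map_b, gen)
```

Both source files use object number 4, but because the generator is global and the maps are per file, the two references end up pointing at different objects in the target.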
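Embodiment 2
FIG. 3 is a system architecture diagram of an embodiment of the present invention. As shown in FIG. 3, this embodiment provides a system (10) for merging PDF files in a large batch, which implements the method of Embodiment 1 and comprises:
a PDFMerger module (101), configured to manage the merged target PDF file, including the object numbers (objnum) of all indirect objects output during the PDF merge, the offsets (offset) of all indirect objects, and the page object (pages) dictionary information of the target PDF file;
a MergePDFDocument module (102), configured to manage and parse the PDF files to be merged. In this embodiment, the main role of the MergePDFDocument module (102) is to parse the PDF files to be merged and obtain the object numbers (objnum) and offsets (offset) of all indirect objects in those files; it also parses the catalog dictionary of each PDF file to be merged to obtain the dictionary information of all page objects (page) and of the interactive form (AcroForm) of that file.
a MergePDFPage module (103), configured to process all indirect objects in the page object (page) dictionaries that the PDF files to be merged are to output. In this embodiment, the indirect objects in the page object (page) dictionary are not decompressed during output; they are written directly to the merged target PDF file using the compression they already have in the PDF file to be merged.
a PDFObjnumGenerator module (104), configured to generate the indirect reference object numbers (objnum) of the merged target PDF file; the PDFObjnumGenerator module (104) is a globally scoped class module. In this embodiment, the new object numbers (objnum) of all objects are generated by this one class module.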
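The pass-through behavior described for the MergePDFPage module (103) can be sketched as follows: the dictionary part of a stream object may be edited, while the compressed payload is copied byte for byte. The function names are hypothetical, and splitting on the `stream` and `endstream` keywords is a simplification; a real parser would use the `/Length` entry.

```python
import zlib


def split_stream(raw: bytes) -> tuple[bytes, bytes]:
    """Split an object body into (dictionary part, raw stream payload)."""
    head, sep, rest = raw.partition(b"stream\n")
    if not sep:
        return raw, b""                   # not a stream object
    payload, _, _ = rest.rpartition(b"\nendstream")
    return head, payload


def rebuild(new_objnum: int, head: bytes, payload: bytes) -> bytes:
    """Emit the object under its new number; the payload stays compressed."""
    return (
        b"%d 0 obj\n" % new_objnum
        + head
        + b"stream\n"
        + payload
        + b"\nendstream\nendobj\n"
    )


# Usage: a Flate-compressed content stream is copied without inflating it.
content = b"BT /F1 12 Tf (Hello) Tj ET"
packed = zlib.compress(content)
raw = b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed) + packed + b"\nendstream"
head, payload = split_stream(raw)
merged_object = rebuild(7, head, payload)
```

Skipping the inflate and deflate round trip avoids both the CPU cost of recompression and the memory needed to hold the uncompressed data.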
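Embodiment 3
In this embodiment, a test environment was built according to Embodiment 1 and Embodiment 2, the performance of PDF merging was tested under different conditions, and the results were compared with the performance of Adobe Acrobat 11.0.0.379 merging the same PDF files, as follows:
Test environment: Windows 7 Professional 64-bit operating system, 4 GB memory;
Total number of PDF files: 8000;
Execution method: automated. The test file path, number of files per merge, test machine, and so on are configured; the documents are merged in batches; performance data are collected for each merge and compared with the data for Adobe Acrobat 11.0.0.379.
Test 1: performance data for merging 50 documents at a time
FIG. 4 is the time consumption comparison chart and FIG. 5 is the memory consumption comparison chart for merging 50 PDF documents at a time according to an embodiment of the present invention. The abscissa of FIG. 4 and FIG. 5 is the number of groups on which the merge was performed; in this embodiment, every 50 PDF documents form a group, and 265 groups were merged in total. The ordinates are elapsed time and memory usage, respectively. As shown in FIG. 4 and FIG. 5, when merging the same 50 PDF documents at a time, the present invention takes 11 seconds on average with an average memory usage of 112 MB, while Adobe Acrobat takes 23 seconds on average with an average memory usage of 142 MB. The average time of Adobe Acrobat is far higher than that of the present invention, and its memory usage is slightly higher.
Test 2: performance data for merging 200 documents at a time
FIG. 6 is the time consumption comparison chart and FIG. 7 is the memory consumption comparison chart for merging 200 PDF documents at a time according to an embodiment of the present invention. The abscissa of FIG. 6 and FIG. 7 is the number of groups on which the merge was performed; in this embodiment, every 200 PDF documents form a group, and 43 groups were merged in total. The ordinates are elapsed time and memory usage, respectively. As shown in FIG. 6 and FIG. 7, when merging the same 200 PDF documents at a time, the present invention takes 48 seconds on average with an average memory usage of 116 MB, while Adobe Acrobat takes 75 seconds on average with an average memory usage of 189 MB. Both the average time and the memory usage of Adobe Acrobat are higher than those of the present invention.
Test 3: performance data for merging 1000 documents at a time
FIG. 8 is the time consumption comparison chart and FIG. 9 is the memory consumption comparison chart for merging 1000 PDF documents at a time according to an embodiment of the present invention. The abscissa of FIG. 8 and FIG. 9 is the number of groups on which the merge was performed; in this embodiment, every 1000 PDF documents form a group, and 8 groups were merged in total. The ordinates are elapsed time and memory usage, respectively. As shown in FIG. 8 and FIG. 9, when merging the same 1000 PDF documents at a time, the present invention takes 140 seconds on average with an average memory usage of 124 MB, while Adobe Acrobat takes 291 seconds on average with an average memory usage of 204 MB. Both the average time and the memory usage of Adobe Acrobat are far higher than those of the present invention.
Test 4: performance data for merging 2000 documents at a time
FIG. 10 is the time consumption comparison chart and FIG. 11 is the memory consumption comparison chart for merging 2000 PDF documents at a time according to an embodiment of the present invention. The abscissa of FIG. 10 and FIG. 11 is the number of groups on which the merge was performed; in this embodiment, every 2000 PDF documents form a group, and 3 groups were merged in total. The ordinates are elapsed time and memory usage, respectively. As shown in FIG. 10 and FIG. 11, when merging the same 2000 PDF documents at a time, the present invention takes 521 seconds on average with an average memory usage of 133 MB, while Adobe Acrobat takes 657 seconds on average with an average memory usage of 244 MB. The average time of Adobe Acrobat is slightly higher than that of the present invention, but its average memory usage is far higher.
It can thus be seen that the present invention performs well in elapsed time when merging different numbers of PDF documents, and its memory usage is relatively stable. In the comparison with Adobe Acrobat's performance data, the present invention is better than Adobe Acrobat in both elapsed time and memory usage.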
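The averages reported in Tests 1 to 4 can be put side by side with a few lines of arithmetic. The sketch below uses only the figures quoted above; the names `RESULTS` and `ratios` are illustrative, and the ratios are derived values, not additional measurements.

```python
# (files per merge, invention sec, invention MB, Acrobat sec, Acrobat MB),
# copied from the averages reported for Tests 1-4.
RESULTS = [
    (50, 11, 112, 23, 142),
    (200, 48, 116, 75, 189),
    (1000, 140, 124, 291, 204),
    (2000, 521, 133, 657, 244),
]


def ratios(rows):
    """Return (files per merge, time ratio, memory ratio), Acrobat / invention."""
    return [
        (n, round(a_sec / o_sec, 2), round(a_mb / o_mb, 2))
        for n, o_sec, o_mb, a_sec, a_mb in rows
    ]


for n, time_ratio, mem_ratio in ratios(RESULTS):
    print(f"{n:>5} files: time x{time_ratio}, memory x{mem_ratio}")
```

Read this way, the reported memory advantage widens as the batch grows, while the time advantage is smallest in the 2000-document test, which matches the wording of the individual test summaries.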
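Compared with the prior art, the method and system provided by the present invention take less time to merge a large batch of PDF files, occupy very little system memory throughout the process, merge more efficiently, and do not affect the use of other applications while the merge runs.
A person of ordinary skill in the art will understand that the drawings are only schematic diagrams of one embodiment, and that the modules or flows in the drawings are not necessarily required to implement the present invention.
A person of ordinary skill in the art will understand that the modules of the apparatus in an embodiment may be distributed in the apparatus as described in the embodiment, or may, with corresponding changes, be located in one or more apparatuses different from that of the embodiment. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.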

Claims (7)

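  1. A method for merging PDF files in a large batch, characterized by comprising the following steps:
    Step 1: determining and outputting the header information of the merged target PDF file, outputting the corresponding catalog dictionary information, and generating and recording the object number of the corresponding PDF page object;
    Step 2: parsing multiple PDF files to be merged in sequence, obtaining the object numbers and offsets of all indirect objects of each PDF file to be merged, and obtaining the catalog dictionary information of each PDF file to be merged;
    Step 3: parsing in sequence, from the catalog dictionary information of each PDF file to be merged, the page object dictionary information of that file, and reading in sequence the object number information of each page object from all the page object dictionary information;
    Step 4: calling a global object number generator to generate a new object number, and recording the correspondence between the original object number information and the new object number in a map;
    Step 5: calling the output class for PDF indirect objects, outputting the page objects of each PDF file to be merged into the page object of the merged target PDF file, and recording their start positions and lengths in the target PDF file;
    Step 6: checking whether all PDF files to be merged have been merged,
    if not, returning to step 2;
    if so, combining the global information into the merged target PDF file according to the page object dictionary information of the target PDF file.
  2. The method according to claim 1, characterized in that the information parsed from the catalog dictionary information of each PDF file to be merged in step 3 further includes the interactive form information and bookmark information of that file.
  3. The method according to claim 1, characterized in that step 5 is specifically:
    Step 501: storing all indirect objects referenced in the page object dictionary information of each PDF file to be merged in a vector;
    Step 502: outputting all indirect objects in the vector to the merged target PDF file in a loop, and, whenever the object being output is the parent dictionary of a page object of the PDF file to be merged, using the page object of the target PDF file in its place and ending that output;
    Step 503: determining whether all indirect objects have been output,
    if so, organizing the page object dictionary information of each PDF file to be merged, and recording the start positions and lengths, in the merged target PDF file, of all indirect objects in the vector;
    if not, returning to step 3.
  4. The method according to claim 3, characterized in that, when the indirect object that is the parent of a page object of a PDF file to be merged is stored in step 501, it is changed to the page object of the merged target PDF file.
  5. The method according to claim 3, characterized in that each indirect object in step 502 is output only once.
  6. The method according to claim 1, characterized in that the global information combined in step 6 includes interactive form information and bookmark information.
  7. A system for merging PDF files in a large batch, for implementing the method of any one of claims 1 to 6, characterized by comprising:
    a PDFMerger module, configured to manage the merged target PDF file, including the object numbers of all indirect objects output during the PDF merge, the offsets of all indirect objects, and the page object dictionary information of the target PDF file;
    a MergePDFDocument module, configured to manage and parse the PDF files to be merged, the parsed content including the object numbers and offsets of all indirect objects, the catalog dictionary information of the PDF files to be merged, all page object dictionary information, and the interactive form dictionary information;
    a MergePDFPage module, configured to process all indirect objects in the page object dictionaries that the PDF files to be merged are to output;
    a PDFObjnumGenerator module, configured to generate the object numbers of the indirect objects of the merged target PDF file, which is a globally scoped class module.
PCT/CN2022/000057 2021-04-19 2022-03-30 Method and system for merging PDF files in a large batch WO2022222547A1 (zh)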

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/035,161 US20240005083A1 (en) 2021-04-19 2022-03-30 Method and system for merging pdf files in a large batch

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110419112.0 2021-04-19
CN202110419112.0A CN113128175B (zh) 2021-04-19 2021-04-19 Method and system for merging PDF files in a large batch

Publications (1)

Publication Number Publication Date
WO2022222547A1 (zh) 2022-10-27

Family

ID=76778096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/000057 WO2022222547A1 (zh) 2021-04-19 2022-03-30 Method and system for merging PDF files in a large batch

Country Status (3)

Country Link
US (1) US20240005083A1 (zh)
CN (1) CN113128175B (zh)
WO (1) WO2022222547A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128175B (zh) * 2021-04-19 2023-01-24 福建福昕软件开发股份有限公司 Method and system for merging PDF files in a large batch

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330073B1 (en) * 1998-07-20 2001-12-11 Nw Coughlin System and method for merging multi-platform documents
US20040061719A1 (en) * 2002-09-26 2004-04-01 International Business Machines Corporation Consolidation of computer documentation
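CN102508880A (zh) * 2011-10-18 2012-06-20 广东威创视讯科技股份有限公司 File merging method and file splitting method
CN103645974A (zh) * 2013-12-31 2014-03-19 厦门市美亚柏科信息股份有限公司 PDF file recovery method and device
CN105302550A (zh) * 2015-10-12 2016-02-03 江苏中威科技软件系统有限公司 Method and system for converting a page into a fixed-layout data stream file
CN106911743A (zh) * 2015-12-23 2017-06-30 中兴通讯股份有限公司 Write aggregation and read aggregation methods for small files, system, and client
CN113128175A (zh) * 2021-04-19 2021-07-16 福建福昕软件开发股份有限公司 Method and system for merging PDF files in a large batch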

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7020837B1 (en) * 2000-11-29 2006-03-28 Todd Kueny Method for the efficient compression of graphic content in composite PDF files
US20020095443A1 (en) * 2001-01-17 2002-07-18 The Beacon Journal Publishing Company Method for automated generation of interactive enhanced electronic newspaper
US8310713B2 (en) * 2002-12-24 2012-11-13 Konica Minolta Business Technologies, Inc. Image forming device, image forming program, computer readable recording medium on which the program is recorded, and image forming method
JP2008072671A (ja) * 2006-09-15 2008-03-27 Ricoh Co Ltd Image processing apparatus, pattern image synthesis method, and pattern image synthesis program
CN102541905B (zh) * 2010-12-15 2015-11-25 北大方正集团有限公司 Attribute processing method and device for PDF files
US9785637B2 (en) * 2014-03-18 2017-10-10 Google Inc. System and method for computing, applying, and displaying document deltas
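CN107590366B (zh) * 2016-07-06 2019-11-15 福建福昕软件开发股份有限公司 Method for page-by-page protection of PDF documents
CN109492199B (zh) * 2018-10-17 2023-04-28 四川译讯信息科技有限公司 PDF file conversion method based on OCR pre-judgment
CN109948123B (zh) * 2018-11-27 2023-06-02 创新先进技术有限公司 Image merging method and device
CN109697281A (zh) * 2018-12-17 2019-04-30 万兴科技股份有限公司 Method, device, and electronic equipment for merging documents online
CN111753500B (zh) * 2020-07-07 2021-05-04 江苏中威科技软件系统有限公司 Method for merged display of formatted electronic forms with OFD and for directory generation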

Also Published As

Publication number Publication date
CN113128175A (zh) 2021-07-16
CN113128175B (zh) 2023-01-24
US20240005083A1 (en) 2024-01-04

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22790657; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 18035161; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 22790657; Country of ref document: EP; Kind code of ref document: A1)